
Why an MX Audit Pays for Itself

The machines have already arrived

Adobe agreed this month to pay $1.9bn for Semrush, a search- and brand-visibility analytics company whose dashboards track how AI agents read your brand. Cloudflare started blocking unrecognised AI crawlers at the edge by default. ChatGPT and Perplexity now answer the queries that used to land users on your homepage; the user never arrives, and you never know what was said about you. Anthropic's Claude can browse the web and quote your pricing in support conversations with prospects you have never heard of.

Most published websites were not built with this in mind. They were built for human visitors. The structure that helps a human skim, the visual hierarchy that signals importance, the carousel that catches the eye, the on-hover detail that expands to reveal meaning: none of this is visible to a machine. The machine reads the HTML source. If the meaning is in the picture and not the markup, the machine guesses. If the price is in three places and they disagree, the machine picks one and moves on. If the policy is in a PDF without a structure tree, the machine reconstructs the table by vision and quotes whatever its reconstruction produced.

The cost is invisible at first. A search summary you cannot see attributes a competitor's pricing to your brand. A booking agent you have never met tells a customer your office is closed when it is open. A research assistant cites a year-old PDF as your current position because the current PDF carries no canonical-URL declaration that would let it follow the version chain. None of these failures show up in your analytics. None of them generate a support ticket. They surface in conversations the publisher is no longer in the room for. That is the bit that bothers me most: you cannot fix what you cannot see.

What the audit actually checks

An MX audit reads a published site the way an AI agent reads it: through the served HTML, the structured data, the discovery files, and the metadata that travels with the document. It reports, page by page, exactly where the machine-readable signal is missing or contradicted. It is not a generic accessibility checklist or a list of SEO best practices. It is a per-page list of defects, each carrying a severity an engineering team can prioritise.

The full check covers the three perspectives a publisher needs to satisfy at once: the human-experience layer (UX, accessibility, performance), the structural layer (the HTML the agent actually reads), and the MX appropriateness layer (the governance metadata that tells the agent what it may do with the content). Each perspective scores the page on dimensions that compose into an overall agent-readiness score, and each finding carries a verification command and a captured output so the engineering team can reproduce the failure on demand.

What the audit reports on, in detail:

  • Served HTML quality, and the gap between served and rendered DOM: what an agent without a JavaScript runtime sees versus what a browser sees.
  • Structured-data coverage and consistency: Schema.org JSON-LD presence, and contradictions between Schema and on-page text.
  • MX governance fields: status, audience, content-policy, license, content-state, and the discovery and lifecycle fields covered in the recent core proposal.
  • Discovery files (sitemap.xml, robots.txt, llms.txt, agent-card.json), and whether the URLs they declare actually return 200 from the live host (a sketch of this check follows the list).
  • Agent access by user-agent: which AI clients are blocked at the edge and what they receive instead.
  • Content consistency across pages: entity-level cross-references where the same product, person, or policy appears on multiple URLs.
  • PDF accessibility under ISO 14289-1, the EAA baseline.
  • The "Div Soup" signal, which detects pages where the visible structure is a flat sequence of unlabelled containers that no agent can navigate semantically.
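As a concrete instance of the discovery-file check, here is a minimal sketch. The host, the user-agent string, and the agent-card path are my assumptions, not the audit tool's configuration; it fetches the standard discovery paths and then verifies that every URL the sitemap declares still returns 200 from the live host.

    import urllib.error
    import urllib.request
    import xml.etree.ElementTree as ET

    SITE = "https://example.com"  # hypothetical host

    def status_of(url):
        """Return (HTTP status, body) as a plain GET-based agent sees it."""
        req = urllib.request.Request(
            url, headers={"User-Agent": "mx-audit-sketch/0.1"})
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status, resp.read()
        except urllib.error.HTTPError as err:  # non-2xx still has a code
            return err.code, b""
        except urllib.error.URLError:  # DNS failure, refusal, timeout
            return None, b""

    # 1. Are the discovery files present at all?
    for path in ("/robots.txt", "/sitemap.xml", "/llms.txt",
                 "/.well-known/agent-card.json"):
        code, _ = status_of(SITE + path)
        print(f"{path}: {code}")

    # 2. Does every URL the sitemap declares actually resolve?
    code, body = status_of(SITE + "/sitemap.xml")
    if code == 200:
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        for loc in ET.fromstring(body).findall(".//sm:loc", ns):
            url = loc.text.strip()
            page_code, _ = status_of(url)
            if page_code != 200:
                print(f"dead sitemap entry: {url} -> {page_code}")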

The output is a single PDF report with the prioritised defects, plus the raw machine-readable data (CSVs and JSONs) so an engineering team can ingest the findings into its own tooling.

Three failure modes I find on every site

Every audit I run surfaces variants of the same three patterns. I have never run a clean one.

The first is visible-but-invisible content. Information that is present on the rendered page but absent from the served HTML. Pricing tables drawn with CSS Grid where each cell is a styled <div> with no semantic role. Product specifications inside accordion components that load on click via JavaScript. Hero images carrying critical text in pixels rather than markup. Each of these renders correctly to a human but is invisible to an agent without a full browser runtime, which most agents do not run because the cost-per-read is prohibitive at scale. The agent reads what your engineers shipped, not what your designers see.
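A minimal test for this failure mode needs no browser at all: extract the text from the served HTML and check whether the facts a human can read on the rendered page appear in it. The URL and the expected strings below are hypothetical, and the extractor is a deliberate simplification of what a full audit does.

    import urllib.request
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text content, skipping script and style bodies."""
        def __init__(self):
            super().__init__()
            self.chunks = []
            self._skip = 0

        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1

        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1

        def handle_data(self, data):
            if not self._skip:
                self.chunks.append(data)

    def served_text(url):
        req = urllib.request.Request(
            url, headers={"User-Agent": "mx-audit-sketch/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        parser = TextExtractor()
        parser.feed(html)
        return " ".join(" ".join(parser.chunks).split())

    text = served_text("https://example.com/product/widget")  # hypothetical
    for fact in ("£19.99", "30-day returns", "Dispatch within 24 hours"):
        print(fact, "->", "present" if fact in text else "MISSING from served HTML")

Anything the rendered page shows that this check reports as missing lives only in JavaScript, CSS, or pixels, which is exactly the content most agents never see.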

The second is contradictory truth. Structured data that disagrees with on-page text. JSON-LD declares the product is in stock; the visible body says "temporarily unavailable". Schema.org reviews show 4.8 stars; the visible reviews show 3.2. The price field has £19.99; the cart shows £24.99 plus VAT. An agent reading both layers picks one (usually the structured data, because it is easier to parse) and acts on it. The publisher then has a customer who was promised something the publisher's own visible content contradicts. The fix is mechanical: match the two layers. The audit finds where they disagree.
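A sketch makes the mechanical nature of the check clear. Assuming a Schema.org Product/Offer shape, a sterling price format, and a hypothetical URL, this pulls every JSON-LD block from a page and flags any declared price the page never shows:

    import json
    import re
    import urllib.request

    def price_contradictions(url):
        req = urllib.request.Request(
            url, headers={"User-Agent": "mx-audit-sketch/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")

        # Every JSON-LD block an agent would parse.
        blocks = re.findall(
            r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
            html, re.S | re.I)

        # Every price-shaped string anywhere in the page (crude on purpose).
        prices_on_page = set(re.findall(r"£\d+(?:\.\d{2})?", html))

        findings = []
        for raw in blocks:
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                findings.append("unparseable JSON-LD block")  # itself a defect
                continue
            for node in data if isinstance(data, list) else [data]:
                if not isinstance(node, dict):
                    continue
                offers = node.get("offers") or {}
                for offer in offers if isinstance(offers, list) else [offers]:
                    declared = offer.get("price") if isinstance(offer, dict) else None
                    if declared and f"£{declared}" not in prices_on_page:
                        findings.append(
                            f"JSON-LD declares £{declared}, which the visible "
                            f"page never shows (found: {sorted(prices_on_page)})")
        return findings

    print(price_contradictions("https://example.com/product/widget"))  # hypothetical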

The third is missing governance. The page carries content but no machine-readable instructions about what may be done with it. No license declaration, so an agent assumes the most permissive interpretation. No canonical-URL claim, so an outdated copy that has reached the agent's cache is quoted as if it were current. No content-policy declaration, so the agent extracts and summarises freely. No training-data policy, so the artefact ends up in the next training corpus despite the publisher's preference. None of these are visible failures; all of them are leakage points where the publisher's terms are silently overridden by the consumer's defaults.
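A first pass at this check is plain pattern-matching on the served HTML. In the sketch below, <link rel="canonical"> and <link rel="license"> are standard HTML; the mx:* meta names are placeholders for whatever field names the publisher's MX profile actually defines, not an API the framework is known to use.

    import re
    import urllib.request

    GOVERNANCE_SIGNALS = {
        "canonical URL": r'<link[^>]+rel=["\']canonical["\']',
        "license link": r'<link[^>]+rel=["\']license["\']',
        "content-policy (assumed field name)":
            r'<meta[^>]+name=["\']mx:content-policy["\']',
        "content-state (assumed field name)":
            r'<meta[^>]+name=["\']mx:content-state["\']',
    }

    def governance_gaps(url):
        req = urllib.request.Request(
            url, headers={"User-Agent": "mx-audit-sketch/0.1"})
        with urllib.request.urlopen(req, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        return [name for name, pattern in GOVERNANCE_SIGNALS.items()
                if not re.search(pattern, html, re.I)]

    # Every name this prints is a point where the consumer's defaults,
    # not the publisher's terms, decide what happens to the content.
    print(governance_gaps("https://example.com/pricing"))  # hypothetical URL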

The EAA multiplier

For organisations with European customers, there is a regulatory layer on top. The European Accessibility Act, Directive (EU) 2019/882, has applied since 28 June 2025. Public-facing PDFs, e-books, banking applications, ticket machines, and digital content from in-scope businesses must now meet the relevant accessibility standard. For PDFs the standard is ISO 14289-1 (PDF/UA-1), which mandates a structure tree, marked content, a declared reading order, alternative text, and a conformance claim.

The structure that satisfies the law is the same structure that lets an AI agent read the document without falling back to vision-based reconstruction. The disability case justifies the work; the machine case multiplies the return. Compliance with the EAA is compliance with the machine experience. The work to satisfy the human auditor is the same work that satisfies the agent reading the document next year.

The audit's PDF accessibility check reports per-document conformance against the ISO baseline: which PDFs are tagged, which carry the Level 2 XMP declaration, which would fail an EAA enforcement audit if it landed tomorrow. The fix path is clear: regenerate the PDF through a pipeline that produces tagged output, and refuse to deploy any version that fails the gate.
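The cheapest preconditions of that ISO baseline can be triaged mechanically, here sketched with the pypdf package (my choice of library, not necessarily the audit's). A document without a structure tree or the Marked flag cannot conform; one with both still needs the full check.

    from pypdf import PdfReader

    def tagged_pdf_triage(path):
        root = PdfReader(path).trailer["/Root"]  # the document catalog
        has_struct_tree = "/StructTreeRoot" in root
        mark_info = root["/MarkInfo"].get_object() if "/MarkInfo" in root else {}
        # pypdf's BooleanObject compares equal to Python bools
        marked = mark_info.get("/Marked", False) == True
        return {
            "structure tree present": has_struct_tree,
            "marked-content flag set": marked,
            "declared language": root.get("/Lang"),
            "verdict": ("cannot conform" if not (has_struct_tree and marked)
                        else "passes the mechanical preconditions"),
        }

    print(tagged_pdf_triage("annual-report.pdf"))  # hypothetical file

Note the asymmetry: passing this triage proves nothing about full conformance, but failing it proves non-conformance, which is the useful direction for a deployment gate.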

What you get

Each audit delivers four artefacts:

  • A client-facing PDF report with prioritised findings, severity ratings, and specific page-level fixes. The report is itself an EAA-conformant tagged PDF, so the artefact you receive is an example of the standard the audit recommends.
  • The raw machine-readable data: CSV files for every dimension (accessibility issues, image optimisation, link analysis, marker reachability, Pa11y findings, pages audited, structured-data findings) plus JSON sidecars for the verification trail and the LLM-judgment output. You can ingest these into your own tooling without re-parsing the PDF; a short ingestion sketch follows this list.
  • A verification report listing every claim in the human-readable text alongside the source it was derived from. Every numeric or behavioural claim in the audit is traceable to a specific CSV row, JSON field, or cached HTML extract. No hand-waving; no "the audit found" without a citation.
  • A two-pass quality gate, mechanical and editorial. The mechanical gate verifies every fact in the report against the underlying data. The editorial gate, called the "fierce critic" pass, looks for the kinds of failure mode that defeat the value of the audit even when the facts are right: leaked boilerplate, uncited industry claims, internal contradictions, scope overreach. Both gates must pass before the report ships.
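As an illustration of the ingestion path mentioned above, the sketch below reads one findings CSV and routes the blockers to stdout. The filename and column names are placeholders, not the audit's actual schema.

    import csv
    from collections import Counter

    with open("structured-data-findings.csv", newline="", encoding="utf-8") as f:
        findings = list(csv.DictReader(f))

    # How bad is it, at a glance?
    print(Counter(row["severity"] for row in findings))

    # Feed the blockers straight into the team's issue tracker.
    for row in findings:
        if row["severity"] == "critical":
            print(f'{row["url"]}: {row["finding"]}')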

Where the audit pays for itself

The audit's value is not the report. The report is the proof. The value is the work the audit makes legible: which specific changes, in which specific files, will cause your content to be read correctly by the next agent that lands on it. Three vectors return the cost of the audit.

The first is reduced inference cost across every reader. An agent that can navigate a tagged PDF by structure spends an order of magnitude less compute than one reconstructing the document by vision. An agent that reads structured data does no work to extract the price. The compute differential compounds across every machine read of the document for the rest of its life. Publishers do not pay this compute bill directly, but they pay the consequence: agents that can read your content cheaply read it more often, cite it more often, and recommend it more often than agents that find it expensive to read. The publishers whose content is cheap to read win the citation lottery. I am not certain anyone has measured this in pounds yet, but the direction is unambiguous.

The second is reduced hallucination, which is reduced loss. Agents that cannot reach the truth do not say "I don't know". They guess. The guesses become citations in research summaries, clauses in generated contracts, answers in customer support conversations the publisher is not in the room for. Each of those errors is a small loss the publisher absorbs without ever seeing it. Each fix to a structure-tree or schema contradiction removes a class of error from a class of conversations the publisher will never witness.

The third is reduced regulatory exposure. The EAA enforcement window is open. The fines are non-trivial. The audit's PDF accessibility section identifies which documents would fail an enforcement check today, which is more useful than discovering it from a regulator's letter.

Why MX rather than ad-hoc fixes

One reasonable question is why this kind of work needs a framework at all. Could a publisher not just fix the headings, the schema, the PDF tags, the agent-blocking, item by item, as they come up?

They could, and many do. The cost is paid in the discipline gap that follows: every team that touches the site adds new content, new components, new pages, new pipelines. Without a documented standard for what a machine-readable page or PDF looks like in this organisation, every new addition becomes a fresh decision. Some additions get the structure right, others do not, and over time the corpus drifts back into the same condition that triggered the first audit. The audit becomes a recurring quarterly cost rather than a one-shot fix.

MX is the discipline layer that prevents the drift. It is a shared vocabulary (the field dictionary), a shared profile (what fields are required at which conformance level), a shared check (the audit and its compliance gates), and a shared reference (the books, the appendices, the cookbook recipes). Once the team has the discipline, every new page, every new PDF, every new feature is built MX-aware on the first commit, not after the next audit catches it.

The framework is open. The standard is published. The dictionary, the audit tool, the cookbook, the books are all in the open. The audit is the entry point. The discipline is what keeps the gain.

Next step

If your organisation publishes content that machines now read at scale (and effectively all organisations are in this category in 2026), the audit is the cheapest way to find out where you stand. It costs less than a single integration project and produces a fix list your engineering team can act on directly, in specific files, against named issues. No vague quarter-long programme of work; no consultants returning every month with new framing. The fixes are specific or they are not fixes.

The result is a published corpus that machines can read correctly, an EAA-compliant PDF estate that survives a regulatory check, and a discipline layer that prevents the next drift. The audit is the entry point. The work is upstream.
