Tagged PDFs Are MX
MX is not just HTML
A common shorthand for Machine Experience is "structured HTML for AI agents." That shorthand is convenient and incomplete. The web is not a single carrier. A modern publisher ships HTML pages, PDF reports, DOCX contracts, MP4 demos, audio interviews, CSV datasets, and ICS calendar feeds. Every one of those carriers either gives a machine the structure it needs to act, or it forces the machine to reconstruct that structure by inference. MX is the discipline of choosing the first option in every format you publish, not just the one that renders in a browser.
The European Accessibility Act, Directive (EU) 2019/882, entered into force in 2019, and its obligations apply from 28 June 2025. It targets human disability accommodation. Its scope is far wider than HTML: PDFs, e-books, ATMs, ticket machines, banking apps. For the PDF case it leans on ISO 14289-1, the PDF/UA standard, which mandates that every PDF carry a structure tree, marked content, declared reading order, language tags, and metadata. The law speaks the language of disability inclusion. The artefact it produces, by happy convergence, is the same artefact a machine reader needs.
What a tagged PDF actually is
Open a PDF in a viewer that shows the document tree. An untagged PDF is a sequence of positioned glyphs and pictures. The viewer can render it because rendering only requires geometry: this character at this point, that picture at that rectangle. Nothing in an untagged PDF declares that a particular glyph cluster is a heading, that one block of text is a paragraph and another is a caption, that this row of cells belongs to that table, that the reading order goes top of left column then top of right column rather than line by line straight across.
A tagged PDF carries an additional structure inside the file: a tree of <Document>, <Sect>, <P>, <H1> through <H6>, <L> for lists, <LI> for items, <Table> with proper <TR>, <TH>, <TD> nesting, <Figure> with alternative text, <Caption> bound to the figure or table it describes. The /MarkInfo dictionary declares that the content is marked. The XMP packet declares the conformance level via pdfuaid:part. Every glyph in the visible page belongs to a node in this tree. The visible page is unchanged. The structure is added alongside.
The convergence: human accessibility equals machine readability
A screen-reader user navigating a tagged PDF jumps between headings, lists, and tables using the structure tree. The viewer reads the marked content in declared reading order. The user hears "level two heading: Methods" rather than a slurred run of letters across a column boundary.
An AI agent ingesting the same tagged PDF reads exactly the same tree. It locates sections by heading level. It walks tables row by row knowing which cell is a header and which is data. It pairs figures with their captions. It honours the declared reading order rather than guessing across multi-column layouts. The cognitive work that the screen reader does for the human and the cognitive work that the agent does for the machine are the same work, performed against the same metadata, producing the same correct answer.
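The tree walk can be sketched with a toy model. The nested tuples below stand in for the structure tree; a real reader walks the PDF object model under /StructTreeRoot with a PDF library, not a Python literal:

```python
# Toy model of a tagged-PDF structure tree: (tag, children-or-text).
# Illustrative only; real agents read /StructTreeRoot from the PDF
# object graph via a library such as pikepdf.
doc = ("Document", [
    ("H1", "Annual Report"),
    ("Sect", [
        ("H2", "Methods"),
        ("P", "We measured..."),
        ("Table", [
            ("TR", [("TH", "Year"), ("TH", "Revenue")]),
            ("TR", [("TD", "2024"), ("TD", "1.2M")]),
        ]),
    ]),
])

def headings(node, out=None):
    """Collect (level, text) pairs for H1..H6 nodes, in reading order."""
    if out is None:
        out = []
    tag, body = node
    if tag.startswith("H") and tag[1:].isdigit():
        out.append((int(tag[1:]), body))
    elif isinstance(body, list):
        for child in body:
            headings(child, out)
    return out

print(headings(doc))  # [(1, 'Annual Report'), (2, 'Methods')]
```

The same walk, applied to the <Table> branch, gives the agent header cells and data cells by role rather than by guessed position.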
The convergence is not coincidence. It is the consequence of treating "what does this content mean" as a separate question from "how does this content render." Once you separate the two, the same answer to the meaning question serves every consumer that needs an answer to the meaning question: people who cannot see the rendering, people on small screens, people in noisy environments, agents reading the document programmatically, search engines indexing the content, downstream pipelines extracting facts. Render is for one audience. Meaning is for everyone.
The cost of an untagged PDF
An agent confronted with an untagged PDF has two options. It can fall back to optical-character-recognition-style reconstruction: rasterise the page, run vision, segment into regions, classify each region as heading or paragraph or caption, group regions into a logical reading order, attempt to recover table rows and columns. This is expensive in compute, brittle on multi-column layouts, and frequently wrong. Or it can extract the raw text stream and treat the document as flat prose, losing every structural signal.
Both fallbacks introduce errors. Heading levels are guessed, often inverted. Tables collapse into ribbon text where adjacent cells run together. Captions detach from their figures. Footnotes interleave with body text. Reading order leaks across columns and breaks sentences in half. The agent, having reconstructed something approximating the document, then tries to answer questions against that reconstruction. The errors compound: a wrong heading level produces a wrong section boundary which produces a wrong scope for a query which produces a wrong answer.
The user reading the agent's answer cannot see the reconstruction step. They see a confident statement that may or may not reflect the source. When the agent has visibly hallucinated, the failure is at least visible. When the agent has confidently misread an untagged PDF, the failure looks identical to a correct reading until the user goes back to the source and checks. Many users do not check.
Inference, hallucination, and energy
Reconstruction has three costs and they all compound across an industry of trillions of agent reads per year.
The first is inference cost. Vision-based document reconstruction runs full frame analysis over every page; tagged ingestion is a structured tree walk. The compute differential is one to two orders of magnitude depending on document complexity. Multiply by every agent reading every PDF on every site. Tagged carriers are a measurable energy reduction at industry scale.
The second is hallucination rate. An agent that has misread a table will quote made-up numbers from it. An agent that has interleaved footnotes with body text will attribute body claims to footnote authors and footnote claims to body authors. An agent that has lost the reading order will summarise the right-hand column when asked about the left. Tagged source removes the reconstruction step that introduces these errors. The hallucination is not eliminated, but the structural class of hallucination is.
The third is downstream cost. A misread answer becomes a citation in a research summary, a clause in a generated contract, a row in a generated dataset. The error propagates outward through the chain of agents that read the first agent's output. Catching it at the point where the structure was first lost, rather than at the point where the propagated error finally surfaces, is the only economically defensible place to fix it.
EAA compliance, viewed through this lens, is not an accessibility tax. It is a compute, accuracy, and energy investment that pays back across every machine read of the document for the rest of its life. The disability case justifies the work. The machine case multiplies the return.
Beyond PDF: every carrier needs a structure tree
The PDF case generalises. Every non-HTML carrier has an analogous structure decision and an analogous standard.
DOCX carries its structure in the OOXML schema: paragraphs marked with style names, headings with outline levels, tables with row and cell roles. Word writes this by default; export pipelines often strip it. The mitigation is to publish DOCX with styles preserved rather than flattened to direct formatting.
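A crude audit of style preservation can be sketched by looking inside the DOCX zip at word/document.xml. The regex heuristic and the ratio it computes are illustrative, not part of the OOXML specification; a production audit would parse the XML properly:

```python
import io
import re
import zipfile

def styled_paragraph_ratio(docx_bytes):
    """Crude audit: fraction of <w:p> paragraphs in word/document.xml
    that reference a named style via <w:pStyle>. A flattened export
    (direct formatting only) scores near zero."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    paras = re.findall(r"<w:p[ >]", xml)
    styled = re.findall(r"<w:pStyle\b", xml)
    return len(styled) / len(paras) if paras else 0.0

# Minimal stand-in document.xml for demonstration: one styled
# heading paragraph, one unstyled paragraph.
doc_xml = (
    '<w:document><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr></w:p>'
    '<w:p></w:p>'
    '</w:body></w:document>'
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", doc_xml)

print(styled_paragraph_ratio(buf.getvalue()))  # 0.5
```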
EPUB inherits HTML semantics inside a spine of XHTML files plus a navigation document declaring the reading order. EPUB Accessibility 1.1 (the W3C-recommended specification) demands the same heading hierarchy, alternative text, and declared language that HTML accessibility demands. A non-conforming EPUB looks fine in a reader and reads as flat prose to an agent.
Audio and video carriers need transcripts and captions, and increasingly need WebVTT cues with declared roles for speakers, sounds, and chapter boundaries. The transcript is the structural carrier. An agent asked "what was said about X around minute thirty" cannot answer from the audio alone unless the audio has been transcribed and the transcript is reachable.
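The "around minute thirty" lookup can be sketched against a hand-written WebVTT fragment with <v> speaker voice spans. The parser is deliberately minimal, far short of a conforming WebVTT reader, and the speakers and dialogue are invented:

```python
import re

VTT = """WEBVTT

00:29:40.000 --> 00:29:55.000
<v Dr Okafor>The X results replicated across both cohorts.

00:31:10.000 --> 00:31:20.000
<v Host>So X held up under the second analysis?
"""

def cues_near(vtt_text, minute, window=2):
    """Return (speaker, text) for cues starting within `window` minutes
    of the given minute. Illustrative parser, not a full WebVTT reader."""
    out = []
    pattern = re.compile(r"(\d+):(\d+):[\d.]+ --> [^\n]+\n<v ([^>]+)>(.*)")
    for m in pattern.finditer(vtt_text):
        h, mm, speaker, text = m.groups()
        start = int(h) * 60 + int(mm)
        if abs(start - minute) <= window:
            out.append((speaker, text.strip()))
    return out

print(cues_near(VTT, 30))
# [('Dr Okafor', 'The X results replicated across both cohorts.'),
#  ('Host', 'So X held up under the second analysis?')]
```

With no transcript there is nothing for the pattern to match against: the question is unanswerable, not merely expensive.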
CSV datasets need column-name headers and a published schema. CSVW, a W3C recommendation, lets a CSV declare its column types, units, primary keys, and relationships in a JSON-LD descriptor. An agent ingesting an untyped CSV guesses column meanings. An agent ingesting a CSVW-described CSV reads the schema and acts correctly.
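The difference can be sketched with a simplified descriptor. Real CSVW descriptors are JSON-LD documents per the W3C recommendations and carry much more (units, keys, relationships); this sketch keeps only column names and datatypes:

```python
import csv
import io
import json

# Simplified CSVW-style descriptor; a real one is a JSON-LD document
# per the W3C "CSV on the Web" recommendations.
descriptor = json.loads("""
{
  "tableSchema": {
    "columns": [
      {"name": "year",    "datatype": "integer"},
      {"name": "revenue", "datatype": "number"},
      {"name": "region",  "datatype": "string"}
    ]
  }
}
""")

CASTS = {"integer": int, "number": float, "string": str}

def typed_rows(csv_text, desc):
    """Yield dicts with values cast per the descriptor's column datatypes."""
    cols = desc["tableSchema"]["columns"]
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield {c["name"]: CASTS[c["datatype"]](row[c["name"]]) for c in cols}

data = "year,revenue,region\n2024,1200000.5,EMEA\n"
print(list(typed_rows(data, descriptor)))
# [{'year': 2024, 'revenue': 1200000.5, 'region': 'EMEA'}]
```

The agent with the descriptor gets an integer year and a numeric revenue; the agent without it gets three strings and a guess.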
The pattern is the same across every format. Render is for one audience. Meaning is a separate layer that has to be added deliberately, in the format's native idiom, every time. MX is the discipline that says the meaning layer is mandatory in every carrier you publish, not optional in some and default in others.
What publishers should do
The first action is to audit every published PDF on the site for tagging. Open each one in a tool that can show the structure tree, or run an automated check against the PDF/UA conformance criteria. Any document that is not tagged needs to be regenerated from its source through a pipeline that produces a tagged output. For HTML-first pipelines, headless Chrome with --export-tagged-pdf reads HTML and emits a tagged PDF whose structure tree comes from the HTML accessibility tree, which means investments in HTML semantic correctness flow directly into the PDF output.
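A minimal sketch of that invocation, assembling the command rather than hard-coding it; the binary name varies by platform and the flag's behaviour by Chromium version, so treat this as a starting point:

```python
def tagged_pdf_command(html_path, pdf_path, chrome="chromium"):
    """Build the headless-Chromium invocation that emits a tagged PDF.
    The output's structure tree is derived from the HTML accessibility
    tree, so semantic HTML flows straight into the PDF."""
    return [
        chrome,
        "--headless",
        "--export-tagged-pdf",         # emit a structure tree in the PDF
        f"--print-to-pdf={pdf_path}",
        html_path,
    ]

cmd = tagged_pdf_command("report.html", "report.pdf")
print(" ".join(cmd))
# In a real pipeline: subprocess.run(cmd, check=True)
```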
The second action is to declare conformance explicitly. A tagged PDF without the pdfuaid:part XMP claim is conformant in fact but not in declaration; verifiers and audit pipelines that key on the claim will report it as Level 1 only. The XMP property is small and the cost of writing it is negligible.
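The claim itself is one property in the XMP packet. A sketch of the fragment and a stdlib check for it follows; the pdfuaid namespace URI is the standard one, while the surrounding packet in a real document carries many more properties:

```python
import xml.etree.ElementTree as ET

# Minimal XMP fragment carrying the PDF/UA conformance claim.
XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
      <pdfuaid:part>1</pdfuaid:part>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""

def pdfua_part(xmp_text):
    """Return the declared PDF/UA part, or None if the claim is absent."""
    root = ET.fromstring(xmp_text)
    node = root.find(".//{http://www.aiim.org/pdfua/ns/id/}part")
    return int(node.text) if node is not None else None

print(pdfua_part(XMP))  # 1
```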
The third action is to extend the discipline beyond PDF. Audit DOCX exports for preserved styles. Audit EPUB packages for navigation documents. Audit video pages for transcripts and captions. Audit CSV downloads for headers and a schema descriptor. Make the meaning layer a publishing requirement, not a publishing afterthought.
The fourth action is to put the audit in front of the publish step rather than after it. A pre-deploy gate that fails the build when an untagged PDF would ship costs a few seconds of CI time and prevents the document from reaching the public corpus where every machine read of it would compound the original omission.
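A minimal sketch of such a gate. The byte scan is a cheap heuristic, not PDF/UA verification: compressed object streams can hide the names, so a production gate should open the files with a PDF library rather than grep their bytes:

```python
import pathlib
import sys
import tempfile

def looks_tagged(pdf_path):
    """Crude heuristic: a tagged PDF names /StructTreeRoot and /MarkInfo
    in its object graph. Not PDF/UA verification, only a cheap gate
    that catches obviously untagged output."""
    data = pathlib.Path(pdf_path).read_bytes()
    return b"/StructTreeRoot" in data and b"/MarkInfo" in data

def gate(pdf_dir):
    """Return non-zero (fail the build) if any PDF under pdf_dir is untagged."""
    untagged = [p for p in sorted(pathlib.Path(pdf_dir).rglob("*.pdf"))
                if not looks_tagged(p)]
    for p in untagged:
        print(f"untagged PDF would ship: {p}", file=sys.stderr)
    return 1 if untagged else 0

# Demonstration against two synthetic files.
d = tempfile.mkdtemp()
pathlib.Path(d, "tagged.pdf").write_bytes(b"%PDF-1.7 /MarkInfo /StructTreeRoot")
pathlib.Path(d, "flat.pdf").write_bytes(b"%PDF-1.7 no structure")
print(gate(d))  # 1: the flat PDF blocks the deploy
```

Wire the return value into the CI job's exit status and the untagged document never leaves the build.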
CogNovaMX follows the standard
This is not advice given from a distance. CogNovaMX, trading name of Digital Domain Technologies Ltd, publishes its own books, white papers, and audit reports to the ISO 14289-1 (PDF/UA) baseline. Every public PDF on mx.allabout.network carries a structure tree, marked content, declared reading order, alternative text on figures, and a Level 2 pdfuaid:part XMP claim in its metadata packet. The publishing pipeline runs an automated tagging gate before deploy, so an untagged PDF cannot reach the public corpus by accident.
Compliance with the European Accessibility Act is the floor we hold ourselves to before we recommend it to clients. The audit suite we sell to organisations is the same one we run on our own site, every release, on every PDF we ship.
What else lives in the metadata: provenance, lifecycle, agent affordances
The Level 2 conformance claim and the structure tree are the load-bearing pieces of EAA compliance. The XMP packet that carries the conformance claim has room for considerably more, and an agent that has just opened a PDF is asking more questions than "is this tagged".
The questions sort into four groups.
Identity and provenance. Where did this PDF come from, and is the version I have the current one? A canonical URL declared in the XMP tells an agent receiving the artefact via email or Slack where the official copy lives. A source repository and commit SHA tell it the precise build the document was produced from. Supersedes and superseded-by links carry the version chain so an agent reading a year-old contract can follow the pointer to its replacement before quoting from it. Cryptographic signing, the work the Reginald project does, closes the spoofing problem on top.
Recency and lifecycle. Is this still the truth? An expiry date marks content whose validity ends on a known date: pricing PDFs, SLAs, compliance reports, time-bound regulatory text. A review-by date is softer; it says the document is scheduled for editorial review, not retirement. A correction-SLA tells the consumer how fast errors will be fixed when found. An agent indexing a corpus can prune stale content from its working set with one comparison rather than a content read.
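The one-comparison prune can be sketched as follows; the field names and URLs are illustrative, not a published schema:

```python
from datetime import date

# Hypothetical lifecycle fields as an indexer might hold them after
# reading each document's metadata packet.
corpus = [
    {"url": "/pricing-2023.pdf", "expires": date(2024, 1, 1)},
    {"url": "/whitepaper.pdf",   "expires": None},           # no expiry declared
    {"url": "/sla-current.pdf",  "expires": date(2099, 1, 1)},
]

def live_documents(docs, today):
    """One comparison per document: keep anything unexpired."""
    return [d for d in docs if d["expires"] is None or d["expires"] > today]

print([d["url"] for d in live_documents(corpus, date(2025, 6, 28))])
# ['/whitepaper.pdf', '/sla-current.pdf']
```

No content read, no model call: the stale pricing PDF drops out of the working set on a date comparison alone.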
Action affordances. What may I do with this, and what should I do next? A machine-readable license URI lets an agent decide reuse without parsing prose. Reuse-terms expand on the licence to cover edge cases the licence text does not address: training-data inclusion, summarisation, derivative work. Agent-instructions carries an explicit message to AI consumers ("cite as X", "summarise but link back", "do not reproduce verbatim"). Related-docs gives the agent a curated reading list to fetch for context. For documents describing a service or dataset, an API endpoint or data endpoint sends the agent to the operational entry point rather than leaving it to re-derive the URL from prose.
Semantics and structure. What is this about, in machine-resolvable terms? A summary field is a one-to-two-sentence machine-oriented précis that lets an agent decide if the artefact is relevant before reading the body. Topic identifiers, given as Wikidata QIDs or Schema.org Concept URLs, lift "tags" from free text to a stable, queryable taxonomy. Named-entity identifiers (Wikidata for people and organisations, ORCID for authors, domain names for organisations) let a corpus indexer build an entity graph from metadata alone, without entity extraction. A speakable summary is the voice-friendly version of the summary, suitable for an assistant to read aloud when the user asks "what is this about". Conforms-to lists the standards the document declares conformance to (PDF/UA-1, EAA, WCAG 2.1, MX Core Level 3) so the agent can read one field and know which contracts the artefact claims.
And a fifth group, often overlooked: negative space. What should an agent not do with this artefact? A training-data policy is the embedded equivalent of the robots.txt training-corpus directive, surviving copy and syndication. A no-LLM-reprocess flag asks consumers to quote rather than rewrite, the right setting for legal text and official records. A do-not-index flag is the embedded analogue of robots.txt noindex, useful for documents that are technically reachable but should not appear in public search.
None of these need to be invented. Most have analogues in Dublin Core, Schema.org, and the IETF metadata vocabularies. The contribution MX makes is consolidating them into one namespace that agents read once and rendering them in the carrier-native idiom every time the document is emitted, so they survive the copying and reformatting that normally strips them away. The XMP packet of a tagged PDF is one rendering. Page-level <meta name="mx:..."> tags are another. JSON-LD on the canonical URL is a third. Each rendering carries the same governance signals; together they make the document legible to the next agent that has to act on it.
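The "same signals, several renderings" point can be sketched with one value set emitted two ways. The mx: field names and the context URL here are invented for illustration, not a published namespace:

```python
import json

# One set of governance signals, rendered two ways; every rendering
# carries the same values. Field names are hypothetical.
signals = {
    "canonical": "https://example.org/reports/q3.pdf",
    "expires": "2026-06-30",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "training-data": "allowed",
}

def as_meta_tags(fields):
    """Page-level rendering: one <meta> tag per signal."""
    return "\n".join(
        f'<meta name="mx:{k}" content="{v}">' for k, v in fields.items())

def as_json_ld(fields):
    """Canonical-URL rendering: the same signals as a JSON-LD object."""
    return json.dumps({"@context": "https://example.org/mx", **fields},
                      indent=2)

print(as_meta_tags(signals))
print(as_json_ld(signals))
```

A third renderer targeting the XMP packet would carry the identical values again, which is what lets the signals survive copying between carriers.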
Conclusion
MX has been described, fairly, as the practice of treating machines as a first-class audience for structured HTML. The description is correct as far as it goes. It is also smaller than the practice itself.
The web has always been a multi-carrier medium. People download contracts as PDFs, datasets as CSVs, podcasts as MP3s, slide decks as DOCX, code as Markdown. Each carrier has its own native structure idiom. Each idiom either declares meaning or hides it. Where the meaning is declared, machines do less work, hallucinate less often, and consume less energy. Where it is hidden, every read pays the cost again.
The European Accessibility Act, by mandating structure for human-disability reasons, has set up a regulatory tailwind that aligns with the machine readability case exactly. Compliance with the law is compliance with the machine experience. The work to satisfy the human auditor is the same work that satisfies the agent reading the document next year.
Treat every carrier as MX. Publish the structure that the standard for that carrier mandates. The compounding return on a few seconds of pipeline time at publish accrues across every machine read for the rest of the document's life.