The Markdown Trap: What AI Agents Lose When They Ask for the Wrong Format

In February 2026, Cloudflare shipped a feature called Markdown for Agents. The pitch is token efficiency: convert HTML to Markdown at the CDN edge so AI agents receive smaller payloads. Cloudflare cites approximately 80% token reduction. The logic is seductive, and for a certain class of content it is correct. For another class — the structured, governed web content that organisations are actively building for the machine-readable future — it is exactly backwards.

I fetched a single page twice to see what the difference actually looks like.

The experiment

The page was https://mx.allabout.network/books/handbook.html, the landing page for MX: The Handbook. I requested it first as a browser would, then with Accept: text/markdown as an AI agent might if it defaulted to Markdown for all requests. The HTML response was 32,236 bytes. The Markdown response was 21,890 bytes. That is 32% smaller, not 80%, though the exact figure varies by page composition.
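The arithmetic behind those figures, using the two measured byte counts:

```python
# Byte counts measured for the two responses above.
html_bytes = 32_236       # Accept: text/html (browser-style request)
markdown_bytes = 21_890   # Accept: text/markdown

stripped = html_bytes - markdown_bytes   # bytes removed by the conversion
reduction = stripped / html_bytes        # fraction of the page lost

print(f"{stripped} bytes stripped ({reduction:.0%} smaller)")
```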

The question is not the size. The question is what the 10,346 bytes contained.

What disappeared

The bytes were not padding. They were not boilerplate navigation a human might safely skip. They were almost entirely structured metadata.

The governance layer. The HTML page carries MX carrier tags in its <head>: mx:content-policy: extract-with-attribution, mx:attribution: required, mx:status: active, mx:contentType: landing-page. These are the machine-readable declaration of what an agent may do with this material. An agent receiving the Markdown version never sees these fields. It reads the text, believes it has read the page, and proceeds without knowing it was required to attribute. The governance layer has been silently removed.
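Assuming the carrier tags are exposed as ordinary <meta name="mx:..."> elements in the head (an assumption about the carrier mechanism, not a confirmed detail), an HTML-aware agent can collect them with the standard library alone. A minimal sketch, with a toy head fragment standing in for the real page:

```python
from html.parser import HTMLParser

class MXMetaParser(HTMLParser):
    """Collect <meta name="mx:..."> governance tags from an HTML head."""
    def __init__(self):
        super().__init__()
        self.governance = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        name = a.get("name", "")
        if name.startswith("mx:"):
            self.governance[name] = a.get("content", "")

# Toy fragment; the real page carries these values in its <head>.
head = """<head>
<meta name="mx:content-policy" content="extract-with-attribution">
<meta name="mx:attribution" content="required">
</head>"""

parser = MXMetaParser()
parser.feed(head)
print(parser.governance)
```

An agent handed the Markdown conversion has nothing for this parser to find: the fields are simply absent from the payload.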

Open Graph metadata. og:type: book tells an agent this resource describes a Book. og:locale: en_GB establishes locale. Image dimensions are declared. All of this disappears in the Markdown conversion. The type declaration matters: an agent that knows it is reading a Book resource can query Schema.org for the Book vocabulary and process the content accordingly. An agent reading plain text has no such signal.

Discovery links. The page carries <link rel="llms-txt">, <link rel="sitemap">, and <link rel="ai-txt">. These are the entry points an agent follows to find the AI directory, the structured content inventory, and the crawl guidance for the site. When an agent reads the Markdown version, those links do not exist. The discovery chain is severed at the first request. The agent cannot find what the site deliberately made available, because the page it read had already had the signposts removed.
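Following the discovery chain starts with reading those link elements. A sketch of the first step, using hypothetical href values (the real targets live on the actual page):

```python
from html.parser import HTMLParser

# The rel values named above; the hrefs below are illustrative.
DISCOVERY_RELS = {"llms-txt", "sitemap", "ai-txt"}

class DiscoveryLinkParser(HTMLParser):
    """Collect <link rel="..."> discovery entry points from an HTML head."""
    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") in DISCOVERY_RELS:
            self.links[a["rel"]] = a.get("href", "")

p = DiscoveryLinkParser()
p.feed('<head><link rel="llms-txt" href="/llms.txt">'
       '<link rel="sitemap" href="/sitemap.xml"></head>')
print(p.links)
```

Fed the Markdown version of the page, the same logic yields an empty dictionary: there are no link elements left to follow.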

The JSON-LD false comfort. The page's JSON-LD structured data does appear in the Markdown output, but as a fenced code block. An HTML parser encountering <script type="application/ld+json"> knows it has found authoritative structured data and extracts the block accordingly. A language model encountering ```json in a Markdown document knows it has found a code sample. It may read the contents. It will not automatically treat them as a structured data description of the resource. The JSON-LD's presence looks like preservation. It is not. The structured data signal has been demoted from machine-processable metadata to human-readable code illustration. The difference is the difference between a signpost and a photograph of a signpost.
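The difference is visible in how an HTML-aware client extracts the block. The script element's type attribute tells the parser this is structured data, not prose; a fenced code block carries no such declaration. A minimal sketch with a toy JSON-LD payload:

```python
import json
from html.parser import HTMLParser

class JSONLDParser(HTMLParser):
    """Extract <script type="application/ld+json"> blocks as parsed data."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.documents = []

    def handle_starttag(self, tag, attrs):
        self._in_jsonld = (tag == "script" and
                           dict(attrs).get("type") == "application/ld+json")

    def handle_data(self, data):
        if self._in_jsonld:
            self.documents.append(json.loads(data))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

p = JSONLDParser()
p.feed('<script type="application/ld+json">'
       '{"@type": "Book", "name": "MX: The Handbook"}</script>')
print(p.documents[0]["@type"])
```

In the Markdown version the same bytes sit inside a json fence, and nothing marks them as a structured description of the resource rather than an illustrative code sample.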

Machine-readable temporal data. The HTML uses <time datetime="2026-04">April 2026</time>. The Markdown renders this as the plain text "April 2026". The ISO 8601 date is gone. An agent comparing publication dates or sorting by recency is now working from human-formatted text rather than a parseable timestamp.
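Sorting by recency from the HTML is mechanical, because the datetime attribute carries the parseable value; from the Markdown text, the agent must first guess that "April 2026" is a date at all. A sketch of the HTML side:

```python
from html.parser import HTMLParser

class TimeParser(HTMLParser):
    """Collect machine-readable datetime values from <time> elements."""
    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        if tag == "time":
            value = dict(attrs).get("datetime")
            if value:
                self.datetimes.append(value)

p = TimeParser()
p.feed('Published <time datetime="2026-04">April 2026</time>')

# "2026-04" is a reduced-precision ISO 8601 value (year-month),
# so split it directly rather than relying on a full-date parser.
year, month = map(int, p.datetimes[0].split("-"))
print(year, month)
```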

Language and direction. lang="en-GB" and dir="ltr" establish the content's language and text direction at the document level. An agent performing translation, locale-aware ranking, or language detection loses this signal and must infer language from the prose — a solvable problem, but an unnecessary one when the answer was there.

The footer. The copyright statement is completely absent from the Markdown output. For a page carrying attribution requirements, this is not a minor omission.

The token efficiency argument examined

The 32% reduction sounds compelling. It looks less compelling when you examine what was removed. Those 10,346 bytes were almost entirely structured metadata: governance tags, Open Graph declarations, discovery links, semantic structure, machine-readable dates. The prose — the words a human reads — was almost fully preserved.

An agent optimising for token efficiency by requesting Markdown has made a trade: slightly fewer tokens on the payload, in exchange for losing the metadata that tells it what kind of resource this is, what it may do with the content, how to attribute it, and where to go next. This is not efficiency. It is discarding the instrument panel to reduce the weight of the aircraft.

The efficiency argument applies cleanly to a different category of content: plain-text documents where the content is the message. An llms.txt file, a plain article with no governance metadata, a README. For these, Markdown is often the right format. The content has no structured metadata layer to lose. The prose is what the agent came for, and Markdown delivers prose efficiently.

The category error is applying a single efficiency heuristic to all content types — including requests for pages where the metadata is not ancillary to the content but is the point of the page.

The silent failure mode

This is the part that concerns me most, because it leaves no obvious trace.

When an agent requests a governed page in Markdown format and receives stripped text, it does not receive an error. The request succeeds. The agent gets content. The agent processes content. The agent produces output. Nowhere in that chain is there a signal that the attribution requirement was present and ignored, that the content policy was specified and bypassed, that the discovery links existed and were cut. The agent did not misbehave. It was given text and it read text. The damage was done upstream, at the format selection step.

The MX governance framework places machine-readable policy onto web pages specifically so that agents can read that policy and act accordingly. An agent framework that systematically strips those signals by requesting Markdown has undermined the entire governance layer without knowing it did so. The page author worked to specify what agents may do. The agent framework worked to read pages efficiently. Both worked correctly within their own logic. The combination produced a silent failure.

Silent failures are the hardest kind to fix, because no alarm sounds. The governance layer is absent, but the content appears to have been delivered. The attribution requirement is there in the HTML, visible to anyone who views the page source, honoured by agents that request HTML — and invisible to agents that request Markdown.
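One mitigation an agent framework could apply is a cheap guard before trusting a converted payload: check whether governance markers present in the HTML source survived the conversion. A hypothetical sketch, assuming the markers appear literally in the HTML as mx: names and that the framework can fetch both representations:

```python
# Marker names taken from the page examined above.
GOVERNANCE_MARKERS = ("mx:content-policy", "mx:attribution", "mx:status")

def governance_stripped(html_page: str, markdown_page: str) -> list[str]:
    """Return governance markers present in the HTML source of a page
    but missing from the Markdown conversion of the same page."""
    return [m for m in GOVERNANCE_MARKERS
            if m in html_page and m not in markdown_page]

html_page = ('<meta name="mx:content-policy" '
             'content="extract-with-attribution">')
markdown_page = "# MX: The Handbook\n\nBody text only."

print(governance_stripped(html_page, markdown_page))
```

A non-empty result is the alarm that the format-selection step currently never sounds: the page declared policy, and the payload the agent is about to process does not carry it.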

The opposite failure

The Cloudflare mechanism strips content from what AI agents receive. A different class of tooling runs in the opposite direction.

Adobe LLM Optimizer's "Optimize at Edge" feature detects AI agent User-Agents at the CDN edge and routes those requests to a separate optimisation backend. What comes back is not the original page. It is the original page plus AI-generated FAQs, page-level summaries, and rewritten sections produced by Adobe's backend — none of which was written by the publishing organisation. Human visitors continue to receive the original unmodified content. The response header x-edgeoptimize-request-id confirms when the optimised version was applied.

This is cloaking. The AI agent is reading a version of the page that no human at the publishing organisation authored or approved. The structured data on the original page — the JSON-LD declarations, the robots directives, the canonical URL — was written for the original content. It does not cover injected FAQs generated automatically by a third-party backend. An AI agent reading the augmented page has no way to distinguish which content is original and which is synthetic, and the page's machine-readable signals offer no guidance on the injected material.
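The one signal that does survive is the response header. An agent that wants to know whether it received the augmented variant can at least check for it; a small sketch, with an illustrative header value:

```python
def is_edge_optimized(headers: dict[str, str]) -> bool:
    """True when the x-edgeoptimize-request-id response header is present,
    indicating the optimised (augmented) variant was served."""
    return any(k.lower() == "x-edgeoptimize-request-id" for k in headers)

# Header value here is illustrative, not a real request id.
print(is_edge_optimized({"X-EdgeOptimize-Request-Id": "abc123"}))
print(is_edge_optimized({"Content-Type": "text/html"}))
```

Knowing the variant was served is not the same as knowing which paragraphs are synthetic, but it is the only machine-readable trace the mechanism leaves.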

The citation loop this creates is worth examining. Adobe measures brand visibility by tracking how often AI systems cite pages that have been through its optimiser. When an AI system cites an injected FAQ as if it were original content from the publisher, that registers as a successful outcome. The publisher is measured as visible for claims Adobe's backend generated. Whether those claims are accurate, authorised, or consistent with the rest of the site's content is outside the measurement.

The structural problem is symmetric. Cloudflare removes what the publisher put there. Adobe adds what the publisher did not put there. In both cases, the AI agent reads a version of the page that differs from what the publisher authored, and the publisher's structured signals do not accurately describe what the agent received.

What publishers and agent developers should do

Publishers running MX-governed pages should configure their CDN to serve HTML for those pages regardless of the Accept header. Cloudflare's configuration supports URL pattern rules. A publisher can write a rule that excludes specific URL patterns from Markdown conversion — book pages, product pages, pages carrying mx:content-policy headers. The rule is narrow. The service continues to operate for pages where it adds genuine value. The governed pages are served as HTML and their metadata survives the journey to the agent.
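As a sketch only: in Cloudflare's rules expression language, a path-based exclusion might match on prefixes like the following. The paths are hypothetical, and the exact field names and rule surface exposed for the Markdown conversion feature may differ from this.

```
(starts_with(http.request.uri.path, "/books/"))
or (starts_with(http.request.uri.path, "/products/"))
```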

Agent developers should treat Accept: text/markdown as a task-scoped decision, not a default. The right default for any request where governance metadata, discovery links, or structured data might be present is Accept: text/html. The agent can then parse the HTML, extract the JSON-LD, read the <meta> tags, follow the discovery links, and process the governance signals. Markdown can be requested explicitly when the task context makes clear that prose alone is sufficient.
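The default described above can be sketched as a small format-selection step. The task names are illustrative, not any real framework's API; the point is only that text/markdown is opted into per task, never assumed:

```python
# Tasks where prose alone is known to be sufficient (illustrative names).
PROSE_ONLY_TASKS = {"read-llms-txt", "summarise-readme"}

def accept_header_for(task: str) -> str:
    """Choose a request format per task, defaulting to HTML so governance
    metadata, discovery links, and structured data survive the journey."""
    if task in PROSE_ONLY_TASKS:
        return "text/markdown"
    return "text/html"

print(accept_header_for("extract-book-metadata"))
print(accept_header_for("read-llms-txt"))
```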

An agent that reads HTML and processes its metadata correctly is doing less work overall, because it is not subsequently discovering — through attribution complaints, failed discovery chains, or missing locale signals — that it read the page without the information it needed. The cost of reading HTML is lower than the cost of recovering from having read Markdown when HTML was required.

The standard is not the problem

HTTP content negotiation is correct. The Accept header mechanism is how the web has always allowed clients to declare format preferences. Nothing here argues against content negotiation. The argument is against a specific miscalibration: agents defaulting to Markdown for all requests, including requests for pages built on the assumption that their metadata will be read.

Token efficiency is a genuine concern. It should not be purchased at the cost of the signals that tell machines what the text means and what they may do with it. The web has spent thirty years building the infrastructure to carry those signals. An agent that discards them to save tokens has not found an efficiency. It has opted out of the governance layer without knowing it did so.

The 10,346 bytes that distinguish the HTML from the Markdown version of that book page are not waste. They are the governance layer. Discarding them to save tokens is building on sand.

This post draws on Chapter 22 of MX: The Protocols, "Content Negotiation and the Markdown Trap", which covers the full technical detail including publisher configuration and the correct scope for Markdown requests.