Agent Discoverability: What Your Site Is Missing
AI agents that act on behalf of users (finding services, comparing options, making recommendations, completing transactions) do not discover websites the way search engines do. They look for structured signals at specific locations. If those signals are absent, the site is functionally invisible to that class of agent, regardless of how good its content is.
Most sites are missing most of these signals. Audits of professional sites consistently find that the majority lack proper semantic HTML, most have no llms.txt file, more than half actively block major AI crawlers, and most have missing or partial Schema.org coverage. These patterns appear across organizations with sophisticated digital teams, substantial web budgets, and public commitments to digital excellence; the gap is about awareness rather than resources.
These are the invisible users, invisible for two reasons. They are invisible to site owners: they blend into analytics logs, arrive once, succeed or fail silently, and generate no complaint and no error report. And the interface is invisible to them: they cannot see animations, colors, toast notifications, or loading spinners. They parse only what is explicitly present in the HTML. Every visual cue your design relies on to communicate meaning tells them nothing.
This post diagnoses what the signals are, what the absence of each one costs, and what fixing it involves.
The 5-stage agent journey
Before examining individual layers, it helps to understand what agents are trying to do. When AI agents interact with a website, they follow a predictable journey with five stages:
- Discovery: can agents find you? Requires crawlable structure, semantic HTML, server-side rendering.
- Extraction: can agents accurately extract your content? Requires fact-level clarity, Schema.org JSON-LD, explicit content architecture.
- Comparison: can agents understand your offering relative to others? Requires explicit comparison attributes, structured pricing data.
- Pricing: can agents understand your costs without error? Requires Schema.org Product/Offer types with unambiguous currency (ISO 4217 codes).
- Confidence: can agents complete the user’s goal? Requires explicit form semantics, DOM-reflected state, persistent feedback.
The catastrophic failure principle applies: miss any stage and the entire chain breaks. A site that is discoverable but uncitable is functionally the same as a site that is invisible; the agent cannot recommend either one. Each layer described below maps to one or more of these stages.
The crawl layer
Before any content is read, an agent checks whether it is permitted to read it. This is Stage 1, Discovery, and it starts with robots.txt.
Audits show 60% of professional sites block major AI agents. Sites routinely block GPTBot, ClaudeBot, Amazonbot, and other AI crawlers through robots.txt directives or services like Cloudflare. The irony is stark: organizations want AI-mediated recommendations but actively prevent agents from accessing the content they need to make those recommendations.
Many sites block AI crawlers without intending to, typically because they added broad disallow rules to block scrapers and those rules catch legitimate AI user-agent strings too. The result is a site that has actively told AI systems to stay away. If your robots.txt blocks AI crawlers, you are opting out of AI indexing entirely. Zero recommendations. Zero citations. Complete invisibility.
Check your robots.txt and verify which user agents are disallowed. The worst-agent design principle applies here: you cannot detect which agent is visiting, because User-Agent strings are spoofable. Design for the worst agent and you are compatible with all agents.
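That check can be scripted. A minimal sketch using Python's standard-library robots parser; the agent list is illustrative, drawn from the crawlers named above, and the blanket rule shown is the kind of broad anti-scraper directive that blocks AI crawlers as a side effect:

```python
from urllib.robotparser import RobotFileParser

# Illustrative list; verify current user-agent strings with each vendor.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Amazonbot", "PerplexityBot"]

def audit_robots(robots_txt: str, url: str = "https://example.com/") -> dict:
    """Return a mapping of AI crawler name -> whether it may fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, url) for agent in AI_AGENTS}

# A broad anti-scraper rule that also catches every AI crawler:
blanket = "User-agent: *\nDisallow: /\n"
print(audit_robots(blanket))  # every agent is blocked, including the ones you want
```

Run it against your live robots.txt (fetched with any HTTP client) rather than a copy, so you are auditing what agents actually receive.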
The inverse problem also exists: no robots.txt at all, which leaves AI systems with no guidance. A minimal robots.txt that explicitly permits reputable AI crawlers is a positive signal, not just the absence of a negative one.
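A minimal sketch of that positive signal. The user-agent strings below are the ones the major crawlers publish at the time of writing; verify them against each vendor's current documentation, and substitute your own paths and sitemap URL:

```text
# Explicitly permit reputable AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Amazonbot
Allow: /

# Everyone else: your normal rules
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```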
The site description layer
An agent that is permitted to crawl your site still has no structured description of what it will find. llms.txt fills this gap, and 85% of sites have not implemented it.
A site without llms.txt forces AI systems to infer its purpose, structure, and permissions from page content alone. That inference is imprecise. The model may mischaracterize the site's subject matter, miss important content areas, or apply default permissions that do not match your intent.
llms.txt is a plain text file at your domain root. It describes the site in terms an AI can use: what it is for, what its main sections contain, which pages are most relevant, and what you permit. It takes less than an hour to write for most sites and requires no technical infrastructure beyond the ability to place a file at your domain root.
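A minimal sketch for a hypothetical site, following the proposed llms.txt format of an H1 title, a blockquote summary, and linked sections; all names and URLs are placeholders:

```markdown
# Example Cruises

> UK-based river cruise operator. All prices are in GBP.

## Cruises

- [Danube itineraries](https://example.com/cruises/danube): dates, cabins, pricing
- [Booking FAQ](https://example.com/faq): deposits, cancellation, accessibility

## Optional

- [Blog](https://example.com/blog): destination guides and travel notes
```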
There is an important caveat. llms.txt is served as a text or markdown MIME type, not HTML. Training-time crawlers (Common Crawl and its derivatives) do not typically ingest non-HTML files. At inference time, agents go straight to relevant pages; they do not fetch a site-level directory first. To close this gap today, publish the same content as an HTML page (for example, /llms.html or /about/for-agents) and include it in your sitemap, so that training crawlers ingest it and the guidance enters model knowledge bases.
A site without one is leaving its AI representation to chance. A site with one, plus an HTML equivalent in the sitemap, is providing agents with a briefing document before they start working with the content.
The service description layer
llms.txt describes content. An agent card describes a service.
If your site offers something that agents might want to use on behalf of a user (booking, data retrieval, document processing, commerce), an agent card is one way to make those capabilities findable in agentic workflows. The Agent2Agent (A2A) protocol, a Google-led initiative, defines the format: a JSON file at /.well-known/agent-card.json describing your service’s capabilities, endpoint, and authentication requirements.
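A sketch of what such a card might contain for a hypothetical booking service. The field names follow the A2A draft as published at the time of writing and may change, so treat this as illustrative rather than normative, and check the current specification before implementing:

```json
{
  "name": "Example Cruises Booking Agent",
  "description": "Checks availability and books river cruise cabins.",
  "url": "https://example.com/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": false },
  "defaultInputModes": ["application/json"],
  "defaultOutputModes": ["application/json"],
  "skills": [
    {
      "id": "check-availability",
      "name": "Check availability",
      "description": "Returns open cabins for a given itinerary and date range."
    }
  ]
}
```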
A few things worth knowing before you prioritize it. The A2A protocol is a vendor-promoted standard, not yet ratified by an independent body such as IETF or W3C. Adoption outside Google’s agent ecosystem is still growing. A site without an agent card today is in the majority, not lagging. If you are building for a transactional future, adding one is worthwhile groundwork. If you are focused on getting the foundational layers right first (semantic HTML, Schema.org, llms.txt), those will reach more agents sooner.
For informational sites, this layer is optional. For transactional or service-oriented sites looking to reach Stage 5 (Confidence), an agent card is a logical next step once the foundational layers are solid.
The page structure layer
At the individual page level, agents extract meaning from HTML structure. They rely on semantic elements (<main>, <article>, <nav>, <header>, <section>, <h1> through <h6>) to understand what a page contains and how it is organized.
Most sites audited lack proper semantic HTML. The majority use generic <div> containers with CSS classes for visual hierarchy. Agents parsing served HTML, the static HTML sent from your server before JavaScript executes, cannot distinguish navigation from content from sidebars. The structure that humans see visually does not exist in the HTML.
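The contrast in concrete terms. Class names here are illustrative, and both fragments can be styled to render identically in a browser; only the second carries structure an agent can parse:

```html
<!-- Invisible structure: generic containers styled to look like a page -->
<div class="top"><div class="menu">…</div></div>
<div class="content"><div class="title">Pricing</div>…</div>

<!-- Explicit structure: the same page, machine-readable -->
<header><nav aria-label="Main">…</nav></header>
<main>
  <article>
    <h1>Pricing</h1>
    …
  </article>
</main>
```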
This is the served HTML versus rendered HTML distinction. Many AI agents, server-side parsers like those behind ChatGPT and Claude, fetch your URL and process raw HTML without executing JavaScript. If your site requires JavaScript to display products, show prices, or render navigation, these agents see nothing. Your carefully crafted user experience is invisible to them.
Even browser-based agents that execute JavaScript need semantic structure. They can see everything humans see, but they still extract meaning from markup, just as server-side agents do. Visual design cues (color, spacing, animation) do not help agents understand content purpose.
The practical rule: design for the worst-case agent (served HTML, no JavaScript), and you automatically support all agents.
Audit a sample of your pages. Check whether the HTML uses semantic elements correctly, whether heading hierarchy is logical and unbroken, whether the main content area is identifiable as <main>, and whether navigation, sidebars, and footers are correctly labeled. These are the same checks that WCAG accessibility audits perform, the convergence principle in practice.
The structured data layer
Schema.org markup tells machines not just that something is content, but what kind of content it is. An Article is different from a Product, a LocalBusiness, an Event, or a Service. Each type carries specific properties that agents can read and act on.
Most sites audited have missing or partial Schema.org coverage. Structured data exists on some pages but not others. Product pages have pricing Schema.org, but comparison tables lack it. Event pages have dates but not registration URLs. The inconsistent implementation forces agents to guess which pages contain authoritative data.
A page with proper structured pricing metadata answers the question of what something costs in milliseconds at near-zero compute cost. A page without it forces every visiting machine to spend tokens figuring out the price, the currency, and the availability, and to risk getting it wrong. The Danube cruise error, where £2,030 became £203,000 because European decimal formatting was misinterpreted, is not a theoretical risk. It happened.
The six Schema.org types that cover about 90% of what most sites need: Organization/LocalBusiness, Article/BlogPosting, Product/Offer, FAQPage, HowTo, and WebPage/WebSite. Use JSON-LD: it separates structured data from your HTML, making it easier to maintain, simpler to implement, and more reliably parsed.
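A minimal Product/Offer sketch; names, prices, and URLs are placeholders. Note the two fields whose absence invites exactly the Danube-cruise class of error: a machine-readable decimal price and an explicit ISO 4217 currency code:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Danube River Cruise, 7 nights",
  "description": "Seven-night river cruise from Passau to Budapest.",
  "offers": {
    "@type": "Offer",
    "price": "2030.00",
    "priceCurrency": "GBP",
    "availability": "https://schema.org/InStock",
    "url": "https://example.com/cruises/danube"
  }
}
</script>
```

The price is a plain decimal with no currency symbol or thousands separator; the currency lives in its own field. An agent never has to guess how to read "£2,030".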
Common gaps to check: articles without Article markup, product pages without Product and Offer markup, contact pages without LocalBusiness or Organization markup, and FAQ content without FAQPage markup. Each gap is an opportunity for an agent to misunderstand what the page contains.
The accessibility layer
WCAG compliance and agent discoverability are not separate concerns. The convergence principle, that the techniques which make content accessible to disabled users are the same techniques that make it accessible to AI agents, means that accessibility failures are also machine readability failures.
The overlap is not coincidental. Both groups, disabled users and AI agents, lack access to visual design cues. A missing <main> element forces screen reader users to navigate the entire page to find primary content. It forces agents to do the same. Missing alt text blocks both agents and blind users. Visual-only state indicators exclude both agents and keyboard users.
The specific WCAG criteria that map directly to agent discoverability:
- WCAG 1.1.1 (Non-text Content), alt text on images. Without it, agents cannot understand visual content.
- WCAG 1.3.1 (Info and Relationships), semantic structure. Without it, agents cannot parse page hierarchy.
- WCAG 2.4.4 (Link Purpose), meaningful link text. “Click here” tells an agent nothing about destination.
- WCAG 4.1.1 (Parsing), valid HTML. Malformed markup breaks machine parsers. (This criterion was retired in WCAG 2.2, but valid markup remains a baseline for machine readers.)
Most sites audited are missing explicit state. Form validation errors display as visual color changes, checkout progress shows via CSS-animated steppers, and button states indicate loading with spinners. None of this state appears in HTML attributes where agents can read it. State exists visually but not semantically.
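What DOM-reflected state looks like in practice, a sketch using standard ARIA attributes; the field names and booking reference are placeholders:

```html
<!-- Validation state as attributes, not just a red border -->
<label for="email">Email</label>
<input id="email" type="email" required
       aria-invalid="true" aria-describedby="email-error">
<p id="email-error" role="alert">Enter a valid email address.</p>

<!-- Loading state as an attribute, not just a spinner -->
<button type="submit" aria-busy="true" disabled>Processing…</button>

<!-- Persistent outcome, still readable after any animation ends -->
<p role="status">Booking confirmed. Reference: ABC-123.</p>
```

Each attribute does double duty: screen readers announce it, and agents parsing the DOM can read the same state without interpreting pixels.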
A WCAG audit of your site is simultaneously an MX audit. Errors in the accessibility report are errors in your machine experience. They are the same problems. One implementation serves both audiences.
Evaluating agent-readiness scores
A growing number of tools will give your site an agent-readiness score. Two you may encounter are Cloudflare’s isitagentready.com and Fern’s Agent Score, powered by the Agent-Friendly Documentation Spec (afdocs). Both are worth knowing about. Both have a structural limitation worth understanding before you act on their output.
Each tool measures compliance with standards that its creator built. isitagentready checks for Cloudflare infrastructure signals, .well-known endpoints that Cloudflare’s own tooling reads. afdocs checks for the Fern-authored specification. The same site received a score of 33 from one tool and 100 from the other without any changes being made. Neither score was wrong, exactly. They were just measuring different things.
Analysis of real agent traffic logs reinforces the point. None of the .well-known endpoints that isitagentready checks for received requests from coding agents in production traffic, despite the server receiving substantial agent visits. The standards exist. Adoption at scale has not followed yet.
This matters practically because acting on these scores can lead you to invest in vendor-specific infrastructure before the foundational layers are solid. An agent card at /.well-known/agent-card.json is not useful if agents cannot reliably read your served HTML. A Cloudflare MCP server card does not help if your llms.txt is absent.
The order of investment the audits in this post describe is the right one: semantic HTML, Schema.org coverage, llms.txt, then service-description protocols as your use case warrants. Third-party scores are useful input. They are not a substitute for an independent audit grounded in what agents actually do with your site.
What this means in practice
A site that has addressed all of these layers (permissive robots.txt, descriptive llms.txt with an HTML equivalent, an agent card for its services, semantic HTML, Schema.org JSON-LD, and WCAG-compliant content) is as visible to AI agents as a well-optimized site is to search engines.
A site that has addressed none of them is invisible to the growing class of agents that act on behalf of users, regardless of how good its content is or how strong its search engine ranking. Unlike humans, who persist through bad UX and can be won back, agents provide no analytics visibility and offer no second chance; they route to wherever the content is readable and explicit.
Most of this work is the same structured, semantic, accessible content practice that good web development has always recommended; what is new is the urgency. As agent-mediated discovery becomes a standard part of how people find and use services, the cost of these gaps grows proportionally.
MX: The Handbook sets out the full framework for designing content that serves both human and machine audiences, across all of these layers, from document metadata to site-level discoverability. MX: The Protocols covers the technical specifications, templates, and phased implementation in detail.
Tom Cranstoun is the Machine Experience Authority and founder of the MX community. His book MX: The Handbook is available now. He consults on MX strategy through CogNovaMX Ltd.