What AI Crawlers See When They Can't Run Your JavaScript

In the first two posts of this series I kept arriving at the same fix and then walking past it. The Common Crawl post said the content that survives a training pipeline is the content that is actually in your HTML. The robots.txt post said that only Googlebot renders JavaScript, while every major AI crawler fetches the raw HTML and processes that and nothing more. Both ended on a version of the same sentence: if it is not in the raw HTML response, it does not exist for the machine.

This post is about that sentence. It is the one that decides whether any of the rest matters.

It closes a three-part series on how AI systems read the web. "Your Site Is Already Training AI Models" covered the archive most models train on, "What Most robots.txt Guides Get Wrong About AI Crawlers" covered the live crawlers and what robots.txt controls, and this one covers what those crawlers actually read when they arrive.

The Asymmetry That Decides Everything

Googlebot renders JavaScript. It loads your page, runs your scripts, waits for the DOM to settle, and indexes what a person would see. As far as I know, it is the only major crawler that does.

GPTBot, ClaudeBot, CCBot, OAI-SearchBot, PerplexityBot: none of them run your JavaScript. They make an HTTP request, they get back whatever your server sends in that first response, and that is the page. No script execution, no waiting, no second pass. The raw HTML is not a step on the way to your page. For these crawlers, it is your page.

So the question that matters is simple, and most teams have never asked it: what is in your HTML before a single line of JavaScript runs?

See Your Site the Way a Crawler Does

You do not need special tools. You need to look at the response your server actually sends, not the page your browser assembles from it.

The fastest check is curl:

curl -sL https://example.com/your-page | grep -iE '<h1|<title|application/ld\+json'

If your headline, your title, and your structured data come back, a crawler can see them. If the command returns an empty <title>, no <h1>, and no JSON-LD, the crawler sees none of them either, no matter how complete the page looks in your browser.
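If you would rather script that check than eyeball grep output, here is a minimal Python sketch of the same idea using only the standard library. The names `RawHtmlCheck`, `check_raw_html`, and `fetch_raw_html`, and the `raw-html-check/0.1` user-agent string, are made up for illustration. It only reports whether a non-empty title, a non-empty h1, and a non-empty JSON-LD block are present in the HTML it is given; it does not judge their quality.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class RawHtmlCheck(HTMLParser):
    """Records whether <title>, <h1>, and JSON-LD carry content in raw HTML."""

    def __init__(self):
        super().__init__()
        self.found = {"title": False, "h1": False, "json_ld": False}
        self._open = None  # which tracked element we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._open = tag
        elif tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            if script_type == "application/ld+json":
                self._open = "json_ld"

    def handle_endtag(self, tag):
        if tag in ("title", "h1", "script"):
            self._open = None

    def handle_data(self, data):
        # Only non-whitespace text counts: an empty <title></title> stays False.
        if self._open and data.strip():
            self.found[self._open] = True


def check_raw_html(html: str) -> dict:
    """Inspect an HTML string exactly as received; nothing is executed."""
    parser = RawHtmlCheck()
    parser.feed(html)
    parser.close()
    return parser.found


def fetch_raw_html(url: str, user_agent: str = "raw-html-check/0.1") -> str:
    """Fetch the server's response body only (hypothetical helper, no JS runs)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `check_raw_html(fetch_raw_html("https://example.com/your-page"))` returns a small dictionary of three booleans. Any `False` in it is something a non-rendering crawler does not receive.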

The browser equivalent is View Source, not Inspect. Inspect shows the rendered DOM after JavaScript has run; View Source shows the raw response. If your content is not in View Source, you are looking at what every non-rendering crawler gets.

The starkest test of all: turn JavaScript off in your browser and reload. What remains is, near enough, what GPTBot and ClaudeBot read.

Where the Content Goes Missing

A handful of common patterns hide content from machines that do not render. Each is invisible in normal use, because your browser papers over all of them.

The client-side-rendered shell. A single-page app often ships HTML that is almost empty:

<body>
  <div id="root"></div>
  <script src="/static/bundle.js"></script>
</body>

The text, the headings, the structured data, all of it arrives later, written into that <div> by JavaScript. A person sees a full page. A crawler sees an empty box and a script it will not run.

Here is the same page when the server renders it:

<body>
  <main>
    <h1>Analytics for e-commerce teams</h1>
    <p>A self-contained dashboard that needs no SQL knowledge.</p>
    <script type="application/ld+json">
    { "@context": "https://schema.org", "@type": "Product", "name": "Widget Pro" }
    </script>
  </main>
</body>

Same app, same framework, different output. The second version carries its meaning in the first response.
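One rough way to tell the two apart automatically is to count the visible words in the raw response. The sketch below is a heuristic, not a standard: `visible_text`, `looks_like_empty_shell`, and the five-word threshold are all assumptions chosen for illustration, and you would tune the threshold for your own pages.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script>, <style>, and <template> elements."""

    SKIP = {"script", "style", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the readable text present in the HTML as sent by the server."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_like_empty_shell(html: str, min_words: int = 5) -> bool:
    """Heuristic (assumed threshold): too few visible words means a CSR shell."""
    return len(visible_text(html).split()) < min_words
```

Run against the two snippets above, the first is flagged as a shell and the second is not, because the heading and paragraph are already in the markup.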

JSON-LD injected by JavaScript. This one catches careful teams. You added complete structured data, you tested it, it passed. But a tag manager or a client-side schema plugin injected it after load. Googlebot renders, so Googlebot sees it; every AI crawler fetches raw HTML, so none of them do. Your structured data is real for search and absent for AI. The Common Crawl post makes the case for complete JSON-LD; this is the catch that quietly cancels it.
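To confirm that your structured data is in the server's response rather than injected later, you can parse the JSON-LD straight out of the raw HTML. This is a minimal sketch with hypothetical names (`JsonLdExtractor`, `json_ld_types`). It only handles top-level objects and arrays, so treat it as a starting point rather than a full validator.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").strip().lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            self.blocks.append("".join(self._buffer))


def json_ld_types(html: str) -> list:
    """Return the @type of each top-level JSON-LD object that parses cleanly."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    types = []
    for raw in parser.blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # a malformed block is skipped rather than guessed at
        items = doc if isinstance(doc, list) else [doc]
        for item in items:
            if isinstance(item, dict) and "@type" in item:
                types.append(item["@type"])
    return types
```

An empty list from the raw response, on a page where a rendering-based testing tool reports valid schema, is the signature of JSON-LD that was injected after load.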

Client-set metadata. Title, meta description, canonical link, and Open Graph tags written by JavaScript after load have the same problem. The crawler reads the server's version, which is too often a generic placeholder.
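The same raw-response check works for head metadata. Here is a small sketch, again with hypothetical names (`HeadMetadata`, `raw_metadata`, `missing_metadata`) and an assumed list of required keys, that pulls the title, meta description, canonical link, and Open Graph tags out of the HTML as the server sent it.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collects title, meta description, canonical URL, and og:* tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        if tag == "title" and "title" not in self.meta:
            # Only the first <title> counts; later ones (e.g. inside SVG) are ignored.
            self._in_title = True
            self.meta["title"] = ""
        elif tag == "meta":
            key = (a.get("name") or a.get("property") or "").lower()
            if key == "description" or key.startswith("og:"):
                self.meta[key] = a.get("content", "").strip()
        elif tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.meta["canonical"] = a.get("href", "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data


def raw_metadata(html: str) -> dict:
    """Return the metadata found in the HTML string, with whitespace trimmed."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    return {key: value.strip() for key, value in parser.meta.items()}


def missing_metadata(html, required=("title", "description", "canonical", "og:title")):
    """List the required keys (an assumed set) that are absent or empty."""
    meta = raw_metadata(html)
    return [key for key in required if not meta.get(key)]
```

A presence check cannot tell a real title from a placeholder, so it is still worth reading the values `raw_metadata` returns. A title of "React App" on every page passes the check and still tells a crawler nothing.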

Lazy-loaded and infinite-scroll content. Anything that loads on scroll, on click, or by a fetch after the page settles is content the crawler never triggers. It does not scroll. It does not click. It reads the first response and moves on.

What Each Rendering Choice Means for AI

The fix has a name, but it helps to see where each common approach lands.

| Approach | What the crawler gets | AI-readable? |
| --- | --- | --- |
| Client-side rendering (CSR) | An empty shell plus a script bundle | No |
| Server-side rendering (SSR) | Complete HTML in the first response | Yes |
| Static site generation (SSG) | Complete HTML, pre-built | Yes |
| SSR with hydration | Complete HTML, then JS enhances it | Yes, if the server HTML is complete |
| Streaming or incremental SSR | Complete HTML, sent in pieces | Yes, once the meaningful HTML is in the response |
| Dynamic rendering (bot-only prerender) | Pre-rendered HTML, but only for detected bots | Works, with caveats below |

The takeaway is simple. If the meaningful content and metadata are in the HTML the server sends, the approach works for AI. The frameworks people reach for (Next.js, Nuxt, Astro, SvelteKit, Remix) all offer an SSR or static mode that produces complete first-response HTML. The framework is not the point. The output is. A static site built by hand passes this test as easily as the newest meta-framework.

Do Not Solve It by Cloaking

There is a tempting shortcut: detect the crawler by its user-agent and serve it a special pre-rendered version while humans keep the client-side app. This is dynamic rendering, and Google's own documentation describes it as a workaround, not a long-term solution.

Two problems. First, it is fragile. You are maintaining a second rendering path and a bot list that goes stale the moment a new crawler appears, and the robots.txt post showed how many there now are. Second, it shades into cloaking: serving materially different content to bots than to people, which crosses from workaround into something search guidelines treat as a violation. The line is whether the bot and the human get the same content. Keep them the same and you are fine. Let them drift and you are exposed.
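If you are already running dynamic rendering, one way to watch for drift is to compare the words in the pre-rendered bot version against a rendered snapshot of what a person sees. The sketch below assumes you have both as HTML strings. Capturing the rendered snapshot needs a real browser and is outside this example. The function names and any alert threshold you pick are assumptions, not an established metric.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _Words(HTMLParser):
    """Collects lowercase words outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.lower().split())


def words_of(html: str) -> list:
    parser = _Words()
    parser.feed(html)
    parser.close()
    return parser.words


def content_parity(bot_html: str, human_rendered_html: str) -> float:
    """Return a 0..1 similarity of the visible words in the two versions."""
    matcher = SequenceMatcher(
        None, words_of(bot_html), words_of(human_rendered_html), autojunk=False
    )
    return matcher.ratio()
```

A score near 1.0 means the two audiences get the same words even if the markup differs. A score that falls over time is the drift described above, and a reason to look at both versions side by side.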

Server-side rendering sidesteps the whole problem. Every crawler, every person, every screen reader gets the same complete HTML. There is no second path to maintain and nothing to keep in sync. The honest fix is also the simpler one.

The Accessibility Overlap Is Not a Coincidence

A page that works without JavaScript is a page that works for a screen reader, for a slow connection, for a browser extension, and for an AI agent. These are the same property seen from different angles: the meaning is in the markup, not locked behind script execution. The first post noted that Google's new Agentic Browsing checks put accessibility-tree integrity alongside llms.txt. This is why. What makes a page legible to an agent is, structurally, what makes it legible to a person using assistive technology. Build for one and you have built for both.

The Test That Settles It

You do not need to audit your whole stack. You need one answer: are the content and metadata you care about present in the raw HTTP response, before any JavaScript runs?

Fetch the page with curl. Read the source. Turn JavaScript off and reload. If your headline, your body text, your structured data, and your meta tags are all there, you are readable. If they appear only after the scripts run, every AI crawler in this series is reading a hollow version of your page, and so is everything trained on it.

Where MX Fits

MX is metadata that travels in the carrier: structured data, explicit meta tags, the source frontmatter embedded in the page itself. All of it lives in the HTML. If your HTML is an empty shell waiting for JavaScript, there is nothing for that metadata to live in, and nothing for a machine to read. Server-side rendering is not an MX feature. It is the precondition for every machine-readable thing MX asks you to add. The page has to exist in the response before its metadata can mean anything.

This is also one of the first things an MX audit checks. It fetches each page the way a non-rendering crawler does, compares that against the rendered version, and tells you exactly which content and which structured data are present for a browser but missing for a machine. It is the gap you cannot see from your own screen, because your own screen runs the JavaScript.

If you want to know what AI actually reads when it visits your site, start with an audit and I will show you the difference between what you publish and what a machine receives.