What AI Crawlers See When They Can't Run Your JavaScript

10 June 2026 · Tom Cranstoun · 8 min read

In the first two posts of this series I kept arriving at the same fix and then walking past it. The Common Crawl post said that what survives a training pipeline is whatever is actually in your HTML. The robots.txt post said only Googlebot renders JavaScript, and every other major AI crawler fetches the raw HTML and processes that alone. Both ended on a version of the same sentence: if it isn't in the raw HTML response, it doesn't exist for the machine.

This post is about that sentence. It's the one that decides whether any of the rest matters.

It closes a three-part series on how AI systems read the web: "Your Site Is Already Training AI Models" covered the archive most models train on, "What Most robots.txt Guides Get Wrong About AI Crawlers" covered the live crawlers and what robots.txt controls, and this one covers what those crawlers actually read when they arrive.

The Asymmetry That Decides Everything

Googlebot renders JavaScript. It loads your page, runs your scripts, waits for the DOM to settle, and indexes what a person would see. As far as I know, it's the only major crawler that does.

GPTBot, ClaudeBot, CCBot, OAI-SearchBot, PerplexityBot: none of them run your JavaScript. They make an HTTP request, they get back whatever your server sends in that first response, and that's the page. No script execution, no waiting, no second pass. The raw HTML isn't a step on the way to your page. For these crawlers, it is your page.

So the question that matters is simple, and most teams have never asked it: what's in your HTML before a single line of JavaScript runs?

See Your Site the Way a Crawler Does

You don't need special tools. You need to look at the response your server actually sends, not the page your browser assembles from it.

The fastest check is curl:

curl -sL https://example.com/your-page | grep -iE '<h1|<title|application/ld\+json'

If your headline, your title, and your structured data come back, a crawler can see them. If the command returns an empty <title>, no <h1>, and no JSON-LD, the crawler can't see them either, no matter how complete the page looks in your browser.
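
If you'd rather script the check than eyeball grep output, here is a minimal sketch of the same three checks using only Python's standard library. The names check_raw_html and fetch_raw_html and the User-Agent string are my own placeholders, not part of any tool mentioned in this series.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class RawHtmlCheck(HTMLParser):
    """Collects the three signals the curl command greps for."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.h1_count = 0
        self.json_ld_count = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "h1":
            self.h1_count += 1
        elif tag == "script":
            # A valueless type attribute comes back as None, hence the "or".
            if (attrs.get("type") or "").lower() == "application/ld+json":
                self.json_ld_count += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def check_raw_html(html):
    """Report what is present in the markup before any script runs."""
    parser = RawHtmlCheck()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "h1_count": parser.h1_count,
        "json_ld_count": parser.json_ld_count,
    }


def fetch_raw_html(url):
    """Fetch the first response only: no rendering, no second pass."""
    # Hypothetical User-Agent, chosen for this sketch.
    request = Request(url, headers={"User-Agent": "raw-html-check/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling check_raw_html(fetch_raw_html("https://example.com/your-page")) gives you a small dictionary. An empty title and two zeros mean the same thing as an empty grep.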

The browser equivalent is View Source, not Inspect. Inspect shows the rendered DOM after JavaScript has run; View Source shows the raw response. If your content isn't in View Source, you're looking at what every non-rendering crawler gets.

The starkest test of all: turn JavaScript off in your browser and reload. What remains is, near enough, what GPTBot and ClaudeBot read.

Where the Content Goes Missing

A handful of common patterns hide content from machines that don't render. Each is invisible in normal use, because your browser papers over all of them.

The client-side-rendered shell. A single-page app often ships HTML that's almost empty:

<body>
  <div id="root"></div>
  <script src="/static/bundle.js"></script>
</body>

The text, the headings, the structured data, all of it arrives later, written into that <div> by JavaScript. A person sees a full page. A crawler sees an empty box and a script it won't run.

Here is the same page when the server renders it:

<body>
  <main>
    <h1>Analytics for e-commerce teams</h1>
    <p>A self-contained dashboard that needs no SQL knowledge.</p>
    <script type="application/ld+json">
    { "@context": "https://schema.org", "@type": "Product", "name": "Widget Pro" }
    </script>
  </main>
</body>

Same app, same framework, different output. The second version carries its meaning in the first response.
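
A rough way to tell those two responses apart in a script is to count the words a non-rendering reader would find outside script and style elements. This sketch uses Python's standard html.parser. The function names and the ten-word threshold are illustrative choices of mine, not a standard.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Gathers text that sits outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the readable text of the raw response, whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_like_empty_shell(html, min_words=10):
    """True when the response carries fewer than min_words readable words.

    min_words is an arbitrary cut-off for this sketch; tune it per site.
    """
    return len(visible_text(html).split()) < min_words
```

The client-side shell above yields no readable words at all. The server-rendered version yields its headline and paragraph, twelve words, so it passes.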

JSON-LD injected by JavaScript. This one catches careful teams. You added complete structured data, you tested it, it passed. But a tag manager or a client-side schema plugin injected it after load. Googlebot renders, so Googlebot sees it; every AI crawler fetches raw HTML, so none of them do. Your structured data is real for search and absent for AI. The Common Crawl post makes the case for complete JSON-LD; this is the catch that quietly cancels it.
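
To find out which side of that line your structured data falls on, pull the JSON-LD out of the raw response rather than the rendered page. A minimal sketch, assuming the standard library only; json_ld_in_raw_html is a name I made up for it.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Captures the body of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # a list while inside a JSON-LD script, else None

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


def json_ld_in_raw_html(html):
    """Return the parsed JSON-LD items found in the markup as served.

    A block that is present but not valid JSON is reported as None.
    """
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    items = []
    for block in collector.blocks:
        try:
            items.append(json.loads(block))
        except json.JSONDecodeError:
            items.append(None)
    return items
```

An empty list on a page that passes a rendered structured-data test is the signature of the problem: the data exists, but only after a script has run.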

Client-set metadata. Title, meta description, canonical link, and Open Graph tags written by JavaScript after load have the same problem. The crawler reads the server's version, which is too often a generic placeholder.
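
The same approach shows you the server's version of those tags. This sketch reads the title, meta description, canonical link, and Open Graph properties from the raw markup; server_metadata and the shape of the dictionary it returns are assumptions of mine.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Reads the metadata a non-rendering crawler would find."""

    def __init__(self):
        super().__init__()
        self.found = {"title": "", "description": None, "canonical": None, "og": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = attrs.get("content")
            if (attrs.get("name") or "").lower() == "description":
                self.found["description"] = content
            prop = attrs.get("property") or ""
            if prop.startswith("og:"):
                self.found["og"][prop] = content
        elif tag == "link":
            # rel can hold several space-separated tokens.
            if "canonical" in (attrs.get("rel") or "").lower().split():
                self.found["canonical"] = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.found["title"] += data


def server_metadata(html):
    """Return the metadata present before any script has run."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    parser.found["title"] = parser.found["title"].strip()
    return parser.found
```

If the title comes back as "Loading..." or the description as None, that placeholder is what the crawler records, whatever your JavaScript sets a moment later.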

Lazy-loaded and infinite-scroll content. Anything that loads on scroll, on click, or by a fetch after the page settles is content the crawler never triggers. It doesn't scroll and it doesn't click. It reads the first response and moves on.
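
One quick spot check for this pattern: pick a few phrases you can see in the browser after scrolling or clicking, then ask whether they appear anywhere in the raw response. A crude sketch, assuming regex tag-stripping is good enough for the purpose; missing_from_raw_html is a hypothetical helper name.

```python
import re
from html import unescape


def missing_from_raw_html(raw, phrases):
    """Return the phrases that never appear in the raw response text."""
    # Crude tag stripping: fine for a spot check, not a real HTML parser.
    text = unescape(re.sub(r"(?s)<[^>]+>", " ", raw))
    haystack = " ".join(text.split()).lower()
    return [p for p in phrases if " ".join(p.split()).lower() not in haystack]


raw = (
    '<h2>Reviews</h2>'
    '<div id="reviews" data-src="/api/reviews"></div>'
    '<button>Load more</button>'
)
print(missing_from_raw_html(raw, ["Reviews", "Great product, would buy again"]))
```

Here the heading is in the response but the review text is not, so only the second phrase is reported missing. The check does not skip script bodies, so a phrase that appears only inside inline JSON still counts as present. Treat a clean result as a hint rather than proof.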

What Each Rendering Choice Means for AI

The fix has a name, but it helps to see where each common approach lands.

Approach	What the crawler gets	AI-readable?
Client-side rendering (CSR)	An empty shell plus a script bundle	No
Server-side rendering (SSR)	Complete HTML in the first response	Yes
Static site generation (SSG)	Complete HTML, pre-built	Yes
SSR with hydration	Complete HTML, then JS enhances it	Yes, if the server HTML is complete
Streaming or incremental SSR	Complete HTML, sent in pieces	Yes, once the meaningful HTML is in the response
Dynamic rendering (bot-only prerender)	Pre-rendered HTML, but only for detected bots	Works, with caveats below

The headline is the simple one. If the meaningful content and metadata are in the HTML the server sends, the approach works for AI. The frameworks people reach for (Next.js, Nuxt, Astro, SvelteKit, Remix) all offer an SSR or static mode that produces complete first-response HTML. The framework isn't the point. The output is. A static site built by hand passes this test as easily as the newest meta-framework.

Do Not Solve It by Cloaking

There's a tempting shortcut: detect the crawler by its user-agent and serve it a special pre-rendered version while humans keep the client-side app. This is dynamic rendering, and Google describes it as a stopgap, not a recommended long-term setup.

Two problems. First, it's fragile. You're maintaining a second rendering path and a bot list that goes stale the moment a new crawler appears, and the robots.txt post showed how many there now are. Second, it shades into cloaking: serving materially different content to bots than to people, which crosses from workaround into something search guidelines treat as a violation. The line is whether the bot and the human get the same content. Keep them the same and you're fine. Let them drift and you're exposed.
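
If you already run dynamic rendering, you can watch for that drift. The sketch below compares two HTML strings: the pre-rendered markup your bot path serves, and the rendered DOM a person ends up with (saved from the browser, for example). It reports words that appear on one side only. The function names and the word-set comparison are simplifications I've chosen for illustration; a real check would also want to compare structure and metadata.

```python
import re
from html import unescape


def text_of(raw):
    """Readable text of an HTML string, with scripts and styles removed."""
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw)
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return " ".join(unescape(no_tags).split())


def content_drift(html_for_bot, html_for_person):
    """Words that only one audience receives; both lists empty means parity."""
    bot_words = set(text_of(html_for_bot).lower().split())
    person_words = set(text_of(html_for_person).lower().split())
    return {
        "only_bot": sorted(bot_words - person_words),
        "only_person": sorted(person_words - bot_words),
    }


bot = "<h1>Analytics for e-commerce teams</h1><p>Free forever</p>"
person = "<h1>Analytics for e-commerce teams</h1><p>Free trial</p>"
print(content_drift(bot, person))
```

Two empty lists mean both audiences got the same words. Anything in either list is content that one audience receives and the other doesn't, which is the point at which a workaround starts to look like cloaking.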

Server-side rendering sidesteps the whole problem. Every crawler, every person, every screen reader gets the same complete HTML. There's no second path to maintain and nothing to keep in sync. The honest fix is also the simpler one.
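
Stripped of any framework, server-side rendering is just building the complete document before it leaves the server. Here is a deliberately tiny sketch in plain Python, not how Next.js or any other framework does it internally; render_page and its parameters are invented for the example.

```python
import json
from html import escape


def render_page(headline, summary, product_name):
    """Build the full document on the server: one response for every reader."""
    json_ld = json.dumps(
        {"@context": "https://schema.org", "@type": "Product", "name": product_name}
    )
    # Stop a stray "</script>" inside the data from closing the element early.
    json_ld = json_ld.replace("<", "\\u003c")
    return (
        "<!doctype html>\n"
        '<html lang="en">\n'
        f"<head><title>{escape(headline)}</title></head>\n"
        "<body>\n"
        "  <main>\n"
        f"    <h1>{escape(headline)}</h1>\n"
        f"    <p>{escape(summary)}</p>\n"
        f'    <script type="application/ld+json">{json_ld}</script>\n'
        "  </main>\n"
        "</body>\n"
        "</html>\n"
    )


print(render_page(
    "Analytics for e-commerce teams",
    "A self-contained dashboard that needs no SQL knowledge.",
    "Widget Pro",
))
```

The headline, the body text, and the structured data are all in the string the server returns, and there is no user-agent check anywhere in it. JavaScript can still enhance the page afterwards; it just isn't responsible for the meaning.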

The Accessibility Overlap Is Not a Coincidence

A page that works without JavaScript works for a screen reader, for a slow connection, for a browser extension, and for an AI agent. These are the same property seen from different angles: the meaning is in the markup, not locked behind script execution. The first post noted that Google's new Agentic Browsing checks put accessibility-tree integrity alongside llms.txt. This is why. What makes a page legible to an agent is, structurally, what makes it legible to a person using assistive technology. Build for one and you have built for both.

The Test That Settles It

You don't need to audit your whole stack. You need one answer: are the content and metadata you care about present in the raw HTTP response, before any JavaScript runs?

Fetch the page with curl. Read the source. Turn JavaScript off and reload. If your headline, your body text, your structured data, and your meta tags are all there, you're readable. If they appear only after the scripts run, every AI crawler in this series is reading a hollow version of your page, and so is everything trained on it.

Where MX Fits

MX is metadata that travels in the carrier: structured data, explicit meta tags, the source frontmatter embedded in the page itself. All of it goes into the HTML. If your HTML is an empty shell waiting for JavaScript, there's no carrier for that metadata and nothing for a machine to read. Server-side rendering isn't an MX feature. It's the precondition for every machine-readable thing MX asks you to add. The page has to exist in the response before its metadata can mean anything.

This is also one of the first things an MX audit checks. It fetches each page the way a non-rendering crawler does, compares that against the rendered version, and tells you exactly which content and which structured data are present for a browser but missing for a machine. It's the gap you can't see from your own screen, because your browser runs the JavaScript.

If you want to know what AI actually reads when it visits your site, start with an audit and I'll show you the difference between what you publish and what a machine receives.