
Why LLMs Do Not Execute JavaScript (But Google Does)

If you have noticed that AI assistants struggle with modern single-page applications, you might assume they just have not invested in JavaScript rendering yet. That is not quite right. The real reason reveals something more interesting about how LLMs acquire knowledge versus how search engines index the web.

The difference is not about technology. It is about purpose. Once you see it, you realise you already have a new class of users visiting your website: machines.

Machines are users too

Your website has always had both human and machine visitors. Search engine crawlers have been users for decades. That population now also includes:

  • AI training pipelines (Common Crawl)
  • AI browsers (Claude in Chrome, Arc)
  • AI agents accomplishing tasks
  • Browser extensions extracting data
  • Integration tools

These machine users have different constraints from human users:

  • They cannot execute JavaScript.
  • They cannot consume updates that happen too rapidly.
  • They need explicit context rather than visual cues.
  • They visit once and cache what they see.

Machine Experience is treating these users as first-class and designing for them using technology we already have.

How LLMs actually learn about the web

LLMs do not build their knowledge by visiting your website. They train on datasets like Common Crawl, a corpus built by a simple scraper that periodically grabs the HTML of billions of pages. No JavaScript execution. No browser rendering. Just the raw HTML text.

When Common Crawl encounters your React or Vue application, it gets the skeleton, the bare <div id="app"></div> and your JavaScript bundle references. That is it. No rendered content, no populated data, no information about what your site does.
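
Concretely, the entire payload the scraper receives from a typical single-page app might be no more than this (a representative sketch; file names vary):

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Loading...</title>
</head>
<body>
  <!-- Everything meaningful is rendered client-side -->
  <div id="app"></div>
  <script src="/assets/bundle.js"></script>
</body>
</html>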

But here is the thing: Common Crawl is not trying to capture your current data. It is trying to understand what your site is.

The snapshot problem

Consider a stock-tracking website. If Google visits it, Googlebot renders the JavaScript and indexes the current stock prices. That makes sense for search: someone querying "AAPL stock price" wants today's value, and Google needs to index current state.

But what would an LLM do with that snapshot? Train on the fact that Apple's stock was £187.42 on 15 March 2024? That is worthless knowledge. By the time the model is deployed, that price is historical noise.

Even if you server-rendered your stock tracker, you would just be feeding Common Crawl different snapshots. If your prices update every second, you would be generating server-rendered pages continuously: massive effort, no benefit. Common Crawl would catch one snapshot anyway, containing one moment's worth of data that is immediately outdated.

The same applies to weather sites, countdown timers, calendar applications, live sports scores, anything where the specific values change constantly.

What Common Crawl actually wants

Common Crawl wants context and structure. It wants to understand:

  • This is a financial website.
  • It tracks technology stocks.
  • It has company profiles and analysis sections.
  • It provides market commentary.

It does not care what Apple's stock price is right now. It cares that this site tracks stock prices.

When you server-render your Vue application, you are not helping Common Crawl capture your dynamic data. You are helping it understand your site's purpose and structure. The content in your HTML provides context: navigation labels, section headings, explanatory text, metadata.

A client-side rendered app gives Common Crawl almost nothing to work with. A server-rendered version gives it the information architecture, the content categories, the relationships between sections, the material that actually matters for understanding what the site does.
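
For illustration, here is the kind of server-rendered HTML that carries that information architecture, using the stock-tracker example from above (names and labels are invented for the sketch):

<body>
  <header>
    <h1>TechStock Tracker</h1>
    <nav>
      <a href="/stocks">Technology Stocks</a>
      <a href="/profiles">Company Profiles</a>
      <a href="/analysis">Analysis</a>
      <a href="/commentary">Market Commentary</a>
    </nav>
  </header>
  <main>
    <p>Real-time tracking and analysis of technology-sector equities.</p>
  </main>
</body>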

Why Google takes a different approach

Google visits sites on a schedule. More important sites get crawled more frequently. It renders JavaScript because its business is returning current results for search queries. When someone searches for something, Google needs to show what exists now, not what existed when Common Crawl last passed by.

That is a fundamentally different goal from building training data. Google indexes current state for retrieval. Common Crawl captures structure and context for understanding.

Google creates snapshots too: public-facing versions of sites without logins or session state, the kind of thing a first-time visitor would see. But those snapshots feed a search index that gets updated regularly, not a language model trained once on historical data.

The Machine Experience angle

This is where Machine Experience becomes relevant. If you want AI systems to understand your site, give them context and structure, not dynamic values.

Server-side rendering helps because it puts your information architecture into the HTML: your navigation structure, content hierarchy, section purposes, metadata, all the material that helps a scraper understand what your site is about.

But you do not need to server-render every dynamic value. If your countdown timer shows "23 days, 4 hours, 17 minutes", Common Crawl does not need that precision. It needs to understand "this site has event-countdown functionality."

If your stock tracker shows live prices, Common Crawl does not need those specific numbers. It needs to understand "this is a financial site focused on technology-sector equities."

What matters for AI consumption

For sites that AI systems need to understand (documentation, product information, company websites, technical references), think about what knowledge you want to convey:

Not this: The current price is £187.42
But this: We provide real-time stock market data for technology companies

Not this: Event starts in 23 days, 4 hours
But this: We help people track important dates and deadlines

Not this: Today's temperature is 18°C
But this: We provide weather forecasts and historical climate data

The static content (your explanatory text, navigation labels, product descriptions, documentation) is what needs to be in the HTML. The dynamic values that change constantly are not useful training data anyway.

Learning from accessibility: screen readers face the same problem

This challenge is not new. Screen readers have dealt with dynamic content for years, and the accessibility community developed solutions that apply directly to Machine Experience.

A screen reader is a non-visual consumer of content designed for visual, instantaneous perception. An AI system scraping a page is a non-interactive consumer of content designed for interactive, real-time engagement. Both face the same fundamental problem: content that updates too rapidly to consume passively.

How screen readers handle dynamic content

Developers mark sections that update with ARIA live regions:

<div aria-live="polite" aria-atomic="true">
  23 days, 4 hours, 17 minutes remaining
</div>

The aria-live attribute has three values:

  • off: do not announce updates (the user navigates when they want the current value)
  • polite: announce when the user is idle
  • assertive: interrupt immediately
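
In practice the choice maps to urgency. An illustrative sketch:

<!-- polite: a status update the user hears when idle -->
<div aria-live="polite">Draft saved</div>

<!-- assertive: an error the user must hear immediately -->
<div aria-live="assertive">Session expired, please log in again</div>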

The countdown-timer problem

If you set aria-live="polite" on a second-by-second countdown, the screen reader would constantly interrupt: "59 seconds, 58 seconds, 57 seconds." Completely unusable.

The solution: mark it aria-live="off". Let users navigate to the timer when they want the current value, rather than forcing continuous updates they cannot keep up with.
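
Concretely, the countdown markup might look like this (an illustrative sketch; role="timer" in fact defaults to aria-live="off", the explicit attribute just makes the intent visible):

<!-- Second-by-second updates: announce nothing, let users
     navigate here when they want the current value -->
<div role="timer" aria-live="off">
  23 days, 4 hours, 17 minutes remaining
</div>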

The parallel is exact

  • Screen reader users choose when to check the timer by navigating to it.
  • AI systems need to choose when to query live data rather than trusting a snapshot.

Both need signals about what updates too rapidly to consume passively.

This validates the MX approach. We are not inventing something new. We are applying proven accessibility patterns to machine consumers. The web already has mechanisms for marking dynamic content. MX extends that thinking.

Common Crawl: how training data actually works

There is a lot of folklore about how Common Crawl operates. Getting the reality straight matters for Machine Experience.

What Common Crawl actually does

Common Crawl is not an AI agent. Its bot, CCBot, is a scraper that visits HTML pages, checks robots.txt first, and only fetches a page if crawling is allowed. The relevant facts for site owners (a combined robots.txt example follows the list):

  • robots.txt is honoured. Add a User-agent: CCBot block with Disallow: / and CCBot will stop crawling the site. It re-checks robots.txt periodically, so the change takes effect on the next pass.
  • Crawl-delay is honoured. Raising it slows the crawl rate for your domain.
  • Sitemaps are used. CCBot reads any Sitemap URL announced in robots.txt.
  • Identification is verifiable. CCBot runs from a published list of IP ranges with reverse DNS under crawl.commoncrawl.org, so you can confirm a request is genuinely from Common Crawl rather than a spoofer.
  • ML opt-out signals are observed. Common Crawl treats the Robots Exclusion Protocol as one of the ways website owners can state whether their content should be part of datasets used for machine learning.
  • Retrospective removal happens on request. When publishers like the NYT and Danish media groups asked for their content to be removed from past crawls and blocked future access via robots.txt, Common Crawl planned to comply, though its executive director warned that removing archived material threatens the open web.
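
Put together, a robots.txt exercising these controls might look like this (illustrative values):

# Slow CCBot down rather than blocking it
User-agent: CCBot
Crawl-delay: 10

# To opt out of future crawls entirely, use instead:
# User-agent: CCBot
# Disallow: /

Sitemap: https://example.com/sitemap.xml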

The MX caveat

robots.txt is a voluntary signalling mechanism, not a legal enforcement mechanism. It expresses preferences, not permissions. CCBot respects it. Not every bot calling itself a crawler does. For sites you care about, verify the traffic by IP and reverse DNS rather than trusting the user-agent string alone, and assume that any rule designed to keep content out of AI training is honoured by the well-behaved scrapers and ignored by the rest.
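
Verifying the traffic is a standard forward-confirmed reverse DNS check. A minimal Node.js sketch (error handling omitted; the crawl.commoncrawl.org suffix is the one Common Crawl publishes):

import { promises as dns } from 'node:dns';

// Verify that an IP claiming to be CCBot really resolves under
// crawl.commoncrawl.org, then confirm the hostname maps back to the IP.
async function isGenuineCCBot(ip) {
  const hostnames = await dns.reverse(ip);
  const host = hostnames.find(h => h.endsWith('.crawl.commoncrawl.org'));
  if (!host) return false;

  // Forward-confirm: the claimed hostname must resolve to the same IP.
  const addresses = await dns.resolve(host);
  return addresses.includes(ip);
}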

The llms.txt problem stands separately

None of the robots.txt behaviour above helps a crawler find an llms.txt file. CCBot follows robots.txt, reads sitemaps, and indexes HTML. A raw .txt file with no entry in your sitemap is invisible to it. That is the problem the next section addresses.

The llms.txt problem

The llms.txt file is a proposed convention for declaring how LLMs should use your content. The problem: Common Crawl will not find it.

Why not?

  • It is not HTML (Common Crawl primarily harvests HTML).
  • It is not in your sitemap (typically).
  • There is no standard discovery mechanism.
  • It is a .txt file that machines have no reason to look for.

The Machine Experience solution: keep llms.txt as it is, serve it as HTML for Common Crawl

Two changes are enough. Do not move the file. Do not ship a second file. Keep the canonical /llms.txt exactly as authoring tools, MCP clients, and humans expect it. Change what the edge serves and what your sitemap lists.

1. Wrap /llms.txt as HTML at the edge.

The source-of-truth file on disk stays raw markdown so any tool fetching /llms.txt as text still works. A Cloudflare Worker (or equivalent) intercepts the request for that one URL, fetches the raw content, and serves it back wrapped in a minimal HTML document with Content-Type: text/html. The content sits in a <pre> block, unchanged apart from escaping the handful of characters HTML would otherwise parse as markup. Common Crawl now treats it as an HTML page and indexes it.

// Minimal llms.txt wrapper for a Cloudflare Worker.
export default {
  async fetch(request) {
    const url = new URL(request.url);
    if (url.pathname === '/llms.txt') {
      // Fetch the raw markdown from the origin.
      const content = await fetch('https://your-origin.com/llms.txt')
        .then(r => r.text());

      // Escape characters HTML would parse as markup, so markdown
      // containing <, >, or & survives intact inside the <pre> block.
      const escaped = content
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');

      const html = `<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>llms.txt</title>
</head>
<body>
<pre>${escaped}</pre>
</body>
</html>`;

      return new Response(html, {
        headers: { 'Content-Type': 'text/html; charset=utf-8' }
      });
    }
    // Everything else passes through untouched.
    return fetch(request);
  }
};

2. Add /llms.txt to sitemap.xml.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/llms.txt</loc>
    <lastmod>2026-01-24</lastmod>
  </url>
</urlset>

Sitemap inclusion is what tells Common Crawl the URL exists at all. HTML wrapping is what makes it index the content once it arrives.

The full recipe, including the production version of the wrapper (which also injects JSON-LD, OG, and Twitter metadata into the wrapper <head> so the page carries proper agent-facing signals), is in the companion post: Why llms.txt Isn't Working, and How to Fix It.

Why this matters

This is Machine Experience thinking. Do not assume proposed conventions will work just because they have been proposed. Understand how the actual infrastructure operates (Common Crawl scrapes HTML from sitemaps), and design accordingly. Keep what tooling expects. Reshape only what the crawler needs.

llms.txt and robots.txt as ephemeral operational files

An important distinction: llms.txt and robots.txt are not static sovereign data. They are ephemeral operational instructions that change as your site evolves.

llms.txt is ephemeral:

  • You update it every time you add a page to your sitemap.
  • It changes when you reorganise content.
  • It evolves as you refine what machines should know about your site.
  • It reflects current site structure, not historical state.

robots.txt is ephemeral:

  • You modify it when visitor patterns change.
  • You update it when you discover unwanted crawling.
  • It changes as you add new sections or retire old ones.
  • It reflects current operational needs, not permanent rules.

Both are operational configuration files, not content. A snapshot of robots.txt from January 2025 does not tell you about the site's state in January 2026. These files document "how to interact with this site right now". That is ephemeral by definition.

YAML frontmatter for metadata

Machine Experience books practise what they teach. The missing piece with llms.txt is metadata. Machines need context about the file itself. Add YAML frontmatter:

---
title: "LLM Usage Guidelines"
description: "Instructions for AI systems using this site"
version: "2.1.0"
modified: "2026-01-24"
update-frequency: "weekly"
ephemeral: true
reason: "Updated with each sitemap change"
author: "Tom Cranstoun"
site: "https://example.com"
---

# Markdown Model

This site is optimised for LLM understanding...

This provides everything machines need:

  • What the file is (title, description)
  • When it was updated (modified, version)
  • How often it changes (update-frequency)
  • Whether to trust snapshots (ephemeral: true)
  • Why it is ephemeral (reason)
  • Who maintains it (author)
  • What site it describes (site)

YAML frontmatter is already understood by static site generators, markdown processors, and increasingly by AI systems. It is machine-readable metadata using existing technology.

Do not just create llms.txt and hope machines understand its purpose. Give them the metadata they need to use it correctly.
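
To make the consumption side concrete, here is a minimal, dependency-free sketch of how a machine user might split that frontmatter from the body and read the flags. It handles only the simple key: value form shown above; a real pipeline would use a proper YAML parser:

// Split simple YAML frontmatter from an llms.txt body and read the
// ephemerality signals. A sketch only.
function readLlmsTxt(raw) {
  const match = raw.match(/^---\n([\s\S]*?)\n---\n([\s\S]*)$/);
  if (!match) return { meta: {}, body: raw };

  const meta = {};
  for (const line of match[1].split('\n')) {
    if (!line.includes(':')) continue;
    const [key, ...rest] = line.split(':');
    meta[key.trim()] = rest.join(':').trim().replace(/^"|"$/g, '');
  }
  return { meta, body: match[2] };
}

const example = `---
title: "LLM Usage Guidelines"
ephemeral: true
update-frequency: "weekly"
---

This site is optimised for LLM understanding...`;

const { meta } = readLlmsTxt(example);
console.log(meta.ephemeral);           // "true": do not trust old snapshots
console.log(meta['update-frequency']); // "weekly": re-fetch on this cadence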

The implication

If Common Crawl scraped your llms.txt in March and you updated it in April, the training data contains stale instructions. The YAML frontmatter explicitly signals this with ephemeral: true and update-frequency: weekly. Machines know they should check for updates, not cache one version forever.

This is another reason to treat machines as active users, not one-time scrapers. They need current operational instructions, not historical snapshots. And they need metadata telling them how to handle those instructions.

The practical outcome

Client-side rendering is not inherently bad for AI consumption. A React app with good semantic HTML, clear navigation, descriptive text, and proper metadata can be understood by Common Crawl, provided the meaningful content is in the initial HTML.

The problem is when all your content is generated client-side. When the only HTML is <div id="app"></div>, there is nothing for a scraper to find.

Server-side rendering helps not because it captures your dynamic data, but because it ensures your structure and context exist in the HTML that scrapers actually see.

The MX perspective: serving machine users

Machine Experience is not about inventing new standards. It is about treating machines as users and applying the same UX thinking we use for human users.

Who are your users? Humans and machines.

What are their constraints?

  • Humans with screen readers cannot consume rapid visual updates.
  • Machines scraping pages cannot execute JavaScript or consume second-by-second changes.

What technology already exists to serve them?

  • ARIA already marks content for non-visual consumption.
  • Meta tags already provide page-level context.
  • Semantic HTML already structures information.

Use it.

Machine Experience separates two types of dynamic content based on whether snapshots provide meaningful information:

  • Sovereign dynamic data: current state that is meaningful (product specifications, documentation versions, policy updates); snapshots are valid knowledge.
  • Ephemeral dynamic data: values that change so rapidly that snapshots are meaningless (stock prices updating every second, live scores, countdown timers).

For sovereign data, expose the state. For ephemeral data, signal that snapshots cannot be trusted.

This matters for every machine user visiting your site right now:

  • Training pipelines (Common Crawl) understand the site's purpose but do not train on ephemeral values.
  • AI browsers (Claude in Chrome, Arc) know which data to cache versus query fresh.
  • Browser extensions understand what data is reliable versus fleeting.
  • AI agents see when to use live APIs instead of page snapshots.
  • Scraping tools distinguish structural information from time-sensitive data.
  • Integration frameworks know whether cached responses are valid.

Using existing technology to serve machine users

We already have the tools. Screen readers taught us how to mark content for non-visual users. ARIA live regions tell assistive technology which content updates too rapidly to announce continuously. Machine users have the same constraint: they cannot consume second-by-second updates.

Use the same technology.

Page-level signal, meta tag (existing technology):

<!-- All content is ephemeral -->
<meta name="mx:dynamic" content="true"
      data-reason="Stock prices update every second">

<!-- Mixed content, some ephemeral, some sovereign -->
<meta name="mx:dynamic" content="partial"
      data-reason="Live scores update every second">

<!-- All content is sovereign (or omit the tag entirely) -->
<meta name="mx:dynamic" content="false">

Element-level signal, ARIA (existing technology):

When you declare content="partial", ARIA attributes tell machine users exactly what updates:

<head>
  <meta name="mx:dynamic" content="partial"
        data-reason="Stock prices update every second">
</head>

<body>
  <!-- Sovereign data, no ARIA needed -->
  <h1>Stock Market Dashboard</h1>
  <p>Real-time tracking of technology-sector equities</p>

  <!-- Ephemeral data, aria-live marks it -->
  <div aria-live="off">
    <span class="ticker">AAPL</span>
    <span class="price">£187.42</span>
    <span class="change">+2.3%</span>
  </div>
</body>

This serves both classes of non-visual users:

  • Screen reader users navigate to the price when they want the current value.
  • Machine users know these specific values are ephemeral snapshots.
  • Both get context about update frequency from data-reason.
  • Both use the same ARIA markup.

No new technology. No additional attributes. Just applying UX thinking to machine users.

The data-reason attribute matters. It is not just a boolean flag; it is explicit, human-readable context that prevents AI hallucination.

Without this signal, an AI visiting your stock page might:

  • See "AAPL: £187.42" in a cached snapshot.
  • User asks "what is Apple's stock price?"
  • AI responds "£187.42" based on three-hour-old data.
  • That is hallucination, presenting stale information as current.

With the context:

  • data-reason="Stock prices update every second"
  • The AI knows: this is a stock tracker, these values are ephemeral, do not trust the snapshot.
  • Correct reasoning: "I need to query live data, not use this cached page."

This is core MX: give machines the context they need to reason correctly, rather than forcing them to guess. The tag provides sovereign data about the page itself, meta-information that prevents incorrect assumptions about data validity.

Do not make AI think. Give it the context.

For a pure stock tracker (all content ephemeral):

<meta name="mx:dynamic" content="true"
      data-reason="Stock prices update every second">

For a weather site (forecasts are analysis, temperatures are ephemeral):

<meta name="mx:dynamic" content="partial"
      data-reason="Temperature and conditions update hourly, forecasts updated twice daily">

For a countdown-timer page (the entire purpose is the timer):

<meta name="mx:dynamic" content="true"
      data-reason="Timer values change continuously based on target date">

For a news article (static once published):

<!-- No tag needed, absence means content="false" -->
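
On the consuming side, an AI browser or extension acting on this convention might reason like this. A browser-context sketch; mx:dynamic is the proposal described above, not an adopted standard:

// Decide how far to trust a page snapshot, based on the proposed
// mx:dynamic meta tag and element-level aria-live markup.
function assessSnapshot(doc) {
  const tag = doc.querySelector('meta[name="mx:dynamic"]');
  if (!tag) {
    // Absence means content="false": everything is sovereign.
    return { trust: 'full', reason: null };
  }

  const mode = tag.getAttribute('content');
  const reason = tag.getAttribute('data-reason');

  if (mode === 'true') {
    // Whole page is ephemeral: query a live source instead.
    return { trust: 'none', reason };
  }
  if (mode === 'partial') {
    // Trust the structure, but treat aria-live regions as stale.
    const ephemeral = [...doc.querySelectorAll('[aria-live]')];
    return { trust: 'structure-only', reason, ephemeral };
  }
  return { trust: 'full', reason };
}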

Whether browser extensions, AI browsers, training pipelines, or agent frameworks adopt this convention depends on broader take-up of MX principles. The discipline proposes it because it addresses a real need: helping machines distinguish data they can trust from a snapshot versus data they need to query live or ignore entirely.

This maintains MX principles while acknowledging that not all dynamic content has the same temporal validity. The page structure, navigation, metadata, and explanatory text remain useful for understanding. The specific numbers at any given moment require different handling.

Looking forward

LLMs do not execute JavaScript because they do not learn about the web by visiting sites. They learn from datasets created by simple scrapers. Those scrapers need HTML that explains what your site is, not what values it happens to show at any given moment.

Google executes JavaScript because its business model requires indexing current state for search results. That is a different use case with different economics.

This distinction changes how you think about making sites AI-accessible. It is not about rendering every dynamic value server-side. It is about making sure your site's purpose, structure, and context exist in the HTML, the part that actually gets scraped, archived, and used for training.

Machine Experience means recognising machines as a new class of users and using existing technology to serve them properly.

For pages with sovereign data, make that state visible. For pages with ephemeral values, use meta tags and ARIA to signal what updates too rapidly to trust from snapshots.

The technology already exists. ARIA already marks content for non-visual consumption. Meta tags already provide page-level context. You are not inventing new standards; you are treating machines as users and applying the same UX thinking you use for humans.

That is Machine Experience: understanding what your machine users need, and using the tools you already have to serve them.
