Your Site Is Already Training AI Models

10 June 2026 · Tom Cranstoun · 19 min read

Most content teams are watching the wrong thing.

They add an llms.txt file, argue about whether to block GPTBot, and check whether their pages turn up in ChatGPT answers. Meanwhile, every month, their pages are quietly harvested by a system that shapes what large language models know about the world, and almost nobody tunes for it.

That system is Common Crawl. You're already in it. The only question worth asking is whether you're in it well.

I work on a practice called Machine Experience, or MX. The short version: it's the work of making anything you publish - a page, a PDF, an image, a product listing - readable by the machines that now consume it, so no machine has to guess what you meant. Common Crawl is where a great deal of that guessing begins.

This post is about the archive, and it opens a three-part series on how AI systems read the web. The second post, "What Most robots.txt Guides Get Wrong About AI Crawlers", is about the live crawlers (GPTBot, ClaudeBot, PerplexityBot and the rest) and what robots.txt does and doesn't control. The third, "What AI Crawlers See When They Can't Run Your JavaScript", is about what those crawlers actually read when they arrive. Read the three together.

What Common Crawl Is

Common Crawl is a non-profit that has archived the public web since 2008. A new crawl lands roughly every month, fetches several billion pages, and stores them in a public archive on AWS. It's free. Anyone can use it.

Here's why it matters: nearly every major model has trained on it. GPT, Claude, Gemini, Llama, the models now answering questions about your industry, your products, and your company, all learned from Common Crawl data.

You're already in there. What's in question is the quality of what the crawl found.

Three Files, Three Jobs

For each page it crawls, Common Crawl writes three files. Knowing the difference tells you what actually matters.

WARC files hold the raw HTTP response: the complete HTML, every header, every byte. This is the archive. If your page was fetched, the whole thing is here, including every <meta> tag and every <script type="application/ld+json"> block.

WAT files hold computed metadata extracted from the WARC. These are JSON records carrying your HTTP headers, every <meta> tag, every link, your canonical URLs, and your script references. No body text. No raw HTML. Just the structured signals. WAT files are a fraction of the size of WARC files and far easier to process at scale, which is exactly why researchers start here.

WET files hold extracted plain text. The HTML is stripped, the structure is gone. It's your words in a flat file, with nothing around them.

The order matters. WAT is the most accessible layer for pulling structured signal at scale. JSON-LD embedded in <head> lands in the WARC and gets separately mined by Web Data Commons, which publishes schema.org data extracted from billions of crawled pages. Plain text ends up in WET: useful for learning language, close to useless for understanding entities and the relationships between them.
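To make the WAT layer concrete, here's a minimal Python sketch that pulls the head signals out of one WAT-style record. The sample record is invented for illustration, and the nesting (Envelope, Payload-Metadata, HTTP-Response-Metadata, HTML-Metadata, Head) follows the WAT layout as I understand it, so check it against a real record before relying on it. The function name head_signals is hypothetical.

```python
import json

# Illustrative WAT-style record, trimmed to the fields used here.
# Assumption: the key nesting mirrors real WAT records; verify against one.
SAMPLE_WAT_RECORD = json.loads("""
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "https://example.com/widget-pro"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {
          "Head": {
            "Title": "Widget Pro",
            "Metas": [
              {"name": "description", "content": "Analytics dashboard for e-commerce teams."},
              {"property": "og:title", "content": "Widget Pro"}
            ],
            "Link": [
              {"path": "LINK@/href", "rel": "canonical", "url": "https://example.com/widget-pro"}
            ]
          }
        }
      }
    }
  }
}
""")


def head_signals(record: dict) -> dict:
    """Pull the title, meta tags, and canonical URL out of one WAT record."""
    envelope = record.get("Envelope", {})
    head = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Head", {})
    )
    # Index meta tags by name (or Open Graph property), lowercased.
    metas = {}
    for meta in head.get("Metas", []):
        key = meta.get("name") or meta.get("property")
        if key and "content" in meta:
            metas[key.lower()] = meta["content"]
    # First <link rel="canonical">, if the page declared one.
    canonical = next(
        (link.get("url") for link in head.get("Link", []) if link.get("rel") == "canonical"),
        None,
    )
    return {
        "url": envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI"),
        "title": head.get("Title"),
        "metas": metas,
        "canonical": canonical,
    }


signals = head_signals(SAMPLE_WAT_RECORD)
print(signals["metas"].get("description"))
```

Nothing here touches HTML. That's the appeal of WAT: the signals are already parsed, so a missing description or canonical shows up as a missing key.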

Being in the Crawl Is Not Being in the Training Data

This is the part most people miss. Every major training pipeline filters Common Crawl hard before a model ever sees it. Google's C4 dataset keeps only lines that end in terminal punctuation and removes duplicates. HuggingFace's FineWeb runs a quality classifier. ROOTS filters by language and quality score. Dolma, RedPajama, The Pile, each applies its own passes.

Pages that survive these filters share a profile: complete sentences, clear structure, explicit semantic markup, unique content. Pages that get dropped are thin, ambiguous, badly structured, or near-duplicates of something else.

There's nothing accidental about this. Quality filtering in LLM pipelines selects for the same properties that make content useful to a human reader. Clarity isn't only good practice. It's the survival criterion.
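A toy version of this kind of filter shows why navigation fragments and thin pages fall out. The sketch below is in the spirit of C4's line-level heuristics, not a copy of any real pipeline: the two thresholds and the function names are my assumptions, and production filters add language detection, deduplication, and classifiers on top.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 3       # assumption, in the spirit of C4's line filter
MIN_SENTENCES_PER_PAGE = 3   # assumption; real pipelines choose their own threshold


def keep_line(line: str) -> bool:
    """Keep a line only if it ends like a sentence and has a few words."""
    stripped = line.strip()
    return stripped.endswith(TERMINAL_PUNCTUATION) and len(stripped.split()) >= MIN_WORDS_PER_LINE


def filter_page(text: str):
    """Return the surviving text, or None if the page would be dropped."""
    kept = [line.strip() for line in text.splitlines() if keep_line(line)]
    cleaned = "\n".join(kept)
    # Rough sentence count: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    if len(sentences) < MIN_SENTENCES_PER_PAGE:
        return None
    return cleaned


thin_page = "Home\nProducts\nContact us\nWidget Pro"
clear_page = (
    "Widget Pro is an analytics dashboard for e-commerce teams.\n"
    "It needs no SQL knowledge.\n"
    "Buy now\n"
    "Plans start at 49 pounds per month."
)
print(filter_page(thin_page))
print(filter_page(clear_page))
```

The thin page disappears entirely; the clear page survives with its "Buy now" fragment removed.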

What the Crawler Does, and Does Not, Do

It reads your sitemap. CCBot supports the Sitemap Protocol and follows any sitemap declared in your robots.txt. If your pages aren't in the sitemap, they depend entirely on inbound links to be found. That's a real problem for new or isolated content.

It respects robots.txt. Block CCBot and your pages drop out of the open training dataset. Common Crawl also runs a formal opt-out registry for site owners who want permanent removal. This part matters: blocking CCBot only affects Common Crawl. GPTBot, ClaudeBot, PerplexityBot, and Googlebot run entirely independently. They don't read Common Crawl's opt-out list. Each one must be blocked separately in robots.txt.

The table shows what each major crawler actually does:

Crawler	Operated by	Purpose	Uses the CC archive?
CCBot	Common Crawl	Builds the open archive	It's CC itself
GPTBot	OpenAI	ChatGPT training and browsing	No, a separate crawl
ClaudeBot	Anthropic	Claude training	No, a separate crawl
PerplexityBot	Perplexity	Live retrieval for answers	No, live only
Googlebot	Google	Search index and AI Overviews	No, Google's own
Bingbot	Microsoft	Search index and Copilot	No, Microsoft's own

Blocking CCBot removes you from the open research network. It doesn't make you invisible to any commercial AI product.
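You can check what a given robots.txt really does to each crawler with Python's standard urllib.robotparser. The sketch below uses an invented robots.txt that blocks CCBot only; the file contents, the example URL, and the crawler_access helper are illustrative assumptions, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: blocks CCBot only and declares a sitemap.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
"""

CRAWLERS = ["CCBot", "GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot", "Bingbot"]


def crawler_access(robots_txt: str, url: str) -> dict:
    """Map each crawler name to whether robots.txt lets it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in CRAWLERS}


access = crawler_access(ROBOTS_TXT, "https://example.com/widget-pro")
for agent, allowed in access.items():
    print(f"{agent:15} {'allowed' if allowed else 'blocked'}")
```

Only CCBot comes back blocked. Every other crawler falls through to the wildcard rule, which is the point: each one has to be named separately if you want it kept out.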

It archives robots.txt itself as a dedicated dataset, kept separate from the main crawl, for researchers building robots.txt parsers.

It crawls text/plain files (a small fraction of the most recent crawl), so a plain-text llms.txt listed in your sitemap can be archived, though only as part of that small fraction. But Common Crawl is a passive archive. It stores the bytes; it doesn't read llms.txt to change how it crawls.

llms.txt, Honestly

llms.txt was proposed as a machine-readable file that gives AI systems a short overview of a site and points them at the content that matters most. For a long time the honest assessment was simple: the idea is reasonable, the adoption isn't there yet.

That has shifted, though not in the way most coverage suggests.

In May 2026, Google added an Agentic Browsing category to Lighthouse 13.3. It scores how well a site is built for machine interaction, using pass or fail checks rather than the traditional 0-to-100 number. Among the checks: presence of llms.txt, WebMCP integration, accessibility-tree integrity, and layout stability (CLS).

Google's stated reason: "Without llms.txt, agents may spend more time crawling the site to understand its high-level structure and primary content."

Here's the tension worth naming. Google Search has said plainly that llms.txt isn't needed for AI search visibility. Lighthouse now flags its absence as a failing audit. The most plausible reading is that Google is future-proofing for agentic browsing, where a software agent navigates your site on its own, without yet committing to it as a ranking signal for AI Overviews.

For Common Crawl specifically, nothing has changed. It'll store your llms.txt if it finds it, and it won't read it to change how it crawls. The training pipeline is unaffected either way.

There's a transport detail that decides whether llms.txt reaches the crawl at all, and it's the piece most implementations get wrong. Common Crawl archives HTML pages from your sitemap reliably, but it captures only a small fraction of plain-text files. An llms.txt served as plain text is therefore the less certain route into the archive. The fix, and what I do on this site, is to serve llms.txt as an HTML page that wraps the plain-text content in a <pre> block. The crawler ingests it like any other HTML page, so the full text lands in the archive; the <pre> renders it faithfully as the readable plain-text file a person expects.

The monospaced, code-block look is the point, not a fault: it's a rendering I chose, not anything Common Crawl did, and it gives you HTML reach and plain-text readability in one file. I walk through the same content before and after the wrap, with working Cloudflare Worker code, in "Why llms.txt Probably Isn't Working". As llms.txt moves from a nicety to a file people expect to find, getting the transport right is what separates a file that reaches the crawl from one that quietly doesn't.
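The wrap itself is small. This is a minimal Python sketch of the idea, not the Cloudflare Worker code from the other post; the function name wrap_llms_txt and the page skeleton are my own assumptions. The one step that matters is escaping, so that any <, > or & in the plain text displays literally inside the <pre>.

```python
import html


def wrap_llms_txt(plain_text: str, title: str = "llms.txt") -> str:
    """Wrap plain-text llms.txt content in a minimal HTML page with one <pre> block."""
    # Escape &, < and > so the text renders exactly as written.
    escaped = html.escape(plain_text, quote=False)
    return (
        "<!DOCTYPE html>\n"
        '<html lang="en">\n'
        "<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"<pre>{escaped}</pre>\n"
        "</body>\n"
        "</html>\n"
    )


print(wrap_llms_txt("# Example Site\n> A short briefing for agents.\n"))
```

Whatever serves the result needs to send it with a text/html content type. Otherwise the file is still plain text as far as a crawler is concerned, just with tags in it.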

So the practical position is: add llms.txt. It's now an official Lighthouse check, which means it shows up in performance reports, agency audits, and client dashboards. The cost is one file. The benefit is a pass on an increasingly visible check, plus genuine use for the agentic browsing that's coming regardless.

For a worked example, mx.allabout.network/llms.txt shows what a complete file looks like: an explicit content policy, a link to a full-site corpus (llms-full.txt) for agents that want everything in one fetch, and direct links to the core technical specifications. That's the pattern to follow, a machine-readable briefing rather than a minimal placeholder.

There's a larger point buried in that Lighthouse category. The Agentic Browsing checks put accessibility-tree integrity and layout stability alongside llms.txt. Google has now formally treated accessibility signals as agent-readiness signals: what makes a page usable by an agent is structurally the same as what makes it usable by a person with assistive technology. That idea is, at last, an official one.

What Actually Works

If you want your content well represented in AI systems, in training data and in live retrieval both, the stack looks like this.

Sitemaps declared in robots.txt are the main lever for crawl coverage. If CCBot can't find your pages, they don't get archived.

JSON-LD with schema.org vocabulary is preserved in the WARC, mined by Web Data Commons, and used in downstream training datasets. Tens of millions of domains now carry schema.org markup. The difference that counts is completeness and accuracy, not mere presence.

Here's what a complete, useful JSON-LD block looks like for a product page, set against what most sites actually publish:

<!-- What most sites publish: technically valid, practically useless -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget Pro"
}
</script>

<!-- What Common Crawl, and the models downstream, can actually use -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget Pro",
  "description": "A self-contained analytics dashboard for e-commerce teams, requiring no SQL knowledge.",
  "brand": {
    "@type": "Brand",
    "name": "Acme Corp"
  },
  "offers": {
    "@type": "Offer",
    "price": "49.00",
    "priceCurrency": "GBP",
    "availability": "https://schema.org/InStock",
    "url": "https://example.com/widget-pro"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.7",
    "reviewCount": "312"
  }
}
</script>

The first version tells a model that a product called "Widget Pro" exists. The second tells it what the product does, who makes it, what it costs, whether it's in stock, and how buyers rate it. When someone asks an AI "what's a good analytics dashboard under 50 pounds", only the second version carries enough signal to be relevant. The first is noise.

The same holds for every page type: Article, FAQPage, LocalBusiness, Event, Person. Partial markup beats none; complete markup is the target.
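Completeness is easy to check mechanically. Here's a rough Python sketch that lists which fields a Product block is missing. The expected-field list simply mirrors the fuller example above; it's my assumption of what "complete" means for this page type, not an official schema.org or Common Crawl requirement, and the helper name is hypothetical.

```python
import json

# Assumption: this field list mirrors the fuller Product example in this post.
# It is not an official schema.org or Common Crawl requirement.
EXPECTED_PRODUCT_FIELDS = {
    "name": [],
    "description": [],
    "brand": ["name"],
    "offers": ["price", "priceCurrency", "availability", "url"],
    "aggregateRating": ["ratingValue", "reviewCount"],
}


def missing_product_fields(json_ld_text: str) -> list:
    """List expected fields (as dotted paths) absent from a Product JSON-LD block."""
    data = json.loads(json_ld_text)
    missing = []
    for field, subfields in EXPECTED_PRODUCT_FIELDS.items():
        value = data.get(field)
        if not value:
            missing.append(field)
            continue
        # Nested objects are checked one level deep.
        if isinstance(value, dict):
            missing.extend(f"{field}.{sub}" for sub in subfields if not value.get(sub))
    return missing


minimal = '{"@context": "https://schema.org", "@type": "Product", "name": "Widget Pro"}'
print(missing_product_fields(minimal))
```

Run against the minimal block, it reports everything except the name as missing. Run against the fuller block, the list comes back empty.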

Complete, explicit meta tags are captured straight into the WAT files, queryable at scale with no HTML parsing. Open Graph, canonical, description, robots directives, all of it counts.

A complete, accurate sitemap is the entry point. CCBot reads the sitemap declared in your robots.txt and uses it to find pages. Without it, you depend on other sites linking to yours. New pages, internal tools, and standalone files with no external inbound links simply don't get crawled.

A broken sitemap is worse than it sounds, because the failure is silent. CCBot fetches the sitemap URL, and if it gets malformed XML, a redirect it won't follow, or an HTML error page returned with a 200 status, it quietly falls back to link discovery. No error reaches you. You would never know unless you checked the Common Crawl index directly:

https://index.commoncrawl.org/CC-MAIN-2026-21-index?url=yourdomain.com/*&output=json

If key pages are missing or show a stale crawl date, a broken or incomplete sitemap is the likely cause. It's the kind of problem you can't see from the outside, which is most of what an MX audit exists to find.

Of the optional fields a sitemap can carry, <lastmod>, <changefreq>, <priority>, only <lastmod> matters. CCBot uses it to decide whether to re-crawl a changed page. Set it accurately. The other two are largely ignored by every major crawler.
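Because the failure is silent, it's worth testing the sitemap body yourself. The Python sketch below checks the cases named above: malformed XML, an HTML error page served with a 200 status, and entries without <lastmod>. It assumes a plain urlset sitemap that you've already fetched as text; a sitemap index file has a different root element and would need its own branch. The check_sitemap name and the shape of the report are mine.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def check_sitemap(body: str) -> dict:
    """Report whether a fetched sitemap body is usable XML and how complete it is."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError as err:
        return {"ok": False, "reason": f"not well-formed XML: {err}"}
    # An HTML error page can be well-formed XML, so check the root element too.
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return {"ok": False, "reason": f"unexpected root element: {root.tag}"}
    urls = root.findall(f"{{{SITEMAP_NS}}}url")
    with_lastmod = [u for u in urls if u.find(f"{{{SITEMAP_NS}}}lastmod") is not None]
    return {"ok": True, "urls": len(urls), "with_lastmod": len(with_lastmod)}


html_error_page = "<html><body><h1>Not Found</h1></body></html>"
print(check_sitemap(html_error_page))
```

The error page parses without complaint, which is exactly why a status-code check alone won't catch it. The root-element test does.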

Canonical URLs matter because Common Crawl removes duplicates aggressively. If your content sits at several URLs with no canonical, one version survives, and not always the one you would choose.

Unique, well-structured content is the quality-filter threshold. Thin pages get dropped from training pipelines no matter how good their structured data is.

Which Snapshots Models Actually Train On

Not the latest. A mix, and the range is far wider than most people assume.

C4, the dataset Google built for T5 in 2019, came from a single crawl snapshot. That was the early approach. Modern training datasets work differently. FineWeb, published by HuggingFace and now widely used, draws on 96 Common Crawl snapshots, essentially every crawl from 2013 to 2024. A Mozilla Foundation study found that a large majority of the major LLMs published between 2019 and 2023 were trained on filtered Common Crawl data, with GPT-3 deriving more than 80 percent of its training tokens from it.

Dataset	Used by	Common Crawl coverage
C4	Google T5, and others	A single snapshot, April 2019
GPT-3 training data	OpenAI	Multiple crawls; about 82% of tokens
CCNet / Pile-CC	LLaMA, EleutherAI	Multi-year
OSCAR	BLOOM, multilingual models	Multi-year
FineWeb	Widely used	96 snapshots, 2013 to 2024

This has a consequence that's easy to miss. A model trained on FineWeb may have seen your content as it stood in every crawl that captured it over a decade, not just the most recent one. A version that stayed live for years often carries more weight than a recent correction, because it appears in more snapshots.

Knowledge cutoffs aren't crawl cutoffs. If an AI states something confidently wrong about your company, something that was true in 2019 and corrected in 2021, it's usually because the 2019 version sits in far more training snapshots than the correction does. There's no mechanism for an update to travel backwards through a decade of archived crawls.
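The arithmetic is worth seeing once. This Python sketch counts how many snapshots hold the old version of a page against how many hold the correction. It assumes one crawl per month and that the page is captured in every crawl, neither of which is guaranteed; the dates are hypothetical.

```python
from datetime import date


def monthly_snapshots(start: date, end: date) -> list:
    """First-of-month dates from start to end inclusive (assumes one crawl per month)."""
    snapshots = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        snapshots.append(date(year, month, 1))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return snapshots


def snapshots_per_version(snapshots, corrected_on: date) -> dict:
    """Count snapshots that captured the old version versus the corrected one."""
    old = sum(1 for s in snapshots if s < corrected_on)
    return {"old": old, "corrected": len(snapshots) - old}


# Hypothetical page: published January 2017, corrected mid-June 2021,
# in a dataset whose last snapshot is December 2022.
counts = snapshots_per_version(
    monthly_snapshots(date(2017, 1, 1), date(2022, 12, 1)),
    corrected_on=date(2021, 6, 15),
)
print(counts)
```

Under those assumptions the old version sits in 54 snapshots and the correction in 18, a three-to-one head start for the wrong answer before any quality filtering is applied.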

Old Versions Never Go Away

Each monthly crawl is a permanent snapshot. Common Crawl doesn't update or delete old records. If your 2023 page held wrong information, an outdated price, or no structured data at all, that version still sits in the archive, and in any training dataset built from that snapshot.

When you update a page, the next crawl writes a new record in that month's archive. The old record stays exactly where it is. Both versions exist at once, in different monthly snapshots, indefinitely.

The practical effect: models aren't trained on a single crawl. Major datasets combine many snapshots spanning years. The aggregate of your content across all of them is what shapes how AI systems understand you. If you spent three years publishing thin structured data and fixed it this month, models trained on the historical data still carry the impoverished version.

You can't fix the past. But every month you publish well-structured, clearly marked-up content, a better snapshot goes in. The effect compounds in both directions: good practice accumulates forward, old neglect accumulates backward.

Setting <lastmod> accurately is the most direct lever you have. It tells CCBot the page changed and should be re-crawled in the next cycle, giving the new version its best chance of landing in the next snapshot.
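One way to keep <lastmod> honest is to tie it to a content hash rather than to the build clock. The Python sketch below is one possible approach, not a description of any particular CMS or of how this site does it: the state dictionary, which you would persist between builds, and the update_lastmod helper are both hypothetical.

```python
import hashlib
from datetime import date


def update_lastmod(state: dict, url: str, content: str, today: date) -> str:
    """Return the lastmod for url, changing it only when the content hash changes.

    state maps url -> {"hash": ..., "lastmod": ...} and is persisted between builds.
    """
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = state.get(url)
    if previous and previous["hash"] == digest:
        return previous["lastmod"]  # unchanged content keeps its old date
    state[url] = {"hash": digest, "lastmod": today.isoformat()}
    return state[url]["lastmod"]


state = {}
first = update_lastmod(state, "/pricing", "Widget Pro costs 49 pounds.", date(2026, 5, 1))
rebuild = update_lastmod(state, "/pricing", "Widget Pro costs 49 pounds.", date(2026, 6, 10))
edited = update_lastmod(state, "/pricing", "Widget Pro costs 59 pounds.", date(2026, 6, 10))
print(first, rebuild, edited)
```

A rebuild with identical content keeps the May date; only the real edit moves it to June. Hashing the rendered page would also pick up template changes, so hashing the source content is usually the safer choice.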

You Cannot Force a Re-Crawl

You can't, and this surprises most people who are used to Google Search Console.

There's no equivalent tool for Common Crawl. You can't submit a URL, request a re-crawl, or flag a correction. CCBot decides what to crawl and when, from its own internal scoring (Apache Nutch CrawlDB), weighted by inbound links, historical crawl frequency, and sitemap signals. It doesn't take requests.

The levers you do have:

<lastmod> in your sitemap is the strongest signal you control. Set it accurately when a page changes and CCBot will re-crawl it sooner in the next monthly run. The key word is accurately. If you stamp it with today's date on every build whether the content changed or not, CCBot learns to ignore it.

Inbound links raise crawl frequency. Pages that many external sites link to get revisited more often. A page nobody links to may show up only every few crawls, or never.

robots.txt hygiene means making sure CCBot isn't blocked by accident. It's an easy mistake when you add a blanket Disallow for other crawlers. If CCBot is blocked, nothing tells you. You simply fall out of future archives.

The sitemap URL in robots.txt has to be current. If your sitemap moved and robots.txt still points at the old location, CCBot won't find it, and won't tell you.

What you can't do: force an immediate re-crawl of a page, submit new URLs directly, request removal of a past snapshot (opt-out only stops future crawls), or see your place in the queue.

If you update a page today, the earliest a new version reaches Common Crawl is the next monthly crawl, and only if CCBot picks your page that cycle, which isn't guaranteed for every page every month.

The one feedback channel is the CC Index API:

https://index.commoncrawl.org/CC-MAIN-2026-21-index?url=yourdomain.com/your-page&output=json

It tells you when CCBot last visited, what HTTP status it got, and whether the page is in the current index. If a key page is missing or shows a stale date, you know there's a problem. The fix is indirect: correct the sitemap, clear any accidental block, build inbound links, and wait for the next cycle.
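If you check more than a handful of pages, it helps to script the lookup. This Python sketch builds the query URL and parses the newline-delimited JSON the index returns. The fetch itself is left out so the example runs offline, and the sample response is invented: the url, timestamp and status fields match what I'd expect from the index, but confirm them against a live response. Both helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

INDEX_BASE = "https://index.commoncrawl.org"


def index_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a CC Index API query URL for one crawl and one URL pattern."""
    query = urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_BASE}/{crawl_id}-index?{query}"


def summarise_captures(response_body: str) -> list:
    """Parse the newline-delimited JSON response into brief capture summaries."""
    captures = []
    for line in response_body.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        captures.append({
            "url": record.get("url"),
            "timestamp": record.get("timestamp"),  # YYYYMMDDhhmmss
            "status": record.get("status"),
        })
    return captures


# Invented sample response, one JSON object per line.
SAMPLE_RESPONSE = (
    '{"url": "https://example.com/a", "timestamp": "20260518093012", "status": "200"}\n'
    '{"url": "https://example.com/b", "timestamp": "20260519101500", "status": "404"}\n'
)

print(index_query_url("CC-MAIN-2026-21", "yourdomain.com/your-page"))
for capture in summarise_captures(SAMPLE_RESPONSE):
    print(capture)
```

A page that returns no records at all, or only non-200 statuses, is the one to investigate first.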

Why AI Says Wrong Things About Your Company

This is the question most teams eventually ask, and the answer almost always leads back here.

When an AI states something wrong about your company, a product you discontinued, a price you changed, a description that hasn't been accurate for three years, it isn't hallucinating in the inventive sense. It's faithfully reporting what was true in the Common Crawl snapshots it trained on.

The 2019 version of your pricing page lives in C4. The 2020 version of your about page is in Pile-CC. The 2022 version of your product description is in FineWeb. All of them, weighted by how many snapshots they appear in, shaped what the model learned about you.

Corrections don't travel backwards. A model trained before your update knows nothing about the update. It won't find it unless it's retrained, which happens rarely and unpredictably, or unless it has live retrieval and your current page is structured well enough to surface.

So the structured data and explicit metadata on your current pages do two jobs at once: they shape future training runs, and they give the clearest possible signal to live-retrieval systems that can override stale training knowledge. Both matter. Both rest on the same underlying quality.

What This Looks Like on My Own Site

The checklist above isn't theory. Here's how it's applied on this site.

llms.txt and llms-full.txt are both published, served as HTML that wraps the plain text in a <pre> block. An agent that wants a high-level briefing reads llms.txt. An agent that wants everything, every published page in one file, fetches llms-full.txt. One request, full corpus, no crawling required. That's the pattern the convention intended and almost nobody implements. Serving them as HTML gets them ingested by Common Crawl like any other page; the <pre> renders them as the clean plain-text files a reader expects. HTML reach and plain-text readability, in one file.

robots.txt declares the sitemap and allows all crawlers. User-agent: * with Allow: / means CCBot, GPTBot, ClaudeBot, PerplexityBot, and every other crawler has access. No accidental blocks. The exclusions are deliberate and limited to the book appendices and private drafts.

The content policy is stated in llms.txt, in plain words a machine can act on: "AI agents may cite, summarise, and recommend content from this site with attribution to CogNovaMX (a trading name of Digital Domain Technologies Ltd) and Tom Cranstoun." No agent has to guess whether this content is reusable.

AI contributor profiles are published. There are pages for Claude Code, Claude Sonnet 4.5, and Microsoft Copilot alongside the author profile. Declaring which AI models contributed to content is a provenance signal almost no site makes explicit.

An AI usage statement is published at /AI-USAGE.html. Not a footnote in the footer, a standalone, crawlable declaration.

The lastmod dates are real. The sitemap reflects actual content-change dates, not a "now" timestamp regenerated on every build, so CCBot can re-crawl the changed pages first.

None of this needed a new CMS, a platform change, or much engineering time. It needed decisions about which signals to make explicit, and then making them explicit.

Where MX Fits

Everything above, complete structured data, explicit metadata, accurate sitemaps, canonical signals, content that survives the quality filters, isn't just a checklist. It's an architecture. MX is the discipline of building digital experiences that are machine-readable by design: not retrofitted, not half-compliant, but clear from the ground up. The visible page is for human eyes; the metadata behind it is for the machine; both say the same thing.

There's a faster way to find out where you stand than reading your own pages and hoping. An MX audit runs every signal in this post against your live site: sitemap and robots.txt reachability, CCBot access, JSON-LD completeness, canonical coverage, llms.txt, and whether your lastmod dates are real. It tells you, page by page, where a crawler or a model would be left guessing about you, and it reports back in a form you can verify yourself rather than take on trust.

If you're wondering how your company shows up in AI systems, or whether your content is in Common Crawl at all, start with an audit and I'll walk you through what it finds.