What Most robots.txt Guides Get Wrong About AI Crawlers

In my last post I said most content teams are watching the wrong thing. They add an llms.txt, they argue about whether to block GPTBot, and they check whether they turn up in ChatGPT. This post is about the middle one: the GPTBot argument, and why it is almost always built on a wrong picture of how AI crawlers actually work.

Most robots.txt guidance treats AI crawlers as a single category. Block them, or allow them. That framing is wrong, and acting on it costs you either search visibility or control over your training data, sometimes both.

This is the second post in a three-part series on how AI systems read the web. The first, Your Site Is Already Training AI Models, covered Common Crawl, the open archive nearly every model trains on. That one was about the archive; this one is about the live crawlers and the file most people think controls them. The third, What AI Crawlers See When They Can't Run Your JavaScript, is about what those crawlers actually read once they reach your page.

Here is what the guides miss.

1. Only Googlebot Renders JavaScript

Googlebot renders JavaScript. It processes your page the way a browser would: it waits for the DOM to settle, then indexes what a user would actually see.

Every other major AI crawler (CCBot, GPTBot, ClaudeBot, PerplexityBot) does not. They fetch the raw HTML response and process that, and nothing more. If your content is rendered client-side, injected by a framework after load, or lazy-loaded on scroll, it is invisible to all of them.

This bites hardest for:

  • JSON-LD injected by JavaScript. Only the server-rendered version is safe.
  • Text loaded by fetch or XHR after the page loads.
  • Metadata populated by a CMS front-end in the browser.

The fix is server-side rendering for anything you want AI systems to see. If it is not in the raw HTML response, it does not exist for a training crawler.
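One way to test this for structured data is to look at the raw HTML the way a non-rendering crawler would and see whether any JSON-LD is already there. The sketch below uses only Python's standard library. The two sample pages and the function name jsonld_in_raw_html are made up for illustration; in practice you would feed it the body of a plain HTTP response, fetched without a browser.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append("".join(self._buffer))


def jsonld_in_raw_html(raw_html):
    """Return the JSON-LD objects present in the HTML as served, with no JavaScript run."""
    parser = JsonLdExtractor()
    parser.feed(raw_html)
    parser.close()
    found = []
    for block in parser.blocks:
        try:
            found.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # malformed JSON-LD is as good as missing
    return found


# Hypothetical pages: one server-rendered, one that injects JSON-LD in the browser.
SERVER_RENDERED = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Organization", "name": "Example"}'
    "</script></head><body><h1>Example</h1></body></html>"
)
CLIENT_RENDERED = (
    "<html><head><script>injectJsonLd()</script></head>"
    '<body><div id="root"></div></body></html>'
)

print(jsonld_in_raw_html(SERVER_RENDERED))
print(jsonld_in_raw_html(CLIENT_RENDERED))
```

The first call finds the Organization object. The second finds nothing, which is what a crawler that does not execute JavaScript would see on that page.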

2. Google Has Three User Agents, Doing Three Different Jobs

Googlebot is the main crawler. It fetches pages for the search index, renders JavaScript, and visits often. Block it and you disappear from Google Search.

Google-Extended is not a crawler at all. It is a robots.txt token that works as a data-usage directive: it tells Google not to use your content to train Gemini and related AI products. The content is still fetched, by Googlebot. You can block Google-Extended and keep Googlebot, and your search rankings are untouched.

User-agent: Google-Extended
Disallow: /

That one rule keeps you in Google Search and out of Gemini training. Most publishers do not know the option exists.
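You can sanity-check that a rule for Google-Extended leaves Googlebot alone with Python's standard urllib.robotparser. This is a rough check under one assumption: Python's matcher approximates, but is not identical to, the matching Google itself applies. The example URL and the helper name allowed are illustrative.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def allowed(robots_txt, agent, url="https://example.com/any-page"):
    """Return True if this robots.txt text lets the named agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(ROBOTS_TXT, "Googlebot"))        # True: search crawling is untouched
print(allowed(ROBOTS_TXT, "Google-Extended"))  # False: the training opt-out applies
```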

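APIs-Google is the user agent Google APIs use to deliver push notifications to applications that have subscribed to them. It has nothing to do with search indexing or AI training. Leave it alone.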

3. OpenAI Runs Three Crawlers, and One Ignores Your robots.txt

GPTBot collects training data for OpenAI's GPT models. It fully respects robots.txt.

OAI-SearchBot powers ChatGPT Search. It also respects robots.txt, but independently of GPTBot. Block GPTBot and not OAI-SearchBot and you are out of training data but still in ChatGPT's search results. Block both and you are out of both.

ChatGPT-User fires when a real person asks ChatGPT something that makes it visit your page. OpenAI's own documentation says: "because these actions are initiated by a user, robots.txt rules may not apply." It behaves the way a person with a browser would. So you cannot rely on robots.txt to block all AI access to your content.

4. Perplexity Splits the Same Way, and Then Went Further

PerplexityBot indexes content for Perplexity's search results. Not training. It respects robots.txt. Block it and you vanish from Perplexity answers.

Perplexity-User is user-triggered. When someone asks Perplexity a question and it fetches your page to answer, this is the agent that arrives. Robots.txt may not apply, and Perplexity documents that openly.

There is a harder problem here. In August 2025, Cloudflare reported that Perplexity was using stealth, undeclared crawlers: systematically changing user-agent strings and rotating IPs and ASNs to get around robots.txt and the firewall rules that site owners had put in place. Customers had blocked both declared Perplexity bots, in robots.txt and at the firewall. Perplexity was still pulling their content using unidentified crawlers.

Perplexity's position was that Perplexity-User is user-triggered and therefore exempt. Cloudflare's investigation read that as cover for automated crawling under false identities.

The practical takeaway: robots.txt is not a reliable barrier against Perplexity. If exclusion genuinely matters to you, you need server-side access controls.
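What a server-side control looks like depends on your stack, but the core decision is small: refuse a request if it declares a blocked agent or if it comes from a network you have decided to block. The sketch below is a minimal version of that decision, not a complete defense. The network ranges are reserved documentation addresses standing in for ranges you would take from your own logs or your firewall vendor, and should_block is a hypothetical helper you would call from middleware or an edge rule.

```python
import ipaddress

# Declared agents to refuse outright (substring match, case-insensitive).
BLOCKED_AGENT_TOKENS = ("perplexitybot", "perplexity-user")

# Placeholder ranges (reserved for documentation), NOT real crawler addresses.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def should_block(remote_ip, user_agent):
    """Return True if the request should get a 403 instead of content."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return True  # honest about who it is, and not welcome
    address = ipaddress.ip_address(remote_ip)
    # Catches traffic from blocked networks even when the user-agent looks like a browser.
    return any(address in network for network in BLOCKED_NETWORKS)


print(should_block("198.51.100.7", "Mozilla/5.0 (compatible; PerplexityBot/1.0)"))
print(should_block("192.0.2.44", "Mozilla/5.0 Chrome/124.0"))
print(should_block("198.51.100.7", "Mozilla/5.0 Chrome/124.0"))
```

The first two requests are blocked, one by its declared agent and one by its source network. The third passes. The user-agent check only stops crawlers that identify themselves, so the network check is the part that does the work against undeclared traffic, and it is only as good as the ranges you feed it.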

5. Anthropic Follows the Same Pattern as OpenAI

ClaudeBot is Anthropic's training crawler. It respects robots.txt. This is what shapes what Claude models know about your organization.

Claude-SearchBot powers Claude's live search and retrieval. It is separate from ClaudeBot, so you can block training without blocking retrieval, or the reverse.

Claude-User is user-triggered browsing from Claude. Same caveat as ChatGPT-User: a real person started it, so robots.txt may not apply.

The opt-out decisions for Anthropic, OpenAI, and Google are all independent of each other. Blocking one does nothing to the others.
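The training, search, and user-triggered split is easier to see in your own logs than in a vendor's documentation. Here is a rough sketch that sorts user-agent strings by purpose, using only the agent names covered so far. The sample log strings are simplified stand-ins, not the exact strings these vendors send, and matching on a substring is an assumption that will misfire on spoofed traffic.

```python
from collections import Counter

# Agent token -> (operator, purpose), as described in the sections above.
# Google-Extended is absent on purpose: it is a robots.txt token, not a
# user-agent that shows up in logs.
AGENTS = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ChatGPT-User": ("OpenAI", "user-triggered"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user-triggered"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "Claude-User": ("Anthropic", "user-triggered"),
    "Googlebot": ("Google", "search"),
}


def classify(user_agent):
    """Return (operator, purpose) for a declared AI agent, or None."""
    ua = user_agent.lower()
    for token, info in AGENTS.items():
        if token.lower() in ua:
            return info
    return None


def count_by_purpose(user_agents):
    """Tally requests by purpose; anything unrecognized counts as 'other'."""
    counts = Counter()
    for ua in user_agents:
        info = classify(ua)
        counts[info[1] if info else "other"] += 1
    return counts


# Simplified, hypothetical log entries.
SAMPLE_LOG = [
    "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; ChatGPT-User/1.0)",
    "Mozilla/5.0 (compatible; Claude-SearchBot/1.0)",
    "Mozilla/5.0 (X11; Linux x86_64) Chrome/124.0",
]

print(count_by_purpose(SAMPLE_LOG))
```

Only the "training" and "search" rows are ones robots.txt can be expected to govern. The "user-triggered" row is the traffic where, by the vendors' own wording, the rules may not apply.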

6. There Are Far More Crawlers Than the Five Everyone Names

The landscape is wider than the headline names. These are the ones worth having in your robots.txt if you are making explicit decisions:

| Crawler | Operator | Purpose | Respects robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | ChatGPT training | Yes |
| OAI-SearchBot | OpenAI | ChatGPT Search | Yes |
| ClaudeBot | Anthropic | Claude training | Yes |
| PerplexityBot | Perplexity | Perplexity answers | Yes (declared) |
| CCBot | Common Crawl | Open training archive | Yes |
| Google-Extended | Google | Gemini training directive | N/A (not a crawler) |
| Applebot-Extended | Apple | Apple Intelligence | N/A (not a crawler) |
| Amazonbot | Amazon | AWS AI / training | Yes |
| Meta-ExternalAgent | Meta | Llama training | Vendor-dependent |
| Bytespider | ByteDance | TikTok / AI training | No, masquerades as other agents |
| cohere-ai | Cohere | Enterprise AI training | Yes |
| Diffbot | Diffbot | Structured-data extraction | Varies |

Bytespider deserves a specific warning. ByteDance's crawler has been seen rotating user-agent strings to pose as Chrome, Safari, or another browser. A User-agent: Bytespider rule with Disallow: / may have no effect on the actual traffic, because the traffic is not identifying itself honestly. This is the same pattern as Perplexity's undeclared crawlers. To block ByteDance you need IP-level blocking, not robots.txt.

Applebot-Extended, like Google-Extended, is a data-usage directive rather than a separate crawler. Block it to keep your content out of Apple Intelligence while staying in Spotlight and Siri search.

7. Crawl Frequency Varies by an Order of Magnitude

Crawlers do not visit on the same schedule, and that matters for time-sensitive content:

| Crawler | Typical frequency | Trigger |
|---|---|---|
| Googlebot | Daily, for popular pages | Continuous, priority-weighted |
| PerplexityBot | Hours to days | Query-triggered plus scheduled |
| OAI-SearchBot | Days | Scheduled |
| GPTBot | Irregular, roughly monthly | Scheduled, not documented |
| ClaudeBot | Irregular | Scheduled, not documented |
| CCBot | Monthly | Scheduled crawl cycle |
| Bytespider | Unknown | Undisclosed |

The gap between Googlebot (daily) and CCBot (monthly) means a correction to your content reaches Google's index within days but may not reach the next Common Crawl training snapshot for weeks. For live-retrieval systems like Perplexity and ChatGPT, freshness is measured in hours. For training data, it is measured in months.

8. AI Crawler Traffic Is Growing Fast

Cloudflare's data for the twelve months to May 2025: overall crawler traffic grew 18 percent, while GPTBot alone grew 305 percent. About 14 percent of the top domains on the web now carry explicit AI-crawler rules in their robots.txt.

Most robots.txt files were written before AI crawlers existed. The default (no named-agent rules and a wildcard allow) lets every crawler in, including ones you may not mean to allow. That default is no longer neutral.
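A quick way to see whether your own file is still on that default is to list which AI agents it names at all. The sketch below reads robots.txt text and reports the agents from this post that it never mentions. The token list is my selection from the crawlers discussed here, not an official registry, and the function names are illustrative.

```python
# AI-related tokens discussed in this post; extend to suit your own policy.
AI_AGENT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot",
    "Google-Extended", "Applebot-Extended", "Amazonbot",
    "Meta-ExternalAgent", "Bytespider",
]


def named_agents(robots_txt):
    """Return the lowercased User-agent values that appear in the file."""
    named = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            named.add(value.strip().lower())
    return named


def unaddressed_ai_agents(robots_txt):
    """Return the AI tokens that fall through to the wildcard (or to nothing)."""
    named = named_agents(robots_txt)
    return [token for token in AI_AGENT_TOKENS if token.lower() not in named]


LEGACY_FILE = "User-agent: *\nAllow: /\n"
UPDATED_FILE = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"

print(unaddressed_ai_agents(LEGACY_FILE))   # every token: no explicit decisions made
print(unaddressed_ai_agents(UPDATED_FILE))  # every token except GPTBot
```

An empty result does not mean the decisions are right, only that they were made explicitly.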

9. You Can Check Whether a Crawler Is Real

Any user-agent string can be faked. A request claiming to be GPTBot may not be OpenAI at all. The legitimate crawlers publish their IP ranges and support reverse DNS verification:

  • GPTBot publishes its IP ranges at https://openai.com/gptbot.json.
  • Googlebot: reverse DNS should resolve to googlebot.com or google.com.
  • PerplexityBot: reverse DNS should resolve under perplexitybot.com.
  • CCBot: Common Crawl documents the crawler and the address ranges it operates from.

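For crawlers that support reverse DNS, the check is the same: take the source IP from your server logs, run a reverse DNS lookup, confirm the hostname falls under the crawler's documented domain, then run a forward DNS lookup on that hostname and confirm it resolves back to the same IP. For crawlers that publish IP ranges instead, as GPTBot does, confirm that the source IP falls inside one of the published ranges.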
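The reverse-then-forward DNS check can be sketched in a few lines of Python using the standard socket module. The function names are mine, and the two lookup parameters exist only so the logic can be exercised without touching the network. By default it calls socket.gethostbyaddr and socket.getaddrinfo. One caveat: it compares IP addresses as strings, which is fine for IPv4 but may need normalizing for IPv6.

```python
import socket


def hostname_matches(hostname, allowed_suffixes):
    """True if hostname is one of the allowed domains or a subdomain of one."""
    host = hostname.rstrip(".").lower()
    return any(host == s or host.endswith("." + s) for s in allowed_suffixes)


def _reverse(ip):
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname):
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=_reverse, forward_lookup=_forward):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    try:
        hostname = reverse_lookup(ip)
        if not hostname_matches(hostname, allowed_suffixes):
            return False  # PTR record points somewhere else entirely
        return ip in forward_lookup(hostname)  # the hostname must map back
    except OSError:
        return False  # no PTR record, or the lookup failed


# Offline demonstration with stand-in resolvers (hypothetical hostnames and IPs).
GOOGLE_DOMAINS = ["googlebot.com", "google.com"]
genuine = verify_crawler_ip(
    "66.249.66.1", GOOGLE_DOMAINS,
    reverse_lookup=lambda ip: "crawl-66-249-66-1.googlebot.com",
    forward_lookup=lambda host: {"66.249.66.1"},
)
spoofed = verify_crawler_ip(
    "203.0.113.9", GOOGLE_DOMAINS,
    reverse_lookup=lambda ip: "evil-googlebot.com",
    forward_lookup=lambda host: {"203.0.113.9"},
)
print(genuine, spoofed)  # True False
```

The suffix match requires a leading dot, so a look-alike such as evil-googlebot.com does not pass as googlebot.com. The forward lookup matters just as much: anyone who controls the reverse zone for their own IP can set a PTR record to any name they like, and only the forward confirmation catches that.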

Bytespider, by contrast, does not identify itself consistently and does not support reliable verification. Suspicious crawl volume from unverified sources is best treated as undeclared, not as a legitimate named bot.

10. The Decisions Are Independent, and Blocking a Retrieval Crawler Has a Cost

Each crawler, and each purpose, is a separate and independent decision. Blocking one does not touch any other:

| Decision | Bot to configure |
|---|---|
| Allow or block Google Search | Googlebot |
| Allow or block Google AI training | Google-Extended |
| Allow or block OpenAI training | GPTBot |
| Allow or block ChatGPT Search | OAI-SearchBot |
| Allow or block Anthropic training | ClaudeBot |
| Allow or block Perplexity answers | PerplexityBot |
| Allow or block Apple Intelligence training | Applebot-Extended |
| Allow or block Amazon AI training | Amazonbot |
| Allow or block the open training archive | CCBot |

Blocking a retrieval crawler means you disappear from that product's answers. Block PerplexityBot and people asking Perplexity questions will not see your content cited. Block OAI-SearchBot and your site drops out of ChatGPT Search. That is different from blocking a training crawler, where the cost is slower or absent representation in the model weights: less visible, but longer lasting.

Some common configurations.

Open to everything. No named-agent rules needed. User-agent: * with Allow: / covers every crawler.

Keep search and live retrieval, exit AI training only:

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

Keep search only, exit all AI, training and retrieval both:

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /
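Because each decision maps to exactly one token, a configuration like the ones above can be generated from a list of decisions rather than written by hand. This is a small sketch of that idea. The decision keys are labels I made up to mirror the decision table in this section, and the trailing wildcard group is an assumption about what you want for everything else.

```python
# Decision label (hypothetical names) -> robots.txt token, per the table above.
DECISION_TO_TOKEN = {
    "google_ai_training": "Google-Extended",
    "openai_training": "GPTBot",
    "chatgpt_search": "OAI-SearchBot",
    "anthropic_training": "ClaudeBot",
    "perplexity_answers": "PerplexityBot",
    "apple_intelligence_training": "Applebot-Extended",
    "amazon_ai_training": "Amazonbot",
    "open_training_archive": "CCBot",
}


def build_robots_txt(blocked_decisions, sitemap_url=None):
    """Build robots.txt text that blocks the chosen uses and allows the rest."""
    unknown = set(blocked_decisions) - set(DECISION_TO_TOKEN)
    if unknown:
        raise ValueError(f"unknown decisions: {sorted(unknown)}")
    groups = [
        f"User-agent: {token}\nDisallow: /"
        for decision, token in DECISION_TO_TOKEN.items()
        if decision in blocked_decisions
    ]
    groups.append("User-agent: *\nAllow: /")  # everything not named stays open
    text = "\n\n".join(groups)
    if sitemap_url:
        text += f"\n\nSitemap: {sitemap_url}"
    return text + "\n"


# Exit OpenAI and Anthropic training, keep search and live retrieval.
print(build_robots_txt({"openai_training", "anthropic_training"}))
```

Keeping the decisions in one mapping makes the yearly review a matter of reading a short list, and it makes an unknown or misspelled decision fail loudly instead of silently producing a rule that matches nothing.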

What This Means If You Are Building for AI

Opting out of training data does not make you invisible to AI. Live-retrieval crawlers will still visit your pages when users ask questions. The real question is whether the model answering has already seen a clean, structured version of your content, or is meeting it raw for the first time.

A model with good training data about your organization answers questions differently from one meeting your page cold. Both paths run on the same signals: explicit metadata, complete structured data, server-rendered content, accurate sitemaps.

robots.txt controls access. It does not control quality. Those are two different problems, and they need two different fixes.

The robots.txt You Should Actually Have

If your position is "open to all AI", which is the right call for most organizations, your robots.txt needs almost nothing:

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Add specific Disallow rules only for content you genuinely want excluded: private drafts, premium content, internal tools. Do not add rules out of vague worry about "AI scraping" without knowing which systems you are blocking and what you give up by doing it.

Review the file once a year at least. A robots.txt written in 2021 does not describe 2026.

Where MX Fits

All of this comes down to decisions about which signals to make explicit, and robots.txt is only the access layer. MX is the discipline of building digital experiences that are machine-readable by design, so no crawler, agent, or model has to guess what you meant. The access rules say who may read you; the structure decides what they actually understand. Both have to say the same thing.

There is a faster way to find out where you stand than reading your own robots.txt and hoping. An MX audit tests your live site the way these crawlers do: which agents can reach you, what robots.txt actually allows, whether your structured data and sitemaps hold up, and where a model or an agent would be left guessing about you. It reports back in a form you can verify yourself rather than take on trust.

If you want to know how your site looks to the crawlers in this post, start with an audit and I will walk you through what it finds.