What Most robots.txt Guides Get Wrong About AI Crawlers

10 June 2026 · Tom Cranstoun · 10 min read

In my last post I said most content teams are watching the wrong thing. They add an llms.txt, they argue about whether to block GPTBot, and they check if they turn up in ChatGPT. This post is about the middle one: the GPTBot argument, and why it's almost always built on a wrong picture of how AI crawlers actually work.

Most robots.txt guidance treats AI crawlers as a single category. Block them, or allow them. That framing is wrong, and acting on it costs you either search visibility or control over your training data, sometimes both.

This is the second post in a three-part series on how AI systems read the web. The first, Your Site Is Already Training AI Models, covered Common Crawl, the open archive nearly every model draws its training data from. That one was about the archive; this one is about the live crawlers and the file most people think controls them. The third, What AI Crawlers See When They Can't Run Your JavaScript, is about what those crawlers actually read once they reach your page.

Here's what the guides miss.

1. Only Googlebot Renders JavaScript

Googlebot renders JavaScript. It processes your page the way a browser would: it waits for the DOM to settle, then indexes what a user would actually see.

Every other major AI crawler (CCBot, GPTBot, ClaudeBot, PerplexityBot) doesn't. Each one fetches the raw HTML response and processes that. If your content is rendered client-side, injected by a framework after load, or lazy-loaded on scroll, it is invisible to all of them.

This bites hardest for:

JSON-LD injected by JavaScript. Only the server-rendered version is safe.
Text loaded by fetch or XHR after the page loads.
Metadata populated by a CMS front-end in the browser.

The fix is server-side rendering for anything you want AI systems to see. If it's not in the raw HTML response, it doesn't exist for these crawlers, whether they are collecting training data or retrieving pages for answers.
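A quick way to test this is to look at the raw HTML the way a non-rendering crawler would. The sketch below is a rough approximation, not a model of any vendor's actual parser: it ignores everything inside script and style tags, then reports whether a phrase appears in the remaining text and how many JSON-LD blocks the raw response carries. The function name, the sample pages, and the phrase are all made up for illustration.

```python
from html.parser import HTMLParser


class RawHtmlProbe(HTMLParser):
    """Collects visible text and counts JSON-LD blocks in unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.json_ld_blocks = 0
        self._in_script_or_style = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._in_script_or_style = True
            script_type = (dict(attrs).get("type") or "").lower()
            if tag == "script" and script_type == "application/ld+json":
                self.json_ld_blocks += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_script_or_style = False

    def handle_data(self, data):
        # Script and style bodies are code, not content a crawler can read as text.
        if not self._in_script_or_style:
            self.text_parts.append(data)


def probe_raw_html(html, phrase):
    """Report what a crawler that does not run JavaScript could find."""
    probe = RawHtmlProbe()
    probe.feed(html)
    probe.close()
    text = " ".join(" ".join(probe.text_parts).split()).lower()
    return {
        "phrase_in_raw_html": phrase.lower() in text,
        "json_ld_blocks": probe.json_ld_blocks,
    }


# Hypothetical pages: one server-rendered, one filled in by JavaScript.
SERVER_RENDERED = """
<html><head>
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body><h1>Pricing</h1><p>Plans start at 10 dollars.</p></body></html>
"""

CLIENT_RENDERED = """
<html><body><div id="root"></div>
<script>document.getElementById("root").textContent = "Plans start at 10 dollars.";</script>
</body></html>
"""

print(probe_raw_html(SERVER_RENDERED, "Plans start at"))
print(probe_raw_html(CLIENT_RENDERED, "Plans start at"))
```

To run it against a live page, fetch the URL with any HTTP client that does not execute JavaScript and pass the response body in as `html`.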

2. Google Has Three robots.txt Tokens, Doing Three Different Jobs

Googlebot is the main crawler. It fetches pages for the search index, renders JavaScript, and visits often. Block it and you disappear from Google Search.

Google-Extended isn't a crawler at all. It's a data-usage directive: a signal to Google's internal pipeline not to use your content for training Gemini and related AI products. The content is still fetched, by Googlebot. You can block Google-Extended and keep Googlebot, and your search rankings are untouched.

User-agent: Google-Extended
Disallow: /

That one rule keeps you in Google Search and out of Gemini training. Most publishers don't know the option exists.
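You can sanity-check that the rule only touches the token it names. This sketch uses Python's standard-library robots.txt parser, which is an approximation of how a well-behaved client reads the file and not Google's own implementation; the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def allowed(robots_txt, user_agent, url="https://example.com/any-page"):
    """Return True if this parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "Google-Extended"))  # False
print(allowed(ROBOTS_TXT, "Googlebot"))        # True
```

With no group naming Googlebot and no wildcard group, the parser falls back to allowing the fetch, which is why search crawling is unaffected.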

APIs-Google is the agent Google's APIs use to deliver push notifications to applications that have asked for them. It plays no part in search indexing or AI training. Leave it alone.

3. OpenAI Runs Three Crawlers, and One May Ignore Your robots.txt

GPTBot collects training data for GPT models. It fully respects robots.txt.

OAI-SearchBot powers ChatGPT Search. It also respects robots.txt, but independently of GPTBot. Block GPTBot and not OAI-SearchBot and you're out of training data but still in ChatGPT's search results. Block both and you're out of both.

ChatGPT-User fires when a real person asks ChatGPT something that makes it visit your page. OpenAI's own documentation says: "because these actions are initiated by a user, robots.txt rules may not apply." It behaves the way a person with a browser would. So you can't rely on robots.txt to block all AI access to your content.

4. Perplexity Splits the Same Way, and Then Went Further

PerplexityBot indexes content for Perplexity's search results. Not training. It respects robots.txt. Block it and you vanish from Perplexity answers.

Perplexity-User is user-triggered. When someone asks Perplexity a question and it fetches your page to answer, this is the agent that arrives. Robots.txt may not apply, and Perplexity documents that openly.

There's a harder problem here. In August 2025, Cloudflare reported that Perplexity was using stealth, undeclared crawlers: systematically changing user-agent strings and rotating IPs and ASNs to get around robots.txt and the firewall rules that site owners had put in place. Customers had blocked both declared Perplexity bots, in robots.txt and at the firewall. Perplexity was still pulling their content using unidentified crawlers.

Perplexity's position was that Perplexity-User is user-triggered and therefore exempt. Cloudflare's investigation read that as cover for automated crawling under false identities.

The practical takeaway: robots.txt isn't a reliable barrier against Perplexity. If exclusion genuinely matters to you, you need server-side access controls.

5. Anthropic Follows the Same Pattern as OpenAI

ClaudeBot is Anthropic's training crawler. It respects robots.txt. This is what shapes what Claude models know about your site.

Claude-SearchBot powers Claude's live search and retrieval. It's separate from ClaudeBot, so you can block training without blocking retrieval, or the reverse.

Claude-User is user-triggered browsing from Claude. Same caveat as ChatGPT-User: a real person started it, so robots.txt may not apply.

The opt-out decisions for Anthropic, OpenAI, and Google are all independent of each other. Blocking one does nothing to the others.
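If you want to see this split in your own logs, a small lookup table is enough to start. The sketch below maps the user-agent tokens named in this post to an operator and a purpose, then classifies a user-agent string by substring match. The sample strings are illustrative, not copied from vendor documentation, and Google-Extended is left out because it never appears in a request.

```python
from collections import Counter

# Tokens and purposes as described in this post.
AGENT_PURPOSES = {
    "GPTBot": ("OpenAI", "training"),
    "OAI-SearchBot": ("OpenAI", "search"),
    "ChatGPT-User": ("OpenAI", "user-triggered"),
    "ClaudeBot": ("Anthropic", "training"),
    "Claude-SearchBot": ("Anthropic", "search"),
    "Claude-User": ("Anthropic", "user-triggered"),
    "PerplexityBot": ("Perplexity", "search"),
    "Perplexity-User": ("Perplexity", "user-triggered"),
    "Googlebot": ("Google", "search"),
}


def classify(user_agent):
    """Return (operator, purpose) for a declared crawler, or None."""
    lowered = user_agent.lower()
    for token, label in AGENT_PURPOSES.items():
        if token.lower() in lowered:
            return label
    return None


# Illustrative user-agent strings, as they might appear in an access log.
SAMPLE_LOG = [
    "Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; Claude-User/1.0)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
]

purposes = Counter(
    label[1] if label else "unclassified"
    for label in (classify(ua) for ua in SAMPLE_LOG)
)
print(purposes)
```

This only sorts traffic that declares itself honestly. A match on the string says nothing about whether the request really came from that operator.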

6. There Are Far More Crawlers Than the Five Everyone Names

The field is broader than the headline names. These are the ones worth having in your robots.txt if you're making explicit decisions:

Crawler	Operator	Purpose	Respects robots.txt
`GPTBot`	OpenAI	ChatGPT training	Yes
`OAI-SearchBot`	OpenAI	ChatGPT Search	Yes
`ClaudeBot`	Anthropic	Claude training	Yes
`PerplexityBot`	Perplexity	Perplexity answers	Yes (declared)
`CCBot`	Common Crawl	Open training archive	Yes
`Google-Extended`	Google	Gemini training directive	N/A (not a crawler)
`Applebot-Extended`	Apple	Apple Intelligence	N/A (not a crawler)
`Amazonbot`	Amazon	AWS AI / training	Yes
`Meta-ExternalAgent`	Meta	Llama training	Vendor-dependent
`Bytespider`	ByteDance	TikTok / AI training	No, masquerades as other agents
`cohere-ai`	Cohere	Enterprise AI training	Yes
`Diffbot`	Diffbot	Structured-data extraction	Varies

Bytespider deserves a specific warning. ByteDance's crawler has been seen rotating user-agent strings to pose as Chrome, Safari, or another browser. A User-agent: Bytespider rule with Disallow: / may have no effect on the actual traffic, because the requests aren't identifying themselves as Bytespider. This is the same pattern as Perplexity's undeclared crawlers. To block ByteDance you need IP-level blocking, not robots.txt.
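IP-level blocking usually lives in a firewall or CDN, but the logic is simple enough to sketch. The version below checks a request's source address against a list of CIDR ranges. The ranges shown are reserved documentation blocks, not real ByteDance addresses; you would replace them with ranges you have identified from your own logs or a source you trust.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
# These are NOT real crawler ranges; substitute your own.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(remote_addr, networks=BLOCKED_NETWORKS):
    """Return True if remote_addr falls inside any blocked network."""
    try:
        address = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address; leave the decision to other layers.
        return False
    return any(address in network for network in networks)


print(is_blocked("192.0.2.55"))    # True
print(is_blocked("198.51.100.7"))  # False
```

The user-agent string never enters the decision, which is the point: a crawler posing as Chrome is still blocked if it arrives from a listed range.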

Applebot-Extended, like Google-Extended, is a data-usage directive rather than a separate crawler. Block it to keep your content out of Apple Intelligence while staying in Spotlight and Siri search.

7. Crawl Frequency Varies by an Order of Magnitude

Crawlers don't visit on the same schedule, and that matters for time-sensitive content:

Crawler	Typical frequency	Trigger
Googlebot	Daily, for popular pages	Continuous, priority-weighted
PerplexityBot	Hours to days	Query-triggered plus scheduled
OAI-SearchBot	Days	Scheduled
GPTBot	Irregular, roughly monthly	Scheduled, not documented
ClaudeBot	Irregular	Scheduled, not documented
CCBot	Monthly	Scheduled crawl cycle
Bytespider	Unknown	Undisclosed

The gap between Googlebot (daily) and CCBot (monthly) means a correction to your content reaches Google's index within days but may not reach the next Common Crawl snapshot for weeks, and models trained on that snapshot pick it up later still. For live-retrieval systems like Perplexity and ChatGPT Search, freshness is measured in hours to days. For training data, it's measured in months.

8. AI Crawler Traffic Is Growing Fast

Cloudflare's data for the twelve months to May 2025: overall crawler traffic grew 18 percent, while GPTBot alone jumped 305 percent. About 14 percent of the top domains on the web now carry explicit AI-crawler rules in their robots.txt.

Most robots.txt files were written before AI crawlers existed. The default (no named-agent rules and a wildcard allow) lets every crawler in, including ones you may not intend to permit. That default is no longer neutral.
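A first step is to find out which AI tokens your current file has no explicit rule for. This sketch reads the User-agent lines of a robots.txt and lists the tokens from this post that are not named. The token list is my selection, not an authoritative registry, and the function names are made up for the example.

```python
AI_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
    "CCBot", "Google-Extended", "Applebot-Extended", "Amazonbot",
]


def named_agents(robots_txt):
    """Return the lowercased set of user agents the file names explicitly."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def undecided_tokens(robots_txt, tokens=AI_TOKENS):
    """List AI tokens that only fall under the wildcard or no rule at all."""
    named = named_agents(robots_txt)
    return [token for token in tokens if token.lower() not in named]


LEGACY_FILE = """\
User-agent: *
Allow: /

User-agent: GPTBot  # added later
Disallow: /
"""

print(undecided_tokens(LEGACY_FILE))
```

An empty result does not mean the file is right, only that each token has had an explicit decision made about it.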

9. You Can Check Whether a Crawler Is Real

Any user-agent string can be faked. A request claiming to be GPTBot may not be OpenAI at all. The legitimate crawlers publish their IP ranges, support reverse DNS verification, or both:

GPTBot publishes its IP ranges at https://openai.com/gptbot.json.
Googlebot: reverse DNS should resolve to googlebot.com or google.com.
PerplexityBot: reverse DNS should resolve under perplexitybot.com.
CCBot: Common Crawl documents the crawler and publishes its address ranges.

Where the operator publishes IP ranges, check that the source IP from your server logs falls inside one of them. Where the operator supports reverse DNS, take the source IP, run a reverse DNS lookup, verify the hostname matches the crawler's documented domain, then run a forward DNS lookup on that hostname and check it resolves back to the same IP.
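The reverse-then-forward DNS check can be sketched with the standard library. The function below takes the lookups as optional parameters so it can be tried without network access; the hostname and address in the demonstration are invented, and the domain suffixes are the ones listed above. Treat it as a starting point, not a hardened verifier.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""

    def default_reverse(address):
        return socket.gethostbyaddr(address)[0]

    def default_forward(hostname):
        return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

    reverse_lookup = reverse_lookup or default_reverse
    forward_lookup = forward_lookup or default_forward

    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse record, so the claim cannot be verified

    # Require an exact domain or a true subdomain, so that a lookalike
    # such as "fakegooglebot.com" does not pass.
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False

    try:
        # Plain string comparison; IPv6 text forms may need normalising first.
        return ip in forward_lookup(hostname)
    except OSError:
        return False


# Offline demonstration with invented DNS answers.
def fake_reverse(address):
    return "crawl-192-0-2-10.googlebot.com"


def fake_forward(hostname):
    return {"192.0.2.10"}


print(verify_crawler_ip("192.0.2.10", ("googlebot.com", "google.com"),
                        fake_reverse, fake_forward))  # True
```

Called without the two lookup arguments, the function uses real DNS. Results are worth caching, since doing two lookups per request adds up quickly.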

Bytespider, by contrast, doesn't identify itself consistently and doesn't support reliable verification. Suspicious crawl volume from unverified sources is best treated as undeclared, not as a legitimate named bot.

10. The Decisions Are Independent, and Blocking a Retrieval Crawler Has a Cost

Each crawler, and each purpose, is a separate and independent decision. Blocking one doesn't touch any other:

Decision	Bot to configure
Allow or block Google Search	`Googlebot`
Allow or block Google AI training	`Google-Extended`
Allow or block OpenAI training	`GPTBot`
Allow or block ChatGPT Search	`OAI-SearchBot`
Allow or block Anthropic training	`ClaudeBot`
Allow or block Perplexity answers	`PerplexityBot`
Allow or block Apple Intelligence training	`Applebot-Extended`
Allow or block Amazon AI training	`Amazonbot`
Allow or block the open training archive	`CCBot`

Blocking a retrieval crawler means you disappear from that product's answers. Block PerplexityBot and people asking Perplexity questions won't see your content cited. Block OAI-SearchBot and your site drops out of ChatGPT Search. That's a distinct trade-off from blocking a training crawler, where the cost is slower or absent representation in the model weights: less visible, but longer lasting.

Some common configurations.

Open to everything. No named-agent rules needed. User-agent: * with Allow: / covers every crawler.

Keep search and live retrieval, exit AI training only:

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

Keep search only, exit all AI, training and retrieval both:

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Claude-SearchBot
Disallow: /
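If you would rather generate these files from a list of decisions than edit them by hand, a few lines will do it. This sketch builds a robots.txt in the same shape as the configurations above and checks the result with Python's standard-library parser, which approximates but does not replicate each vendor's own parser. The function names and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The "exit AI training only" decision from this post, as a list of tokens.
TRAINING_TOKENS = ["Google-Extended", "GPTBot", "ClaudeBot", "CCBot", "Amazonbot"]


def build_robots_txt(blocked_tokens, sitemap_url=None):
    """Emit one Disallow-all group per blocked token."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in blocked_tokens]
    body = "\n".join(groups)
    if sitemap_url:
        body += f"\nSitemap: {sitemap_url}\n"
    return body


def can_fetch(robots_txt, token, url="https://example.com/"):
    """Return True if this parser would let the token fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


robots = build_robots_txt(TRAINING_TOKENS)
for token in ("GPTBot", "OAI-SearchBot", "Googlebot"):
    print(token, can_fetch(robots, token))
```

Keeping the decisions in a list makes the yearly review easier: you change the list, regenerate, and rerun the checks.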

What This Means If You're Building for AI

Opting out of training data doesn't make you invisible to AI. Live-retrieval crawlers will still visit your pages when users ask questions. The real question is whether the model answering has already seen a clean, structured version of your content, or is meeting it raw for the first time.

A model with good training data about your site answers questions differently from one meeting your page cold. Both paths run on the same signals: explicit metadata, complete structured data, server-rendered content, accurate sitemaps.

robots.txt controls access. It doesn't control quality. Those are two separate problems, and each needs its own fix.

The robots.txt You Should Actually Have

If your position is "open to all AI", which is the right call for most companies, your robots.txt needs almost nothing:

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Add specific Disallow rules only for pages you genuinely want excluded: private drafts, premium content, internal tools. Don't add rules out of vague worry about "AI scraping" without knowing which systems you're blocking and what you give up by doing it.

Review the file at least once a year. A robots.txt written in 2021 doesn't describe 2026.

Where MX Fits

All of this is about decisions on which signals to make explicit, and robots.txt is only the access layer. MX is the discipline of building digital experiences that are machine-readable by design, so no crawler, agent, or model has to guess what you meant. The access rules say who may read you; the structure decides what they actually understand. Both have to say the same thing.

There's a faster way to find out where you stand than reading your own robots.txt and hoping. An MX audit tests your live site the way these crawlers do: which agents can reach you, what robots.txt actually allows, whether your structured data and sitemaps hold up, and where a model or an agent would be left guessing about you. It reports back in a form you can verify yourself rather than take on trust.

If you want to know how your site looks to the crawlers in this post, start with an audit and I'll walk you through what it finds.