Every common search, AI, and social-preview crawler that hits public web pages in 2026, each with its current User-Agent string, the URL where the vendor publishes its IP ranges, whether it executes JavaScript, and the robots.txt token you actually need. Bots are categorised by what they do with your content (training, search index, live retrieval, link previews) so you can write a robots.txt that matches your policy without bundling categories that don't belong together.
| Bot | Vendor | Category | Renders JS | robots.txt token | Honors robots.txt |
|---|---|---|---|---|---|
| GPTBot: OpenAI's training crawler | OpenAI | MEMORY | HTTP only | GPTBot | yes |
| ClaudeBot: Anthropic's training crawler | Anthropic | MEMORY | HTTP only | ClaudeBot | yes |
| CCBot: The web's most-upstream training corpus | Common Crawl Foundation | MEMORY | HTTP only | CCBot | yes |
| Bytespider: TikTok / Doubao training crawler | ByteDance | MEMORY | HTTP only | Bytespider | partial |
| Meta-ExternalAgent: Llama training crawler | Meta | MEMORY | HTTP only | Meta-ExternalAgent | yes |
| Googlebot: The reference search crawler | Google | SEARCH | RENDERS JS | Googlebot | yes |
| Bingbot: Bing + Microsoft Copilot grounding | Microsoft | SEARCH | RENDERS JS | bingbot | yes |
| YandexBot: Russia + CIS search | Yandex | SEARCH | partial JS | Yandex (umbrella) · YandexBot (specific) | yes |
| Baiduspider: China's dominant search crawler | Baidu | SEARCH | HTTP only | Baiduspider | yes |
| DuckDuckBot: Supplemental crawl; bulk results from Bing | DuckDuckGo | SEARCH | HTTP only | DuckDuckBot | yes |
| Brave Search Crawler: Independent search index | Brave | SEARCH | RENDERS JS | BraveBot | yes |
| Applebot: Siri / Spotlight / Safari suggestions | Apple | SEARCH | RENDERS JS | Applebot | yes |
| Yeti (Naver): South Korea's dominant search | Naver | SEARCH | partial JS | Yeti | yes |
| PetalBot: Petal Search · Huawei device users | Huawei | SEARCH | RENDERS JS | PetalBot | yes |
| OAI-SearchBot: ChatGPT Search index | OpenAI | SEARCH | HTTP only | OAI-SearchBot | yes |
| Claude-SearchBot: Claude's web search index | Anthropic | SEARCH | HTTP only | Claude-SearchBot | yes |
| PerplexityBot: Perplexity's answer-citation index | Perplexity AI | SEARCH | HTTP only | PerplexityBot | yes |
| ChatGPT-User: ChatGPT live retrieval | OpenAI | FETCH | partial JS | ChatGPT-User | yes |
| Claude-User: Claude live retrieval | Anthropic | FETCH | RENDERS JS | Claude-User | yes |
| Perplexity-User: Perplexity live retrieval | Perplexity AI | FETCH | RENDERS JS | Perplexity-User | partial |
| Meta-ExternalFetcher: Meta AI live retrieval | Meta | FETCH | partial JS | Meta-ExternalFetcher | yes |
| Amazonbot: Alexa / Amazon Q answers | Amazon | FETCH | partial JS | Amazonbot | yes |
| facebookexternalhit: Facebook / Instagram / WhatsApp link cards | Meta | | HTTP only | facebookexternalhit | partial |
| Twitterbot: X / Twitter Card previews | X | | HTTP only | Twitterbot | partial |
| LinkedInBot: LinkedIn link cards | LinkedIn | | HTTP only | LinkedInBot | partial |
| Slackbot Link Unfurler: Slack link unfurl previews | Slack | | HTTP only | Slackbot-LinkExpanding | no |
| Google-Extended: Gemini training opt-out (directive only) | Google | DIRECTIVE | n/a | Google-Extended | n/a |
| Applebot-Extended: Apple Intelligence training opt-out (directive only) | Apple | DIRECTIVE | n/a | Applebot-Extended | n/a |
There is no single "right" robots.txt for AI bots. The answer flips depending on what kind of company you run. Treat the two paths below as defaults, then path-disallow your sensitive directories on top of whichever one fits.
You sell a product or service. Your goal is to be in the candidate set when someone asks an LLM a category question. Parametric training-time presence is what makes the model surface you; retrieval and search alone do not put you in the answer.
# Brand default · 2026
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
User-agent: Googlebot
User-agent: bingbot
# A crawler obeys only its most specific matching group and ignores "*",
# so the sensitive paths are repeated here for the named bots
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/
Allow: /

# Still block: documented abusive crawlers
User-agent: Bytespider
Disallow: /

# Path-disallow sensitive areas for every other crawler
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/
The reasoning: Don't block GPTBot if you're a brand — the 2026 case for AI-memory presence.
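One detail worth checking in any file shaped like the brand default: a crawler that matches a named group follows that group only and ignores `User-agent: *`. The sketch below illustrates this with Python's standard `urllib.robotparser`, using a cut-down two-bot file rather than the full list. The two sample files and the `parse_robots` helper are illustrative assumptions. Note that `urllib.robotparser` applies the first matching rule within a group, while RFC 9309 crawlers apply the longest match, so the sample lists `Disallow` before `Allow` and both interpretations agree.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# Sensitive path listed only under "*": the named bots never see it.
PATHS_ONLY_UNDER_STAR = """\
User-agent: GPTBot
User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Same policy, with the path rule repeated inside the named group.
PATHS_REPEATED_IN_GROUP = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /admin/
Allow: /

User-agent: *
Disallow: /admin/
"""

loose = parse_robots(PATHS_ONLY_UNDER_STAR)
strict = parse_robots(PATHS_REPEATED_IN_GROUP)

# GPTBot matches the named group, so the "*" rules do not apply to it.
print("loose, GPTBot, /admin/users:", loose.can_fetch("GPTBot", "/admin/users"))
# An unlisted bot falls back to the "*" group.
print("loose, SomeOtherBot, /admin/users:", loose.can_fetch("SomeOtherBot", "/admin/users"))
# With the rule repeated in the named group, GPTBot is kept out of /admin/.
print("strict, GPTBot, /admin/users:", strict.can_fetch("GPTBot", "/admin/users"))
print("strict, GPTBot, /pricing:", strict.can_fetch("GPTBot", "/pricing"))
```

The practical consequence is that any path rule meant to bind a named bot has to live in that bot's own group.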
Your content is the product. You need negotiating leverage for licensing, an EU DSM Article 4 opt-out preserved as a legal artifact, or evidentiary positioning in pending litigation. Blocking training is the load-bearing signal.
# Publisher default · 2026
# Opt out of LLM training (MEMORY)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
# Allow search index + live retrieval (SEARCH + FETCH)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
Allow: /
The reasoning: Should you block AI bots? — the publisher decision framework.
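The two defaults differ mainly in which categories they allow, so they can be generated from one policy description. The following is a minimal sketch under stated assumptions: `TOKENS_BY_CATEGORY` is a hypothetical subset of the table's tokens, and `build_robots_txt` with its parameters is an illustrative helper, not part of any tool mentioned here. It repeats the sensitive-path rules inside the allow group, since named bots do not read the `*` group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical subset of the table's tokens, grouped by category.
TOKENS_BY_CATEGORY = {
    "MEMORY": ["GPTBot", "ClaudeBot", "CCBot", "Meta-ExternalAgent"],
    "SEARCH": ["OAI-SearchBot", "Claude-SearchBot", "Googlebot", "bingbot"],
    "FETCH": ["ChatGPT-User", "Claude-User", "Perplexity-User"],
    "DIRECTIVE": ["Google-Extended", "Applebot-Extended"],
}


def _group(tokens, rules):
    """Render one robots.txt group: User-agent lines followed by rules."""
    return "\n".join([f"User-agent: {t}" for t in tokens] + list(rules))


def build_robots_txt(allow, block, always_block=("Bytespider",),
                     sensitive_paths=("/admin/", "/account/", "/checkout/", "/api/")):
    """Build robots.txt text from allowed and blocked category names."""
    path_rules = [f"Disallow: {p}" for p in sensitive_paths]
    allow_tokens = [t for cat in allow for t in TOKENS_BY_CATEGORY[cat]]
    block_tokens = [t for cat in block for t in TOKENS_BY_CATEGORY[cat]]
    block_tokens += list(always_block)

    groups = []
    if allow_tokens:
        # Path rules come first so first-match and longest-match parsers agree.
        groups.append(_group(allow_tokens, path_rules + ["Allow: /"]))
    if block_tokens:
        groups.append(_group(block_tokens, ["Disallow: /"]))
    groups.append(_group(["*"], path_rules or ["Allow: /"]))
    return "\n\n".join(groups) + "\n"


def allowed(robots_txt, agent, path):
    """Check a generated file in memory with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


brand = build_robots_txt(allow=("MEMORY", "SEARCH", "FETCH"), block=())
publisher = build_robots_txt(allow=("SEARCH", "FETCH"), block=("MEMORY", "DIRECTIVE"))

print(publisher)
print("brand lets GPTBot in:", allowed(brand, "GPTBot", "/"))
print("publisher lets GPTBot in:", allowed(publisher, "GPTBot", "/"))
```

Generating the file this way keeps the path rules and the Bytespider block identical across both policies, so switching sides later is a one-argument change.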
A few rules hold on either side. Reverse-DNS verification is mandatory, because UA-only matching is spoofable in seconds. Every per-bot page below lists the vendor-published IP-range source, and the implementation recipes for nginx, Cloudflare, Fastly, Vercel, AWS and Apache are documented in their own post. PrerenderProxy ships this rDNS + IP-range allowlist natively: every request classified as a bot is verified before it is served the rendered HTML, and no UA-only mode is shipped. Most of the blocking that actually affects AI visibility today happens in WAF rules, not robots.txt, so audit your managed WAF rules before assuming your robots.txt is the operative gate. And Bytespider goes on the blocklist either way: aggressive crawl patterns, no documented value exchange.
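As a rough illustration of the verification step, the sketch below shows the two checks in Python: forward-confirmed reverse DNS (resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it returns the same IP) and a CIDR allowlist check. The function names, the injectable resolver parameters, and the sample suffix and ranges are assumptions for illustration; take the real hostname suffixes and IP ranges from each vendor's published documentation.

```python
import ipaddress
import socket


def ip_in_ranges(ip, cidrs):
    """True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(c, strict=False) for c in cidrs)


def _reverse_lookup(ip):
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host):
    """A/AAAA lookup: hostname to the set of IP strings it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_by_rdns(ip, allowed_suffixes,
                   reverse=_reverse_lookup, forward=_forward_lookup):
    """Forward-confirmed reverse DNS check for a claimed crawler IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or resolver failure
        return False

    # The hostname must equal a vendor domain or be a subdomain of it.
    suffixes = [s.strip(".").lower() for s in allowed_suffixes]
    if not any(host == s or host.endswith("." + s) for s in suffixes):
        return False

    # The hostname must resolve back to the original IP.
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(host)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved


# Usage with stub resolvers so the example runs offline (values are illustrative).
stub_reverse = lambda ip: "crawl-66-249-66-1.googlebot.com"
stub_forward = lambda host: {"66.249.66.1"}

print(verify_by_rdns("66.249.66.1", ["googlebot.com"], stub_reverse, stub_forward))
print(verify_by_rdns("203.0.113.9", ["googlebot.com"], stub_reverse, stub_forward))
print(ip_in_ranges("192.0.2.10", ["192.0.2.0/24", "2001:db8::/32"]))
```

The suffix test compares whole labels, so a hostname such as `crawl.googlebot.com.evil.example` does not pass. In production the results would normally be cached, since two DNS lookups per request is too slow for a hot path.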