Bot Directory — Every common search and AI crawler

01 · The directory

Bot	Description	Vendor	Category	Renders JS	robots.txt token	Honors robots.txt
GPTBot	OpenAI's training crawler	OpenAI	MEMORY	HTTP only	`GPTBot`	yes
ClaudeBot	Anthropic's training crawler	Anthropic	MEMORY	HTTP only	`ClaudeBot`	yes
CCBot	The web's most-upstream training corpus	Common Crawl Foundation	MEMORY	HTTP only	`CCBot`	yes
Bytespider	TikTok / Doubao training crawler	ByteDance	MEMORY	HTTP only	`Bytespider`	partial
Meta-ExternalAgent	Llama training crawler	Meta	MEMORY	HTTP only	`Meta-ExternalAgent`	yes
Googlebot	The reference search crawler	Google	SEARCH	yes	`Googlebot`	yes
Bingbot	Bing + Microsoft Copilot grounding	Microsoft	SEARCH	yes	`bingbot`	yes
YandexBot	Russia + CIS search	Yandex	SEARCH	partial	`Yandex` (umbrella) · `YandexBot` (specific)	yes
Baiduspider	China's dominant search crawler	Baidu	SEARCH	HTTP only	`Baiduspider`	yes
DuckDuckBot	Supplemental crawl; bulk results from Bing	DuckDuckGo	SEARCH	HTTP only	`DuckDuckBot`	yes
Brave Search Crawler	Independent search index	Brave	SEARCH	yes	`BraveBot`	yes
Applebot	Siri / Spotlight / Safari suggestions	Apple	SEARCH	yes	`Applebot`	yes
Yeti (Naver)	South Korea's dominant search	Naver	SEARCH	partial	`Yeti`	yes
PetalBot	Petal Search · Huawei device users	Huawei	SEARCH	yes	`PetalBot`	yes
OAI-SearchBot	ChatGPT Search index	OpenAI	SEARCH	HTTP only	`OAI-SearchBot`	yes
Claude-SearchBot	Claude's web search index	Anthropic	SEARCH	HTTP only	`Claude-SearchBot`	yes
PerplexityBot	Perplexity's answer-citation index	Perplexity AI	SEARCH	HTTP only	`PerplexityBot`	yes
ChatGPT-User	ChatGPT live retrieval	OpenAI	FETCH	partial	`ChatGPT-User`	yes
Claude-User	Claude live retrieval	Anthropic	FETCH	yes	`Claude-User`	yes
Perplexity-User	Perplexity live retrieval	Perplexity AI	FETCH	yes	`Perplexity-User`	partial
Meta-ExternalFetcher	Meta AI live retrieval	Meta	FETCH	partial	`Meta-ExternalFetcher`	yes
Amazonbot	Alexa / Amazon Q answers	Amazon	FETCH	partial	`Amazonbot`	yes
facebookexternalhit	Facebook / Instagram / WhatsApp link cards	Meta	SOCIAL	HTTP only	`facebookexternalhit`	partial
Twitterbot	X / Twitter Card previews	X	SOCIAL	HTTP only	`Twitterbot`	partial
LinkedInBot	LinkedIn link cards	LinkedIn	SOCIAL	HTTP only	`LinkedInBot`	partial
Slackbot Link Unfurler	Slack link unfurl previews	Slack	SOCIAL	HTTP only	`Slackbot-LinkExpanding`	no
Google-Extended	Gemini training opt-out (directive only)	Google	DIRECTIVE	n/a	`Google-Extended`	n/a
Applebot-Extended	Apple Intelligence training opt-out (directive only)	Apple	DIRECTIVE	n/a	`Applebot-Extended`	n/a

02 · The split that matters most

Brand vs publisher — the two correct defaults

There is no single "right" robots.txt for AI bots. The answer flips depending on what kind of company you run. Treat the two paths below as defaults, then path-disallow your sensitive directories on top of whichever one fits.

FOR BRANDS · default open

Allow MEMORY, SEARCH and FETCH

You sell a product or service. Your goal is to be in the candidate set when someone asks an LLM a category question. Parametric training-time presence is what makes the model surface you; retrieval and search alone do not put you in the answer.

# Brand default · 2026
# A crawler that matches a named group ignores the `User-agent: *` group,
# so the sensitive-path rules are repeated here.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
User-agent: Googlebot
User-agent: bingbot
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/
Allow: /

# Still block: documented abusive crawlers
User-agent: Bytespider
Disallow: /

# Path-disallow sensitive areas for every other crawler
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/
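
One detail worth checking under either default: a crawler that matches a named group follows only that group and does not also inherit the `User-agent: *` rules, so path disallows have to be repeated inside the named group to apply to those bots. The sketch below illustrates this with Python's standard-library `urllib.robotparser`; the two sample files and the `can_fetch` helper are illustrative only, and real crawlers may differ from this parser in edge cases such as rule precedence.

```python
from urllib.robotparser import RobotFileParser


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Parse an in-memory robots.txt and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Named group only says "Allow: /": the wildcard disallow never reaches GPTBot.
NAMED_ONLY_ALLOW = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Sensitive path repeated inside the named group, listed before "Allow: /"
# so that first-match parsers and longest-match parsers agree.
NAMED_WITH_PATHS = """\
User-agent: GPTBot
Disallow: /admin/
Allow: /

User-agent: *
Disallow: /admin/
"""

if __name__ == "__main__":
    for label, text in (("allow only", NAMED_ONLY_ALLOW),
                        ("paths repeated", NAMED_WITH_PATHS)):
        print(label, "GPTBot /admin/users ->",
              can_fetch(text, "GPTBot", "/admin/users"))
```

With the first file, GPTBot is allowed into `/admin/` while unnamed crawlers are not; with the second, both are kept out and the rest of the site stays open to GPTBot.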

The reasoning: Don't block GPTBot if you're a brand — the 2026 case for AI-memory presence.

FOR PUBLISHERS · default closed

Block MEMORY, allow SEARCH and FETCH

Your content is the product. You need negotiating leverage for licensing, an EU DSM Directive Article 4 opt-out preserved as a legal artifact, or evidentiary positioning in pending litigation. Blocking training crawlers is the load-bearing signal.

# Publisher default · 2026
# Opt out of LLM training (MEMORY)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# Allow search index + live retrieval (SEARCH + FETCH)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
Allow: /

The reasoning: Should you block AI bots? — the publisher decision framework.
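
The two defaults differ only in how the MEMORY category (plus the directive-only training opt-out tokens) is treated, so both files can be derived from the directory's category column. The following is a minimal sketch of that idea, not a published tool: the profile names `brand` and `publisher`, the function names, and the subset of bots are assumptions made for illustration, with categories copied from the table above.

```python
# Category per robots.txt token, copied from a subset of the directory table.
CATEGORY = {
    "GPTBot": "MEMORY",
    "ClaudeBot": "MEMORY",
    "CCBot": "MEMORY",
    "Meta-ExternalAgent": "MEMORY",
    "Google-Extended": "DIRECTIVE",
    "Applebot-Extended": "DIRECTIVE",
    "OAI-SearchBot": "SEARCH",
    "Claude-SearchBot": "SEARCH",
    "PerplexityBot": "SEARCH",
    "ChatGPT-User": "FETCH",
    "Claude-User": "FETCH",
    "Perplexity-User": "FETCH",
}
ALWAYS_BLOCK = ["Bytespider"]  # blocked under both defaults

# Hypothetical profile names; each maps to the categories it disallows.
BLOCKED_CATEGORIES = {
    "brand": set(),
    "publisher": {"MEMORY", "DIRECTIVE"},
}


def build_groups(profile, sensitive_paths=()):
    """Return robots.txt groups as a list of (user_agents, rules) pairs."""
    if profile not in BLOCKED_CATEGORIES:
        raise ValueError(f"unknown profile: {profile!r}")
    blocked_cats = BLOCKED_CATEGORIES[profile]
    blocked = ALWAYS_BLOCK + [b for b, c in CATEGORY.items() if c in blocked_cats]
    # Directive-only tokens are never listed in an allow group.
    allowed = [b for b, c in CATEGORY.items()
               if c not in blocked_cats and c != "DIRECTIVE"]
    path_rules = [f"Disallow: {p}" for p in sensitive_paths]
    groups = [
        (blocked, ["Disallow: /"]),
        # Path rules are repeated here because named groups do not
        # inherit anything from the wildcard group.
        (allowed, path_rules + ["Allow: /"]),
    ]
    if path_rules:
        groups.append((["*"], path_rules))
    return groups


def render(groups):
    """Render groups as robots.txt text, one blank line between groups."""
    blocks = []
    for agents, rules in groups:
        lines = [f"User-agent: {a}" for a in agents] + rules
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(render(build_groups("publisher", ["/admin/", "/account/"])))
```

Keeping the category map as the single source of truth means a bot that changes category only needs one edit, and both profiles stay consistent with the directory.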

A few rules hold on either side. First, reverse-DNS verification is mandatory, because UA-only matching is spoofable in seconds. Every per-bot page below lists the vendor-published IP-range source, and the implementation recipes for nginx, Cloudflare, Fastly, Vercel, AWS and Apache are documented in their own post. PrerenderProxy ships this rDNS + IP-range allowlist natively: every request classified as a bot is verified before it is served the rendered HTML, and there is no UA-only mode. Second, most of the blocking that actually affects AI visibility today happens in WAF rules, not robots.txt, so audit your WAF's managed bot rules before assuming robots.txt is the operative gate. Third, Bytespider goes on the blocklist either way: aggressive crawl patterns, no documented value exchange.
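
The reverse-DNS check mentioned above is usually done as a forward-confirmed lookup: resolve the client IP to a hostname, check that the hostname sits under a domain the vendor documents, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is one way to write that in Python, not the method any particular product uses. The `RDNS_SUFFIXES` table is an illustrative subset that should be confirmed against each vendor's current documentation, and the function names and injectable `reverse` / `forward` parameters are assumptions made so the logic can be exercised without live DNS.

```python
import ipaddress
import socket
from typing import Callable, Iterable

# Illustrative subset of vendor-documented rDNS suffixes (verify before use).
RDNS_SUFFIXES = {
    "googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
    "applebot": (".applebot.apple.com",),
}


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> Iterable[str]:
    """A/AAAA lookup: hostname -> all addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_crawler(
    ip: str,
    token: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Iterable[str]] = forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check for a claimed crawler token."""
    suffixes = RDNS_SUFFIXES.get(token.lower())
    if suffixes is None:
        # No rDNS rule in this table; such bots would need an
        # IP-range check against the vendor-published list instead.
        return False
    try:
        target = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(suffixes):
            return False
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except (OSError, ValueError):
        # DNS failure or malformed address: treat as unverified.
        return False
```

A request claiming `Googlebot` would be checked with `verify_crawler(client_ip, "Googlebot")`. Caching results per IP is advisable in practice, since two DNS lookups per request would be expensive on a busy site.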

Categories at a glance
