28 bots · verified May 2026

Bot Directory

Every common search, AI, and social-preview crawler that hits public web pages in 2026 — with its current User-Agent string, the URL where the vendor publishes its IP ranges, whether it executes JavaScript, and the robots.txt token you actually need. Categorised by what each bot does with your content (training, search index, live retrieval, link previews) so you can write a robots.txt that matches your policy without bundling categories that don't belong together.

Categories at a glance

5 · MEMORY: training crawlers
12 · SEARCH: index crawlers
5 · FETCH: live retrieval
4 · SOCIAL: preview bots
2 · DIRECTIVES: opt-out tokens

01 · The directory
| Bot | Description | Vendor | Category | Renders JS | robots.txt token | Honors robots.txt |
|---|---|---|---|---|---|---|
| GPTBot | OpenAI's training crawler | OpenAI | MEMORY | HTTP only | GPTBot | yes |
| ClaudeBot | Anthropic's training crawler | Anthropic | MEMORY | HTTP only | ClaudeBot | yes |
| CCBot | The web's most-upstream training corpus | Common Crawl Foundation | MEMORY | HTTP only | CCBot | yes |
| Bytespider | TikTok / Doubao training crawler | ByteDance | MEMORY | HTTP only | Bytespider | partial |
| Meta-ExternalAgent | Llama training crawler | Meta | MEMORY | HTTP only | Meta-ExternalAgent | yes |
| Googlebot | The reference search crawler | Google | SEARCH | RENDERS JS | Googlebot | yes |
| Bingbot | Bing + Microsoft Copilot grounding | Microsoft | SEARCH | RENDERS JS | bingbot | yes |
| YandexBot | Russia + CIS search | Yandex | SEARCH | partial JS | Yandex (umbrella) · YandexBot (specific) | yes |
| Baiduspider | China's dominant search crawler | Baidu | SEARCH | HTTP only | Baiduspider | yes |
| DuckDuckBot | Supplemental crawl; bulk results from Bing | DuckDuckGo | SEARCH | HTTP only | DuckDuckBot | yes |
| Brave Search Crawler | Independent search index | Brave | SEARCH | RENDERS JS | BraveBot | yes |
| Applebot | Siri / Spotlight / Safari suggestions | Apple | SEARCH | RENDERS JS | Applebot | yes |
| Yeti (Naver) | South Korea's dominant search | Naver | SEARCH | partial JS | Yeti | yes |
| PetalBot | Petal Search · Huawei device users | Huawei | SEARCH | RENDERS JS | PetalBot | yes |
| OAI-SearchBot | ChatGPT Search index | OpenAI | SEARCH | HTTP only | OAI-SearchBot | yes |
| Claude-SearchBot | Claude's web search index | Anthropic | SEARCH | HTTP only | Claude-SearchBot | yes |
| PerplexityBot | Perplexity's answer-citation index | Perplexity AI | SEARCH | HTTP only | PerplexityBot | yes |
| ChatGPT-User | ChatGPT live retrieval | OpenAI | FETCH | partial JS | ChatGPT-User | yes |
| Claude-User | Claude live retrieval | Anthropic | FETCH | RENDERS JS | Claude-User | yes |
| Perplexity-User | Perplexity live retrieval | Perplexity AI | FETCH | RENDERS JS | Perplexity-User | partial |
| Meta-ExternalFetcher | Meta AI live retrieval | Meta | FETCH | partial JS | Meta-ExternalFetcher | yes |
| Amazonbot | Alexa / Amazon Q answers | Amazon | FETCH | partial JS | Amazonbot | yes |
| facebookexternalhit | Facebook / Instagram / WhatsApp link cards | Meta | SOCIAL | HTTP only | facebookexternalhit | partial |
| Twitterbot | X / Twitter Card previews | X | SOCIAL | HTTP only | Twitterbot | partial |
| LinkedInBot | LinkedIn link cards | LinkedIn | SOCIAL | HTTP only | LinkedInBot | partial |
| Slackbot Link Unfurler | Slack link unfurl previews | Slack | SOCIAL | HTTP only | Slackbot-LinkExpanding | no |
| Google-Extended | Gemini training opt-out (directive only) | Google | DIRECTIVE | | Google-Extended | n/a |
| Applebot-Extended | Apple Intelligence training opt-out (directive only) | Apple | DIRECTIVE | | Applebot-Extended | n/a |

02 · The split that matters most

Brand vs publisher — the two correct defaults

There is no single "right" robots.txt for AI bots. The answer flips depending on what kind of company you run. Treat the two paths below as defaults, then path-disallow your sensitive directories on top of whichever one fits.

FOR BRANDS · default open

Allow MEMORY, SEARCH and FETCH

You sell a product or service. Your goal is to be in the candidate set when someone asks an LLM a category question. Parametric training-time presence is what makes the model surface you; retrieval and search alone do not put you in the answer.

# Brand default · 2026
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Meta-ExternalAgent
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
User-agent: Googlebot
User-agent: bingbot
Allow: /
# Named groups do not inherit the * rules, so repeat the sensitive paths here
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/

# Still block: documented abusive crawlers
User-agent: Bytespider
Disallow: /

# Path-disallow sensitive areas for every other crawler
User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /checkout/
Disallow: /api/
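
When named groups and a `*` group share one file, it is worth checking which rules a given bot actually sees. Under RFC 9309 a crawler follows only the group (or groups) naming its token and falls back to `*` when none does; within the chosen group the longest matching path wins, with ties going to Allow. The sketch below is a minimal evaluator under those two rules only. `parse_groups`, `is_allowed` and the `EXAMPLE` file are hypothetical names for illustration, and wildcards (`*` and `$` in paths) are deliberately not handled.

```python
def parse_groups(text):
    """Parse robots.txt text into a list of (agents, rules) groups."""
    groups, agents, rules, in_rules = [], [], [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules, in_rules = [], [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            in_rules = True
            if value:  # an empty Disallow matches nothing
                rules.append((field, value))
    if agents:
        groups.append((agents, rules))
    return groups


def is_allowed(text, token, path):
    """Apply group selection, then longest-prefix matching (no wildcards)."""
    groups = parse_groups(text)
    token = token.lower()
    # Named groups win; "*" is only a fallback when no named group matches.
    matched = [rules for agents, rules in groups if token in agents]
    if not matched:
        matched = [rules for agents, rules in groups if "*" in agents]
    best = ("allow", "")  # no matching rule means the path is allowed
    for kind, prefix in (rule for group in matched for rule in group):
        if path.startswith(prefix):
            longer = len(prefix) > len(best[1])
            tie_to_allow = len(prefix) == len(best[1]) and kind == "allow"
            if longer or tie_to_allow:
                best = (kind, prefix)
    return best[0] == "allow"


# Hypothetical file: the named group carries no path rules of its own.
EXAMPLE = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /admin/
"""

print(is_allowed(EXAMPLE, "GPTBot", "/admin/users"))        # True
print(is_allowed(EXAMPLE, "SomeOtherBot", "/admin/users"))  # False
```

In this example the named bot is still allowed into `/admin/` because it never consults the `*` group; adding `Disallow: /admin/` inside the named group closes that gap.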

The reasoning: Don't block GPTBot if you're a brand — the 2026 case for AI-memory presence.

FOR PUBLISHERS · default closed

Block MEMORY, allow SEARCH and FETCH

Your content is the product. You need negotiating leverage for licensing, an EU DSM Directive Article 4 opt-out preserved as a legal artifact, or evidentiary positioning in pending litigation. Blocking training crawlers is the load-bearing signal.

# Publisher default · 2026
# Opt out of LLM training (MEMORY)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# Allow search index + live retrieval (SEARCH + FETCH)
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Perplexity-User
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
Allow: /
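
Because both defaults are a choice of which categories to block, a file of this shape can be generated from a category map instead of maintained by hand. The sketch below is one way to do that, assuming category membership follows the directory table; `CATEGORY_TOKENS`, `build_robots` and its parameters are hypothetical names, the token lists cover only a subset of the directory, and the output approximates the hand-written defaults, it does not reproduce them line for line.

```python
# Tokens grouped by the directory's categories (subset, for illustration).
CATEGORY_TOKENS = {
    "MEMORY": ["GPTBot", "ClaudeBot", "CCBot", "Bytespider", "Meta-ExternalAgent"],
    "DIRECTIVE": ["Google-Extended", "Applebot-Extended"],
    "SEARCH": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot",
               "Googlebot", "bingbot", "Applebot"],
    "FETCH": ["ChatGPT-User", "Claude-User", "Perplexity-User",
              "Meta-ExternalFetcher", "Amazonbot"],
}


def build_robots(blocked_categories, sensitive_paths=(), always_block=("Bytespider",)):
    """Return robots.txt text that blocks whole categories and sensitive paths."""
    blocked, allowed = [], []
    for category, tokens in CATEGORY_TOKENS.items():
        for token in tokens:
            is_blocked = category in blocked_categories or token in always_block
            (blocked if is_blocked else allowed).append(token)

    path_rules = [f"Disallow: {path}" for path in sensitive_paths]
    lines = []
    if blocked:
        lines += [f"User-agent: {t}" for t in blocked] + ["Disallow: /", ""]
    if allowed:
        # Path rules are repeated here because named groups ignore the * group.
        lines += [f"User-agent: {t}" for t in allowed] + ["Allow: /"] + path_rules + [""]
    lines += ["User-agent: *"] + (path_rules or ["Allow: /"])
    return "\n".join(lines) + "\n"


# Publisher-style policy: block training crawlers and training opt-out tokens.
print(build_robots({"MEMORY", "DIRECTIVE"}))

# Brand-style policy: block nothing by category, protect sensitive paths.
print(build_robots(set(), ["/admin/", "/account/", "/checkout/", "/api/"]))
```

Keeping the category map in one place means that a bot which changes category, or a new token, is a one-line change rather than an edit to two hand-written files.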

The reasoning: Should you block AI bots? — the publisher decision framework.

A few rules hold on either side. Verification by reverse DNS or vendor-published IP range is mandatory, because UA-only matching is spoofable in seconds. Every per-bot page below lists the vendor-published IP-range source, and the implementation recipes for nginx, Cloudflare, Fastly, Vercel, AWS and Apache are documented in their own post. PrerenderProxy ships this rDNS + IP-range allowlist natively: every request classified as a bot is verified before it is served the rendered HTML, and no UA-only mode is shipped. Most of the blocking that actually affects AI visibility today happens in WAF rules, not robots.txt, so audit your managed WAF rules before assuming your robots.txt is the operative gate. And Bytespider goes on the blocklist either way: aggressive crawl patterns, no documented value exchange.
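
The verification step can be sketched in a few lines of Python, shown here as an illustration and not as any vendor's or product's actual implementation. It covers the two checks named above: forward-confirmed reverse DNS (PTR lookup, hostname suffix check, then a forward lookup that must return the original address) and membership in a published CIDR list. The function names, the injectable `reverse` and `forward` resolvers, and the example suffixes and ranges are assumptions; take the real suffixes and ranges from each vendor's documentation.

```python
import ipaddress
import socket


def in_published_ranges(ip, cidrs):
    """True if ip falls inside any vendor-published CIDR block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


def _system_reverse(ip):
    # PTR lookup via the system resolver.
    return socket.gethostbyaddr(ip)[0]


def _system_forward(host):
    # A/AAAA lookup via the system resolver.
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_reverse_dns(ip, allowed_suffixes,
                       reverse=_system_reverse, forward=_system_forward):
    """Forward-confirmed reverse DNS check for a claimed bot address."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # The hostname must be the suffix itself or a subdomain of it.
    if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        resolved = {ipaddress.ip_address(a) for a in forward(host)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved


# Offline demonstration with stub resolvers and documentation-only addresses.
def stub_reverse(ip):
    return "crawl-192-0-2-10.googlebot.com"


def stub_forward(host):
    return {"192.0.2.10"}


print(verify_reverse_dns("192.0.2.10", ["googlebot.com"], stub_reverse, stub_forward))
print(in_published_ranges("192.0.2.10", ["192.0.2.0/24"]))
```

The suffix check compares against `"." + suffix` so that a hostname such as `googlebot.com.evil.example` does not pass, and the forward lookup is what stops an attacker who controls the PTR record for their own address.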