Should you block AI bots? A 3-step decision framework

The single most common configuration mistake we see in 2026 is a site that meant to opt out of AI training data and ended up opting out of AI shopping referrals at the same time. The Cloudflare "Block AI Bots" preset, like its AWS and Akamai equivalents, bundles three categorically different bots into one rule. That is the source of most of the damage.

Here is the three-step framework we apply every time a customer asks us "should we block AI bots?"

Step 1 — Recognize that "AI bot" is three different things

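Most major AI vendors in 2026 operate three distinct crawler categories, each with its own User-Agent token and its own role:

1. Training crawlers, which collect pages for model training data (for example GPTBot, ClaudeBot).
2. Search-index crawlers, which build the index that AI search answers are grounded on (for example OAI-SearchBot, Claude-SearchBot).
3. Live-retrieval fetchers, which fetch a page on demand while a user waits for an answer (for example ChatGPT-User, Claude-User).

Three categories. Only the first is "AI training" in the sense most blog posts mean. The other two bring you customers.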

OpenAI splits these as GPTBot / OAI-SearchBot / ChatGPT-User. Anthropic splits them as ClaudeBot / Claude-SearchBot / Claude-User. Perplexity has PerplexityBot for its search index and Perplexity-User for live retrieval. Apple has Applebot for search and the Applebot-Extended token for the training opt-out. Meta has the same shape, with Meta-ExternalAgent for training and Meta-ExternalFetcher for live retrieval. Google uses Googlebot for search and the Google-Extended token for the training opt-out.

If your policy is "we don't want our content used to train large language models", you have a clean recipe: block the training tokens, leave the others alone. If your policy is "we don't want LLMs to know we exist at all", that is also a clean policy — it just costs you the live-retrieval and search-index pathways too.
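The split above can be sketched as a lookup table. This is a minimal illustration, not a complete catalog: the token-to-category mapping is taken from the vendor list in this post, while the names `BotClass`, `CRAWLER_TOKENS`, `classify`, and `should_block` are our own for the example. Google-Extended and Applebot-Extended are left out on purpose, because they are robots.txt tokens and do not appear in request headers.

```python
from enum import Enum
from typing import Optional


class BotClass(Enum):
    TRAINING = "training"
    SEARCH_INDEX = "search-index"
    LIVE_RETRIEVAL = "live-retrieval"


# Token -> category, following the vendor list in this post (not exhaustive).
CRAWLER_TOKENS: dict[str, BotClass] = {
    "GPTBot": BotClass.TRAINING,
    "ClaudeBot": BotClass.TRAINING,
    "Meta-ExternalAgent": BotClass.TRAINING,
    "OAI-SearchBot": BotClass.SEARCH_INDEX,
    "Claude-SearchBot": BotClass.SEARCH_INDEX,
    "PerplexityBot": BotClass.SEARCH_INDEX,
    "Applebot": BotClass.SEARCH_INDEX,
    "Googlebot": BotClass.SEARCH_INDEX,
    "ChatGPT-User": BotClass.LIVE_RETRIEVAL,
    "Claude-User": BotClass.LIVE_RETRIEVAL,
    "Perplexity-User": BotClass.LIVE_RETRIEVAL,
    "Meta-ExternalFetcher": BotClass.LIVE_RETRIEVAL,
}


def classify(user_agent: str) -> Optional[BotClass]:
    """Return the category of the first known token found in the header."""
    ua = user_agent.lower()
    for token, bot_class in CRAWLER_TOKENS.items():
        if token.lower() in ua:
            return bot_class
    return None  # not a crawler this table knows about


def should_block(user_agent: str,
                 blocked: frozenset = frozenset({BotClass.TRAINING})) -> bool:
    """Training-only opt-out by default; unknown agents are not blocked here."""
    return classify(user_agent) in blocked
```

The "we don't want LLMs to know we exist at all" policy is the same function called with all three categories in `blocked`. Note that this only reads the header, which anyone can forge; it expresses the policy and does not authenticate the caller.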

The Cloudflare "Block AI Bots" preset and most of its peers bundle all three. That is why so many of the 62 sites flagged in our 100-site audit have made themselves invisible to AI shopping while believing they were only opting out of training.

Step 2 — Decide what each category is worth to you

You answer three questions, in order, with numbers if you can:

What is the marginal value of training data? For most sites the answer is approximately zero direct value, and an unmeasurable indirect value via brand awareness once your content shapes the next model. The cost of allowing training is opportunity cost, not server cost — GPTBot fetches once and moves on.

What is the value of being indexed by AI search? ChatGPT Search, Perplexity, Claude's web tool, and now Apple Intelligence all build their grounding indexes from the search-class crawlers. If a customer asks any of those products a question that your products could answer, the index entry is the only way you appear. The value is the same as classical SEO, with a smaller addressable market today and a larger one each quarter.

What is the value of live retrieval? This is the only category that maps directly to revenue today. When a ChatGPT user asks "find me a 65-inch OLED TV under $1500", ChatGPT-User is the bot that goes and reads the retailer pages to answer. Block it and you do not appear in the answer. Period.

The mistake in the 62-site cluster is pricing all three categories at the value of category one and then paying the cost of blocking category three.
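The three questions reduce to simple arithmetic once you have estimates. The sketch below assumes you can put a monthly value and a monthly cost on each category; every number in it is a made-up placeholder to show the shape of the calculation, not audit data, and the names `CategoryEstimate`, `decide`, and `cost_of_blanket_block` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CategoryEstimate:
    name: str
    monthly_value_if_allowed: float  # revenue or equivalent you attribute to it
    monthly_cost_if_allowed: float   # opportunity cost, bandwidth, and so on


def net_value(estimate: CategoryEstimate) -> float:
    return estimate.monthly_value_if_allowed - estimate.monthly_cost_if_allowed


def decide(estimates: list[CategoryEstimate]) -> dict[str, str]:
    """Allow a category only when its estimated net value is positive."""
    return {e.name: "allow" if net_value(e) > 0 else "block" for e in estimates}


def cost_of_blanket_block(estimates: list[CategoryEstimate]) -> float:
    """Net value given up by blocking categories you would otherwise allow."""
    return sum(net_value(e) for e in estimates if net_value(e) > 0)


# Placeholder figures for illustration only.
estimates = [
    CategoryEstimate("training", 0.0, 100.0),
    CategoryEstimate("search-index", 500.0, 20.0),
    CategoryEstimate("live-retrieval", 4000.0, 50.0),
]

policy = decide(estimates)
forgone = cost_of_blanket_block(estimates)
```

With these placeholder inputs the per-category decision blocks training and allows the other two, while a bundled rule gives up the combined net value of both allowed categories.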

Step 3 — Write the robots.txt and verify it on the wire

For most e-commerce sites the right policy in 2026 is:

# Opt out of training
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# Allow search-index & live retrieval
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Meta-ExternalFetcher
User-agent: Amazonbot
User-agent: Applebot
Allow: /
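Before deploying, it is worth checking that the file parses the way you intend. A minimal sketch using Python's standard `urllib.robotparser`, run against an abbreviated copy of the policy; the URL and the `allowed` helper are assumptions for the example. Pass the bare token rather than a full User-Agent header, because this parser only looks at the text before the first `/` of the string it is given.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the policy, enough to exercise both groups.
POLICY = """\
# Opt out of training
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /

# Allow search-index & live retrieval
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /
"""


def allowed(policy_text: str, token: str,
            url: str = "https://shop.example/products/tv") -> bool:
    """Parse the policy text and ask whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(policy_text.splitlines())
    return parser.can_fetch(token, url)


results = {
    token: allowed(POLICY, token)
    for token in ("GPTBot", "ClaudeBot", "OAI-SearchBot", "ChatGPT-User")
}
```

This checks the file's logic only. Python's parser matches tokens by substring, and each vendor's crawler applies its own matching, so treat a pass here as a sanity check and still confirm the behavior you see in your access logs.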

This is necessary, not sufficient. Most of the blocking that affects AI shopping visibility today happens at the WAF layer, not the robots.txt layer. Two extra steps are required:

Audit your Cloudflare / AWS WAF / Akamai rule set. If "Block AI Bots / Scrapers" is on as a managed rule, find the underlying list of UAs and split it the same way as your robots.txt. Most CDN admins didn't notice the rule blocks live-retrieval bots; explicit splitting fixes that.
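Verify the verification logic. A site that allows ChatGPT-User by UA alone is allowing every scraper that sets the right header. The right check is on the source IP. Where the vendor documents a hostname pattern, do a reverse-DNS lookup, require the hostname to fall under the vendor's domain, and confirm with a forward lookup that the hostname resolves back to the same IP. Where the vendor publishes IP ranges instead (OpenAI publishes JSON lists for its crawlers), match against those. Most major AI vendors document one or the other. We document each one in our bot catalog.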
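The verification step can be sketched with the Python standard library. This is an illustration under assumptions, not any vendor's official procedure: the hostname suffixes and CIDR ranges must come from each vendor's own documentation, and the function names are ours. The resolver functions are passed in as parameters so the logic can be tested without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> list[str]:
    """A/AAAA lookup: hostname to every address it resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def verify_fcrdns(ip: str,
                  allowed_suffixes: Iterable[str],
                  reverse: Callable[[str], str] = reverse_lookup,
                  forward: Callable[[str], list[str]] = forward_lookup) -> bool:
    """Forward-confirmed reverse DNS: PTR must sit under an allowed suffix
    and must resolve back to the requesting IP."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Require a dot boundary so "evilvendor.example" cannot pass as "vendor.example".
    if not any(hostname.endswith("." + s.strip(".").lower())
               for s in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a) == target for a in addresses)


def in_published_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """Alternative check for vendors that publish IP ranges instead."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(c) for c in cidrs)
```

In production you would cache results per IP, since DNS lookups on the request path are slow, and run the check only for requests whose User-Agent claims to be a crawler you allow.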

The edge case worth knowing about

One vendor in our catalog has a controversial enforcement record: Perplexity. In mid-2024 Wired reported that the company fetched content from sites that had explicitly disallowed PerplexityBot by routing through residential IPs with a generic Chrome UA. Perplexity now asks sites to allow PerplexityBot (its index crawler) in robots.txt, but its own documentation states that the user-triggered Perplexity-User fetcher "generally ignores robots.txt rules". Compliance remains contested as of 2026. In August 2025 Cloudflare reported that Perplexity was using undeclared "stealth" crawlers with a generic Chrome user agent to bypass robots.txt and WAF blocks, and de-listed it as a verified bot; Perplexity disputed the findings. If you specifically do not want Perplexity to fetch you, the IP-level verification from Step 3, applied to Perplexity-User, is the actual gate; the robots.txt directive is only the policy statement.

Anthropic's ClaudeBot consolidated the older anthropic-ai and Claude-Web tokens in July 2024 — older robots.txt files that block those names should be updated to use ClaudeBot. Apple's Applebot-Extended is not a separate crawler; it's a robots.txt token that opts out of Apple Intelligence training while keeping Siri / Spotlight visibility intact. Google-Extended does the same for Gemini training while keeping Google Search.
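A quick way to find stale entries is to scan the file for the retired names. A minimal sketch, assuming the legacy-to-current mapping stated in the paragraph above; `LEGACY_TOKENS` and `find_legacy_tokens` are names made up for this example.

```python
# Retired token (lowercase) -> current token, per the paragraph above.
LEGACY_TOKENS = {
    "anthropic-ai": "ClaudeBot",
    "claude-web": "ClaudeBot",
}


def find_legacy_tokens(robots_txt: str) -> list[tuple[int, str, str]]:
    """Return (line number, token found, suggested token) for each stale entry."""
    findings = []
    for lineno, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        token = value.strip()
        replacement = LEGACY_TOKENS.get(token.lower())
        if replacement:
            findings.append((lineno, token, replacement))
    return findings


sample = """\
User-agent: anthropic-ai
User-agent: Claude-Web  # added in 2023
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

stale = find_legacy_tokens(sample)
```

The function only reports; whether to replace the old lines or keep them alongside ClaudeBot during a transition is a call for whoever owns the file.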

The PrerenderProxy default

When we deploy PrerenderProxy for a customer, the default policy ships with training tokens blocked, search-index and live-retrieval tokens allowed, and reverse-DNS verification mandatory. The bot version of every page is rendered from the exact same Next.js / Nuxt / Vue / Angular component tree the customer would hydrate, drift between bot HTML and user HTML raises an alert in Grafana, and the audit data from the 100-site cohort is included as a starting benchmark. That is the policy this post argues for, made operational.
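To make "drift" concrete, here is one way such a comparison could be scored. This is a generic sketch using the Python standard library and is not a description of how PrerenderProxy computes it: it extracts the visible text from both HTML documents, ignoring script and style content, and reports one minus the similarity ratio. The function names are hypothetical, and the threshold you alert on is a tuning decision.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text chunks, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))  # normalize whitespace


def visible_text(html: str) -> list[str]:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def drift(bot_html: str, user_html: str) -> float:
    """0.0 means identical visible text; 1.0 means nothing in common."""
    matcher = SequenceMatcher(None, visible_text(bot_html), visible_text(user_html))
    return 1.0 - matcher.ratio()
```

A price that differs between the two versions shows up as a nonzero score even when the markup is otherwise identical, which is the kind of mismatch worth alerting on.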

Companion reading: 62 of 100 — the audit findings · Bot reference catalog