Cross-Bot HTML Parity Audit — Context, Findings, and Considerations

Master technical document · May 2026 · PrerenderProxy

1 · What problem we're investigating

In 2024 Google formally deprecated dynamic rendering — the long-standing practice of serving a server-rendered HTML snapshot to crawlers while shipping a JavaScript SPA to humans. The official guidance since then is "use SSR or SSG; serve one HTML to everyone." But the 2025–2026 reality created a problem the deprecation didn't foresee:

AI crawlers do not render JavaScript. A Vercel analysis of 500 M+ bot fetches in 2025 found that none of the major AI crawlers (GPTBot, ClaudeBot, PerplexityBot) execute JS. They request one HTML response, parse it as text, and move on.

Anti-bot edge products got aggressive. Cloudflare, Akamai, and AWS WAF shipped one-click "Block AI Bots" presets in late 2024–2025. Many ecommerce sites enabled them without distinguishing between training crawlers (GPTBot, ClaudeBot, CCBot) and live-retrieval crawlers (ChatGPT-User, Claude-User, Perplexity-User), even though only the latter directly affect product visibility in AI-driven shopping.

Dynamic rendering came back under new names. Edge prerendering via Cloudflare Workers, Akamai EdgeWorkers, dedicated services like Prerender.io, and self-hosted stacks like PrerenderProxy all produce the same outcome: bots receive fully-formed HTML, humans receive the SPA. Google's deprecation did not erase the underlying problem; it just renamed the solution.

The question this audit answers: for the world's 30 most-visited and 100 largest e-commerce sites, what does each major search engine and AI crawler actually see when it requests the homepage today?

2 · The three layers of the audit

v1 — Top 30 general sites, 10 UAs, HTTP only

Directory: audit/2026-05/

Methodology: bare HTTP curl with each User-Agent, no JS execution, no proxy. For each (site, UA) we captured status, body length, <title>, <h1>, JSON-LD blocks, normalized visible text + SHA-256, and a naive text-length ratio vs the Chrome-desktop baseline.
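The per-(site, UA) capture described above can be sketched with the standard library alone. This is a minimal illustration, not the code in crawl.py: the function name `extract_features`, the field names, and the normalization rule (lowercase plus whitespace collapse) are assumptions made for the example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _PageFeatures(HTMLParser):
    """Collects <title>, the first <h1>, JSON-LD blocks, and visible text."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title, self.h1 = "", ""
        self.jsonld, self.text = [], []
        self._skip = 0          # depth inside non-visible elements
        self._cur = None        # "title" or "h1" while inside one
        self._in_jsonld = False
        self._h1_seen = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            kind = (dict(attrs).get("type") or "").lower()
            if kind == "application/ld+json":
                self._in_jsonld = True
                self.jsonld.append("")
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "title" or (tag == "h1" and not self._h1_seen):
            self._cur = tag

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip = max(0, self._skip - 1)
            if tag == "script":
                self._in_jsonld = False
        elif tag == self._cur:
            if tag == "h1":
                self._h1_seen = True
            self._cur = None

    def handle_data(self, data):
        if self._in_jsonld:
            self.jsonld[-1] += data
        elif self._skip:
            return
        elif self._cur == "title":
            self.title += data
        else:
            if self._cur == "h1":
                self.h1 += data
            self.text.append(data)


def extract_features(status, html, baseline_text_len=None):
    """Return the per-response fields listed in the v1 methodology."""
    parser = _PageFeatures()
    parser.feed(html)
    parser.close()
    # Assumed normalization: collapse whitespace and lowercase.
    visible = re.sub(r"\s+", " ", " ".join(parser.text)).strip().lower()
    return {
        "status": status,
        "body_len": len(html),
        "title": parser.title.strip(),
        "h1": parser.h1.strip(),
        "jsonld_blocks": len(parser.jsonld),
        "text_len": len(visible),
        "text_sha256": hashlib.sha256(visible.encode("utf-8")).hexdigest(),
        # Naive text-length ratio against the Chrome-desktop baseline.
        "ratio_vs_baseline": (
            len(visible) / baseline_text_len if baseline_text_len else None
        ),
    }
```

Hashing the normalized text rather than the raw bytes keeps the comparison insensitive to whitespace and casing differences between responses.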

Headline numbers:

16 / 30 sites are fully UA-neutral (Google, YouTube, Facebook, Amazon, Microsoft, etc.)
14 / 30 serve byte-identical HTML to at least one AI bot and Chrome
4 / 30 IP-walled everything from our Hetzner datacenter IP
7 / 30 specifically block AI bot UAs while serving Chrome
5 / 30 reject the canonical Googlebot/Bingbot UA from non-Google IPs

Most striking finding: baidu.com still serves Googlebot, Bingbot, GPTBot, ClaudeBot and PerplexityBot ~250 KB of fully pre-rendered HTML while Chrome gets a 357-character shell. Textbook dynamic rendering at the world's fourth-largest search engine.

v2 — Top 100 e-commerce, 8 UAs, residential baseline + ecommerce extractors

Directory: audit/2026-05-ecommerce-100/

What changed from v1:

**Chrome baseline came through a HasData residential proxy with full JS rendering** instead of a bare curl from our datacenter. This eliminated the "all UAs blocked" rows that v1 produced for 4/30 sites — the reference is now what a real user in the US would see.

Added Chrome-direct as a second row to control for IP-friendliness. The difference between Chrome-baseline (residential, 200) and Chrome-direct (datacenter, 403) isolates an IP block from a UA block.

Ecommerce-specific extractors: full JSON-LD @type walk (not just block count), Product / Offer / BreadcrumbList detection, multi-currency price regex, product-pattern link count, hreflang count.

AI-readiness score 0–5 per UA: +1 status 200, +1 visible text > 500 chars, +1 title present, +1 any JSON-LD, +1 Product/Offer schema or price or ≥5 product-shape links. Roll up per site as min(AI bot scores) for verdict.

Per-site verdict: ai_ready (min AI score ≥ 4) · partial (max 3–4) · thin (max ≤ 2) · ai_blocked (any AI bot hard-blocked) · unreachable (residential baseline itself failed).

Jaccard similarity on word tokens alongside SHA equality. This gives a graded "how similar" value between 0 and 1 rather than a binary "same / not same".

Async harness with bounded concurrency: 6 sites in parallel, 16 direct curls in flight, 3 HasData calls in flight. 100 sites × 8 UAs ≈ 800 fetches in ~6 minutes wall clock.

Retry once after 2 s on block. Cloudflare's responses are non-deterministic and a single retry catches most of the false positives.
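The scoring and verdict rules in the list above can be written down as a short sketch. The `Probe` fields and function names are hypothetical. The order in which the verdicts are checked (unreachable, then ai_blocked, then ai_ready, partial, thin) is an assumption, because the list does not state which verdict wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One (site, UA) result; field names are illustrative."""
    status: int
    text_len: int
    title: str
    jsonld_blocks: int
    has_product_schema: bool = False
    has_price: bool = False
    product_links: int = 0
    hard_blocked: bool = False


def ai_readiness(p: Probe) -> int:
    """0-5 score: one point per criterion from the v2 list."""
    return sum([
        p.status == 200,
        p.text_len > 500,
        bool(p.title.strip()),
        p.jsonld_blocks > 0,
        p.has_product_schema or p.has_price or p.product_links >= 5,
    ])


def site_verdict(baseline_ok: bool, ai_probes: dict[str, Probe]) -> str:
    """Roll up per-UA scores into one verdict (assumed precedence).

    Expects at least one AI bot probe per site.
    """
    if not baseline_ok:
        return "unreachable"
    if any(p.hard_blocked for p in ai_probes.values()):
        return "ai_blocked"
    scores = [ai_readiness(p) for p in ai_probes.values()]
    if min(scores) >= 4:
        return "ai_ready"
    if max(scores) >= 3:
        return "partial"
    return "thin"
```

Checking `unreachable` first matches the caution in section 3.4: when the baseline itself fails, nothing about the bot rows can be interpreted.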

Headline numbers from v2:

metric	count
sites tested	100
fully AI-ready (every AI bot ≥ 4/5)	14
`partial` verdict (max 3–4)	6
`thin` verdict (max ≤ 2)	5
`ai_blocked` (≥ 1 AI bot hard-blocked)	62
`unreachable` (baseline failed)	13
IP-walled our datacenter while baseline OK	(subset of `partial` + `ai_blocked`)

The dominant pattern: 62/100 sites block exactly the same four AI bots together — GPTBot + ChatGPT-User + ClaudeBot + PerplexityBot. That homogeneity signals a one-click WAF preset, not 62 independent policy decisions.

Six sites surgically block only ClaudeBot. Four of them are eBay properties (ebay.com, ebay.de, ebay.co.uk, kleinanzeigen.de); the other two are nike.com and canadiantire.ca. The consistency across eBay properties points to an org-wide policy against Anthropic training, although we have no public statement confirming one; Nike and Canadian Tire are independent decisions.

v3 — 15 high-signal sites, real Chromium with 7 UAs

Directory: audit/2026-05-ecommerce-100/v3-data/ · Report: render-deepdive.html

What changed from v2:

Real headless Chromium (Playwright), full JS execution, 1280×900 viewport, screenshot per (site, UA), networkidle wait or 6 s cap.

**UA matrix reduced to 7 UAs that do render JS**: Chrome desktop, Googlebot smartphone, Bingbot, Applebot, ChatGPT-User, Claude-User, Perplexity-User. (GPTBot and ClaudeBot, the training bots, were deliberately excluded because they don't render JS in production; testing them with a real browser would be misleading.)

15 hand-picked sites chosen for signal density: 5 single-bot blockers from v2 + 4 dynamic-rendering candidates + 4 edge-blocked sites + 2 controls.

**Per-site verdict names the layer of discrimination**: edge-block holds / CDN-only block (real browser bypasses) / bot-favored render / full parity.

What v3 revealed that v2 missed:

1. The "single ClaudeBot block" pattern is misleadingly narrow. v2 said ebay.com only blocks ClaudeBot; v3 shows that 0 / 6 bot UAs successfully render content at eBay when a real browser arrives. v2's HTTP probe completed with 200 quickly but the bodies were empty/challenge-class; v3 with full JS settle exposes that eBay rejects every JS-capable bot UA, not just ClaudeBot.

2. Dynamic rendering is alive and well in 2026 at major retailers, most visibly in Asia.
   - amazon.co.uk — Googlebot/Bingbot/Applebot UAs all receive the fully pre-rendered ~28 KB page; ChatGPT-User / Claude-User / Perplexity-User get 200–400-byte stubs.
   - coupang.com — Real Chrome from our IP gets 403. Googlebot and Bingbot UAs from the same IP get a clean 200 with 14 KB. They trust the UA without IP verification.
   - shopping.yahoo.co.jp — Real Chrome 403, Googlebot/Bingbot/Applebot get full content, AI bot UAs get 403. A three-tier policy: search bots whitelisted, real users IP-walled, AI bots blacklisted.

3. Samsung does fingerprint-based anti-cloaking. Playwright + Googlebot UA gets 403 from samsung.com, but Playwright + ChatGPT-User UA gets 200. They check the combination of UA + browser-automation traits. Real Googlebot from Google's WRS presumably gets through; a scraper pretending to be Googlebot via headless Chrome does not. The right defense.

4. Walmart is the reference implementation. All 7 UAs get identical ~1.5 KB rendered text, Jaccard ~1.0, byte-identical SHA. One SSR pipeline, no UA forking.
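The parity measurement quoted for Walmart (Jaccard on word tokens next to SHA equality) can be illustrated as follows. This is a sketch under assumptions: the tokenizer (`\w+` on lowercased text) and the function names are chosen for the example and may differ from what the harness uses.

```python
import hashlib
import re


def word_tokens(text: str) -> set[str]:
    """Lowercased word tokens; the tokenization rule is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over word-token sets, in the range 0..1."""
    ta, tb = word_tokens(a), word_tokens(b)
    if not ta and not tb:
        return 1.0  # two empty texts are treated as identical
    return len(ta & tb) / len(ta | tb)


def same_sha(a: str, b: str) -> bool:
    """Binary check: identical SHA-256 of the two texts."""
    digest = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    return digest(a) == digest(b)
```

The two signals answer different questions: `same_sha` flips to False on a single changed character, while `jaccard` stays near 1.0 for a page that differs only in a timestamp or a session token.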

3 · Findings in detail

3.1 The 62-site WAF cluster

Sixty-two e-commerce sites — the dominant majority — block exactly the same combination of {GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot}. The exact fingerprint of the block (HTTP 403, ~150 KB Cloudflare interstitial, body contains cf-mitigated) suggests a single common ruleset, most likely the Cloudflare "Block AI Bots" managed rule or the equivalent presets in AWS WAF and Akamai Bot Manager.
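A rough sketch of how such a cluster can be detected: first flag responses that carry the block fingerprint, then group sites by the exact set of UAs they block. The function names are hypothetical. The body check follows the fingerprint described above; checking for a `cf-mitigated` response header as well is an added assumption, and the ~150 KB size is deliberately not used as a criterion here.

```python
from collections import Counter


def looks_like_cf_block(status: int, headers: dict[str, str], body: str) -> bool:
    """403 plus a cf-mitigated marker in the body or (assumed) in a header."""
    marker_in_body = "cf-mitigated" in body.lower()
    marker_in_headers = any(k.lower() == "cf-mitigated" for k in headers)
    return status == 403 and (marker_in_body or marker_in_headers)


def cluster_by_blocked_uas(site_blocks: dict[str, dict[str, bool]]) -> Counter:
    """Count sites per exact set of blocked UAs.

    site_blocks maps site -> {ua: was_blocked}. A large count on one
    frozenset is the homogeneity signal discussed in this section.
    """
    return Counter(
        frozenset(ua for ua, blocked in per_ua.items() if blocked)
        for per_ua in site_blocks.values()
    )
```

Calling `cluster_by_blocked_uas(...).most_common(3)` on the per-site results would surface the dominant block signature without anyone having to eyeball 100 rows.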

The cost of this default-on posture: a ChatGPT user asking "what's a 65-inch OLED TV around $1000" will not get product results from Best Buy, Costco, Home Depot, Lowe's, Macy's, Kohl's, Sephora, IKEA, Mediamarkt, Saturn, MercadoLibre (multiple TLDs), John Lewis, Currys, or any of the other blocked sites in the cluster. Those retailers are voluntarily invisible to live AI shopping queries.

The interesting nuance: the WAF preset doesn't distinguish between training and live retrieval. ClaudeBot is for training; Claude-User is for live retrieval. PerplexityBot is for indexing; Perplexity-User is for live answers. The four UAs blocked together represent two fundamentally different value exchanges — one is "your content will be used to train models you may never see revenue from", the other is "your product will be shown to a user right now who's asking about it." A reasonable policy would block the first and allow the second. The 62-site cluster blocks both indiscriminately.

3.2 The ClaudeBot vendetta

Six sites block exactly one AI bot — ClaudeBot — while allowing the other three. Four are eBay properties:

ebay.com, ebay.de, ebay.co.uk, kleinanzeigen.de (all eBay-owned)
nike.com
canadiantire.ca

eBay's policy is consistent across properties, which strongly suggests a corporate decision: Anthropic specifically is unwelcome; OpenAI and Perplexity are tolerated. Possible reasons: ongoing legal/business friction not in the public record; or a reading of Anthropic's published training practices as particularly problematic. Nike and Canadian Tire are independent decisions — no shared corporate structure — so the ClaudeBot-only block is also reachable through individual policy.

v3 contradicts the v2 narrative on these sites. When a real Chromium browser arrives at any of the eBay properties with any bot UA (not just ClaudeBot), the page fails to render. The HTTP probe returned a 200 because eBay's challenge layer responds fast and small; the meaningful content gating happens deeper. The "ClaudeBot-only" v2 finding is technically true at the HTTP layer and misleading at the user-experience layer.

3.3 Dynamic rendering revival

Six sites in v2 show bot UAs receiving more content than the residential Chrome baseline, five of them substantially more:

Site	Chrome (residential)	Bot UA gets	Ratio
amazon.co.uk	~30 KB shell	~900 KB pre-rendered	30×
shopping.yahoo.co.jp	~12 KB	~95 KB	7–8×
canadiantire.ca	~6 KB	~36 KB	5.9×
coupang.com	small shell	14 KB	5.3×
uniqlo.com	shell	~5 KB	3×
nordstrom.com	normal	+50%	1.5×

This is the pattern Google formally asked to stop in 2024. The fact that Amazon is doing it at scale — at a 30× content multiplier on its UK property — is the clearest evidence that the industry never actually left dynamic rendering behind. They just stopped calling it that.

v3 with real Chromium confirms the pattern persists at the rendered layer: amazon.co.uk gives Googlebot/Bingbot/Applebot the pre-rendered 28 KB version while ChatGPT-User and Claude-User get a 200–400 byte stub. The discrimination isn't between "human and bot" — it's between trusted bots (Google, Bing, Apple) and AI bots (OpenAI, Anthropic, Perplexity). Amazon, at least in the UK, has an explicit allowlist of who gets the indexed version.

3.4 IP-walled sites

13 sites returned unreachable in v2: even the residential proxy baseline got a hard 4xx, a challenge response, or an empty body. Examples: amazon.com (HTTP 202 challenge for everyone), wayfair.com, sahibinden.com, very.co.uk, catch.com.au, boots.com, jumia.com.ng, ao.com, bunnings.com.au, kogan.com.

For these sites our methodology cannot distinguish "blocks AI bots" from "blocks our specific IP set". Both the direct fetches from Hetzner and the HasData residential baseline hit walls. The honest reading: these results are inconclusive and should not be cited as "block AI bots".
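The reasoning that separates an IP block from a UA block, and that marks these 13 sites as inconclusive, can be expressed as a small decision function. The label strings and the function name are hypothetical; the three inputs correspond to the Chrome-baseline (residential), Chrome-direct (datacenter), and bot-UA (datacenter) rows of v2.

```python
def block_layer(baseline_ok: bool, chrome_direct_ok: bool, bot_ok: bool) -> str:
    """Name the layer that explains a bot UA's result for one site.

    baseline_ok      - Chrome via residential proxy succeeded
    chrome_direct_ok - Chrome UA from the datacenter IP succeeded
    bot_ok           - bot UA from the datacenter IP succeeded
    """
    if not baseline_ok:
        # Even the reference fetch failed: nothing can be attributed.
        return "inconclusive"
    if chrome_direct_ok:
        # The IP is accepted, so any failure is due to the UA.
        return "no_block" if bot_ok else "ua_block"
    if bot_ok:
        # The IP is rejected for Chrome but the bot UA passes
        # (the coupang.com pattern from v3).
        return "bot_allowlisted"
    # Both fail from the datacenter: the IP wall masks any UA effect.
    return "ip_block"
```

Only the `ua_block` outcome supports a claim that a site blocks a given bot; `ip_block` and `inconclusive` say more about the vantage point than about the site's bot policy.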

The fact that amazon.com IP-walls the same residential IP range that amazon.co.uk happily serves suggests Amazon's geo + IP policy varies per property — a wider audit with a per-property proxy pool would resolve it.

3.5 The well-architected reference cases

14 sites are fully AI-ready by our scoring:

walmart.com · rakuten.co.jp · target.com · trendyol.com · craigslist.org · alibaba.com · samsung.com · shein.com · apple.com · nordstrom.com · ulta.com · newegg.com · otto.de · decathlon.com

Common traits: single SSR pipeline, no UA forking, Product / Offer JSON-LD in the homepage, hreflang for international variants, no Cloudflare "Block AI Bots" rule. These sites get the same rendered HTML in front of every crawler, which is exactly what Google's 2024 guidance asked for — and ironically, the sites following that guidance are also the ones AI crawlers can index successfully today.

4 · Considerations and methodological limits

4.1 We test from one IP — Hetzner Helsinki

Our findings are filtered through one origin IP. Every "blocked" finding is strictly: "blocked when this UA arrives from this IP". A site that verifies bot UA via reverse-DNS against the official bot IP ranges (which is the correct behavior!) will refuse a Googlebot UA from us even though it would happily serve real Googlebot. This is well-documented for major search engines and we count it as a positive policy signal, not a "block".
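For reference, the verification such a site performs is usually a forward-confirmed reverse DNS lookup: resolve the client IP to a hostname, check the hostname against the vendor's published domains, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is illustrative only. The function names are hypothetical, and the suffix list in the usage note is an example for Googlebot that should be checked against the vendor's current documentation.

```python
import socket


def _reverse(ip: str) -> str:
    """PTR lookup: IP -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> set[str]:
    """A/AAAA lookup: hostname -> set of IP strings."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_bot_ip(ip, allowed_suffixes, reverse=_reverse, forward=_forward):
    """Forward-confirmed reverse DNS check.

    allowed_suffixes must start with a dot (for example ".googlebot.com")
    so that "fakegooglebot.com" does not match. The resolver functions are
    injectable to keep the logic testable without network access.
    """
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    if not any(host.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

A call such as `verify_bot_ip(client_ip, (".googlebot.com", ".google.com"))` would reject every Googlebot-UA request in this audit, since our requests originate from a Hetzner address. That is exactly why such rejections are counted as a positive signal.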

The residential baseline (HasData US residential) mitigates this only for the Chrome reference row. Bot UA fetches still come from our IP. A more thorough audit would route each bot UA through an IP that's plausibly in that bot's range — but only the bot vendors themselves can do that.

4.2 Homepage only

We probe the homepage of each site. Homepages of marketplaces tend to be thin (just hero banners + nav). PDP (product detail pages) and category pages are where Product/Offer schema actually lives. A site that scores 2/5 on the homepage may still score 5/5 on a PDP. The audit is therefore a visibility-floor measurement, not a catalog-quality measurement.

A v4 audit could discover one PDP per site (crawl homepage for product link patterns, follow the first match) and probe that page in addition. Estimated effort: ~6 hours of harness work + an extra 5–10 minutes of crawl time. Worth doing for an "AI Shopping Visibility" rev.
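The PDP discovery step could look roughly like the sketch below: collect same-host links from the homepage, keep those whose path matches a product-shaped pattern, and follow the first match. The patterns in `PRODUCT_PATTERNS` are illustrative guesses, not the patterns used by the v2 product-link extractor, and the function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical product-URL shapes; a real harness would tune these per site.
PRODUCT_PATTERNS = [
    re.compile(p)
    for p in (r"/dp/[A-Z0-9]{10}", r"/products?/[^/]+", r"/itm/\d+", r"/ip/[^/]+")
]


class _Links(HTMLParser):
    """Collects href values of <a> tags in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def product_links(base_url: str, html: str) -> list[str]:
    """Absolute same-host URLs whose path looks like a product page."""
    parser = _Links()
    parser.feed(html)
    host = urlparse(base_url).netloc
    found = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        parsed = urlparse(url)
        if parsed.netloc == host and any(
            p.search(parsed.path) for p in PRODUCT_PATTERNS
        ):
            found.append(url)
    return found


def first_pdp(base_url: str, html: str):
    """First product-shaped link on the page, or None if there is none."""
    links = product_links(base_url, html)
    return links[0] if links else None
```

`len(product_links(...))` doubles as a product-pattern link count, so the same helper could serve both the v2 extractor and the v4 discovery step.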

4.3 GPTBot/ClaudeBot don't render JS

We test these training bots with the v2 HTTP probe — which is the right choice because they don't execute JS in production. We deliberately exclude them from v3 (Playwright). The findings about these bots therefore reflect what they actually fetch.

4.4 ChatGPT-User / Claude-User / Perplexity-User do render JS

These live-retrieval bots use real headless browsers in their production fleet (per public statements from OpenAI, Anthropic, Perplexity). v3 with Playwright is a reasonable approximation of what they see. The major caveat: their headless setups have specific fingerprints that real anti-bot products may recognize as "real X bot from real X IP" versus our "Playwright pretending to be X bot from Hetzner IP". Samsung's fingerprint-based discrimination (section 2, v3 finding 3 above) is the proof-of-concept that the difference is measurable.

4.5 HasData appends rather than replaces UA

We learned this during v1 toolkit testing: setting headers.User-Agent in HasData payload causes the API to append the requested UA to its default Chrome UA, not replace it. This means HasData cannot be used to test "what does the target site serve to UA X" — only "what does it serve to a Chrome+X concatenation, where the first match still wins". For Chrome baseline this is fine. For bot UA testing we must use direct curl, and a block is therefore a finding (the bot UA from a non-bot IP was rejected).

4.6 No retry on the cohort, just per-request

We retry each individual request once on block. We do not re-run the whole audit and average: although an individual Cloudflare response can flip between requests, its per-site anti-bot policy is stable across minutes-to-hours, so a single run is representative within ~5 % noise. A longer-running monitoring use case would repeat the audit daily and chart the variance.
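The per-request retry and the bounded concurrency of the harness can be sketched with asyncio alone. This is an illustration, not the code in crawl.py: `fetch` is a caller-supplied coroutine returning `(status, body)`, the set of status codes treated as a block is an assumption, and the separate limit of 3 HasData calls is left out for brevity.

```python
import asyncio

# Assumed definition of "block" for retry purposes.
BLOCK_STATUSES = {403, 429, 503}


async def fetch_with_retry(fetch, url, ua, delay=2.0):
    """Fetch once; on a block-class status, wait and retry exactly once."""
    status, body = await fetch(url, ua)
    if status in BLOCK_STATUSES:
        await asyncio.sleep(delay)
        status, body = await fetch(url, ua)
    return status, body


async def run_audit(fetch, sites, uas, site_limit=6, fetch_limit=16, delay=2.0):
    """Probe every (site, UA) pair with two nested concurrency bounds."""
    site_sem = asyncio.Semaphore(site_limit)    # sites in parallel
    fetch_sem = asyncio.Semaphore(fetch_limit)  # direct fetches in flight

    async def one(url, ua):
        async with fetch_sem:
            return ua, await fetch_with_retry(fetch, url, ua, delay)

    async def site(url):
        async with site_sem:
            pairs = await asyncio.gather(*(one(url, ua) for ua in uas))
            return url, dict(pairs)

    # Result shape: {site: {ua: (status, body)}}
    return dict(await asyncio.gather(*(site(u) for u in sites)))
```

In this sketch a request keeps its fetch slot while it sleeps before the retry, which trades a little throughput for a hard cap on in-flight requests.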

5 · What to do operationally with these findings

5.1 If you run an e-commerce site

Audit your own WAF preset. If you turned on "Block AI Bots" without reading the rule, you are likely blocking both ClaudeBot (training, reasonable to block) and Claude-User (live retrieval, blocks revenue). Split them: allow the *-User family, block the training-only bots.

Verify with reverse-DNS, not UA. UA-only allowlists are spoofable in 10 seconds. Use the published reverse-DNS hosts of each bot (*.googlebot.com, *.openai.com, *.anthropic.com, …) for the actual gate.

Don't dynamic-render in 2026. If your homepage takes 30× more bytes to serve to Googlebot than to a real user, you are doing something Google asked you to stop. Migrate to SSR. PrerenderProxy exists precisely for sites that can't migrate quickly — use it as a bridge, not a destination.

Add Product / Offer / BreadcrumbList JSON-LD on PDPs. Five of the AI-ready sites in our top-100 do this. AI shopping queries answered by ChatGPT and Perplexity cite the sites with the cleanest structured data first.
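The split recommended in the first item above (allow the *-User family, block the training-only bots) amounts to a two-list policy. The sketch below expresses it in Python for clarity; a real deployment would encode the same logic in the WAF's own rule language. The function name and return labels are hypothetical, the bot names come from this document, and PerplexityBot is left to the default branch because it is an indexing crawler rather than a training-only one.

```python
# Bot families as described in this document.
LIVE_RETRIEVAL = ("ChatGPT-User", "Claude-User", "Perplexity-User")
TRAINING_ONLY = ("GPTBot", "ClaudeBot", "CCBot")


def ai_bot_policy(user_agent: str) -> str:
    """Return "allow", "block", or "default" for a User-Agent string.

    UA matching alone is spoofable; treat the result as a routing hint
    and gate the "allow" branch on IP verification.
    """
    ua = user_agent.lower()
    if any(name.lower() in ua for name in LIVE_RETRIEVAL):
        return "allow"
    if any(name.lower() in ua for name in TRAINING_ONLY):
        return "block"
    return "default"  # humans, search bots, and anything unrecognized
```

Checking the live-retrieval list first keeps a UA string that happens to mention both families on the allow side, which is the safer default for product visibility.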

5.2 If you're operating an AI crawler

Publish your IP ranges and renderer behavior. OpenAI does this best; Anthropic is catching up; Perplexity is still informal. Sites that want to allow you need a reliable way to verify, and UA alone doesn't work.

Honor robots.txt at every level, but recognize that many e-commerce sites have permissive robots and aggressive WAF rules. The robots.txt allow is necessary but not sufficient.

Use distinguishing UAs for training vs live retrieval. OpenAI and Anthropic already do this. The single biggest improvement in AI-shopping visibility would be sites understanding the difference.

5.3 If you're operating PrerenderProxy

The audit gives us a public testbed of 100 sites whose front-page HTML we know across 8 UAs. We can re-run monthly and chart drift — a "did Cloudflare ship a rule change today?" canary.

The 14 fully-AI-ready sites are templates for what a well-architected bot-parity setup looks like. We can compare a customer's HTML against walmart's or apple's as a reference and produce a one-page delta report.

6 · File map

audit/2026-05/                             # v1 (top-30 general)
├── crawl.py · report.py · sites.json · uas.json · summary.json
├── data/<domain>/<ua>.html                # raw HTML per (site,UA)
└── index.html                             # public report

audit/2026-05-ecommerce-100/               # v2 (top-100 ecommerce)
├── crawl.py · report.py · sites.json · uas.json · summary.json
├── data/<domain>/<ua>.html
├── data/<domain>__summary.json
├── index.html                             # v2 public report
│
├── v3_render.py · v3_report.py            # v3 Playwright add-on
├── v3-data/v3-summary.json
├── v3-data/<domain>__v3.json
├── v3-data/shots/<domain>/<ua>.jpg        # screenshots
├── render-deepdive.html                   # v3 public report
│
├── docs/                                  # this set of master documents
│   ├── 01-context-findings-considerations.md
│   ├── 02-blog-post.md
│   ├── 03-case-study.md
│   └── 04-dynamic-rendering.md
└── bot/                                   # bot-by-bot reference
    ├── index.md · index.html
    └── <bot-slug>.md

7 · References

Vercel, "AI crawlers don't render JavaScript", 2025
Google, "Dynamic rendering deprecated", 2024
Anthropic, "Three-bot framework: ClaudeBot / Claude-User / Claude-SearchBot", 2025
OpenAI, "Overview of OpenAI Crawlers", developers.openai.com/api/docs/bots
Cloudflare, "Block AI Bots" managed ruleset
Similarweb / Semrush, "Top Ecommerce Websites — April 2026"