Audit documents

Companion writing to the May 2026 cross-bot audit — technical findings, blog post, case study, dynamic-rendering summary.


Case study · AI-bot visibility across the world's 100 largest e-commerce properties

May 2026 · PrerenderProxy · public dataset and methodology


Executive summary

We audited the homepages of the 100 largest e-commerce sites in the world against ten different user-agent headers — Chrome desktop and mobile, Googlebot, Bingbot, GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Applebot — first via direct HTTP probes, then for fifteen high-signal cases via a real headless Chromium browser. The objective was to answer one question:

> When an AI shopping assistant looks up a product, does the retailer's front page actually serve content that the AI can read?

Topline:

reached them got a meaningful HTML page with parseable content.

block pattern is strikingly uniform: 62 sites refuse the same four user-agents (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot) with what appears to be the same managed WAF ruleset.

properties, plus nike.com and canadiantire.ca.

3–30× more pre-rendered HTML than they serve a real user. Amazon UK is the most extreme at ~30×.

residential proxies alike.

The aggregate picture: a majority of large e-commerce sites have voluntarily reduced their visibility to AI-driven shopping queries through 2026, while continuing to invest in classical SEO infrastructure that AI bots could read just as easily if they were allowed in.


1 · Methodology

1.1 Sample frame

The site list combines Similarweb's "Ecommerce & Shopping" category ranking (April 2026 snapshot) for ranks 1–50 with a curated second half (51–100) covering the major fashion, electronics, home, department-store, beauty, sports, and regional ecommerce leaders not represented in the first half. Full list and rationale: sites.json in the public repository.

The list is geographically distributed (US, UK, DE, FR, IT, ES, NL, PL, RU, TR, JP, KR, CN, IN, SE Asia, Latin America, Australia, Africa) and verticals include marketplace (45), classifieds (6), fashion (9), electronics (8), home (7), department (6), beauty (4), sports (4), big-box (5), grocery (1), tickets (1), health (1), books-electronics (1).

1.2 User-agent matrix

Three groups:

reference fetch came through a HasData residential US proxy with full JavaScript execution, simulating a real user browsing from a US residential IP. A second Chrome row used a direct HTTP curl from our Hetzner datacenter, controlling for IP-friendliness.

(training), PerplexityBot (training/index).

For 15 hand-picked sites we additionally launched a real headless Chromium with each of: Googlebot, Bingbot, Applebot, ChatGPT-User, Claude-User, Perplexity-User. (The training-only bots — GPTBot, ClaudeBot — were excluded from the rendered run because they do not execute JavaScript in production.)

1.3 Per-fetch instrumentation

For every (site, UA) cell we captured:

collapsed)

baseline

rate-limit) — explicit None if the response looked normal

1.4 AI-readiness score 0–5

We rolled the per-cell metrics into a single integer score for each (site, UA):

| condition | adds |
|---|---|
| HTTP 200 with > 1 KB body | +1 |
| visible text > 500 characters | +1 |
| <title> present and ≥ 3 chars | +1 |
| any JSON-LD block present | +1 |
| has Product/Offer schema OR visible price OR ≥ 5 product-shaped <a href> | +1 |
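The five conditions above can be sketched as a single scoring function. This is a minimal illustration, not the code in crawl.py: the function name `ai_readiness_score`, the regex-based text extraction, the currency-symbol price check, and the URL patterns used for "product-shaped" links are all assumptions made for the example.

```python
import re

def ai_readiness_score(status: int, html: str) -> int:
    """Return 0-5: one point per satisfied condition from the scoring table."""
    score = 0

    # 1. HTTP 200 with more than 1 KB of body (1 KB taken as 1024 bytes here).
    if status == 200 and len(html.encode("utf-8")) > 1024:
        score += 1

    # 2. Visible text longer than 500 characters.
    #    Crude extraction: drop <script>/<style> blocks, then strip all tags.
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    visible = re.sub(r"\s+", " ", re.sub(r"(?s)<[^>]+>", " ", no_scripts)).strip()
    if len(visible) > 500:
        score += 1

    # 3. <title> present with at least 3 characters.
    title = re.search(r"(?is)<title[^>]*>(.*?)</title>", html)
    if title and len(title.group(1).strip()) >= 3:
        score += 1

    # 4. Any JSON-LD block present.
    if re.search(r'(?i)<script[^>]+type=["\']application/ld\+json["\']', html):
        score += 1

    # 5. Product/Offer schema OR a visible price OR >= 5 product-shaped links.
    #    The price and link heuristics below are illustrative assumptions.
    has_schema = re.search(r'"@type"\s*:\s*"(Product|Offer)"', html) is not None
    has_price = re.search(r"[$€£¥]\s?\d", visible) is not None
    product_links = re.findall(
        r'(?i)<a\s[^>]*href=["\'][^"\']*/(?:product|p|dp|item)s?/', html
    )
    if has_schema or has_price or len(product_links) >= 5:
        score += 1

    return score
```

Read literally, only the first condition depends on the status code, so a 403 challenge page that carries a title still earns one point. Whether the real scorer gates every condition on HTTP 200 is not stated in this write-up.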

Per-site verdict from the four AI-bot scores:

1.5 Asynchrony and concurrency

The harness is asyncio with bounded concurrency: 6 sites in parallel at the outer loop, 16 direct curls in flight at any moment, 3 HasData residential calls in flight (the chokepoint). 100 sites × 8 UAs ≈ 800 fetches completed in 6 minutes wall-clock.
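The three concurrency limits can be expressed with nested semaphores. The sketch below assumes a placeholder `fetch` coroutine in place of the real HTTP and proxy calls; the names `crawl`, `fetch`, and the constant names are hypothetical, and only the limits (6, 16, 3) come from the text.

```python
import asyncio

SITE_LIMIT = 6     # sites processed in parallel at the outer loop
DIRECT_LIMIT = 16  # direct fetches in flight at any moment
PROXY_LIMIT = 3    # residential-proxy fetches in flight (the chokepoint)

async def fetch(site: str, ua: str, via_proxy: bool) -> dict:
    """Placeholder for the real network call; returns a fake result cell."""
    await asyncio.sleep(0)
    return {"site": site, "ua": ua, "via_proxy": via_proxy, "status": 200}

async def crawl(sites, direct_uas, proxy_uas):
    site_sem = asyncio.Semaphore(SITE_LIMIT)
    direct_sem = asyncio.Semaphore(DIRECT_LIMIT)
    proxy_sem = asyncio.Semaphore(PROXY_LIMIT)

    async def guarded(sem, site, ua, via_proxy):
        # Each fetch waits for a slot in its own pool before running.
        async with sem:
            return await fetch(site, ua, via_proxy)

    async def one_site(site):
        # The outer semaphore bounds how many sites are active at once.
        async with site_sem:
            jobs = [guarded(direct_sem, site, ua, False) for ua in direct_uas]
            jobs += [guarded(proxy_sem, site, ua, True) for ua in proxy_uas]
            return await asyncio.gather(*jobs)

    per_site = await asyncio.gather(*(one_site(s) for s in sites))
    # Flatten to one list of (site, UA) cells.
    return [cell for row in per_site for cell in row]
```

Separate pools keep the slow proxy calls from starving the cheap direct fetches, which is presumably why the proxy limit is described as the chokepoint.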

1.6 Public reproducibility

All code (crawl.py, report.py, v3_render.py, v3_report.py), inputs (sites.json, uas.json), and outputs (per-site JSON + saved HTML bodies + screenshots) are public under /srv/prerenderproxy/audit/2026-05-ecommerce-100/ and served at https://prerenderproxy.com/audit/2026-05-ecommerce-100/.


2 · Findings by verdict cohort

2.1 The AI-ready cohort (n = 14)

walmart.com · rakuten.co.jp · target.com · trendyol.com · craigslist.org · alibaba.com · samsung.com · shein.com · apple.com · nordstrom.com · ulta.com · newegg.com · otto.de · decathlon.com

These sites pass every AI-readiness criterion. Common architectural traits:

the normalized visible text matches across UAs (or differs only in cache-busting tokens and analytics IDs).

The same Next.js / Nuxt / custom Node stack that ships HTML to browsers is what bots see.

Product / Offer schema reserved for PDPs (which we did not test, but inference from sitemap structure is high-confidence).

consciously evaluated and rejected the rule, use a different CDN, or were never tempted.

Vertical breakdown of AI-ready cohort: marketplace 5 · big-box 1 · department 1 · beauty 1 · electronics 3 · fashion 1 · sports 1 · classifieds 1.

2.2 The blocked cohort (n = 62)

We do not list all 62 here — the heatmap on /audit/2026-05-ecommerce-100/ shows them. The high-traffic names are notable:

sephora, wayfair (also IP-walled), nike (technically ClaudeBot-only in v2; broader in v3)

adidas.com

mercadolibre.com.ar, shopee (.com.br, .co.id, .vn, .co.th, .sg), olx.com.br

argos.co.uk

shopping.yahoo.co.jp (also serves bot UAs more content)

The recurring response pattern is strongly consistent: HTTP 403 with a ~150 KB body, accompanied by a cf-mitigated response header or containing the literal challenge HTML. This is Cloudflare's default response when its "Block AI Bots" managed rule fires. The same body shape returns at AWS WAF and Akamai when their equivalents fire.
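A detector for this fingerprint might look like the following. It is a sketch under stated assumptions: the function name `block_signal`, the returned labels, and the body markers in `CHALLENGE_MARKERS` are illustrative choices, not the audit's actual detector.

```python
from typing import Mapping, Optional

# Illustrative substrings that tend to appear in challenge pages (assumption).
CHALLENGE_MARKERS = ("just a moment", "cf-challenge", "challenge-platform")

def block_signal(status: int, headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return a block label, or None if the response looks normal."""
    lowered = {name.lower(): value for name, value in headers.items()}

    if status == 429:
        return "rate-limit"

    if status == 403:
        # A cf-mitigated response header marks a Cloudflare-mitigated request.
        if "cf-mitigated" in lowered:
            return "waf-challenge"
        # Otherwise fall back to looking for challenge HTML in the body.
        text = body.lower()
        if any(marker in text for marker in CHALLENGE_MARKERS):
            return "waf-challenge"
        return "forbidden"

    return None
```

Header names are compared case-insensitively because HTTP clients differ in how they normalize them.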

Important nuance from v3. Of the 5 v3 sites where v2 flagged "AI blocked", 4 (amazon.co.uk, coupang.com, shopping.yahoo.co.jp, uniqlo.com) showed partial render success when probed with a real Chromium browser: search bots got through (Googlebot/Bingbot/Applebot UAs received content), AI bots did not (ChatGPT-User/Claude-User/Perplexity-User got 403 or empty). The block is therefore a search-vs-AI discrimination, not a universal anti-bot policy.

2.3 The dynamic rendering cohort (n = 5)

Sites where bot UAs receive substantially more pre-rendered content than the residential Chrome baseline:

| site | Chrome (residential) | bot UA | ratio |
|---|---|---|---|
| amazon.co.uk | ~30 KB shell | Googlebot/Bingbot ~900 KB | 30× |
| shopping.yahoo.co.jp | thin | Googlebot/Bingbot ~95 KB | 7–8× |
| canadiantire.ca | small | AI bots ~36 KB | 5.9× |
| coupang.com | shell + 403 to direct Chrome | Googlebot 14 KB | 5.3× |
| uniqlo.com | small | Googlebot ~5 KB | |

All five are practicing the technique Google formally deprecated in 2024. The v3 rendered audit confirms that for amazon.co.uk and shopping.yahoo.co.jp the discrimination persists at the rendered DOM layer — these are not browser-rendering artifacts. Coupang and shopping.yahoo.co.jp additionally IP-wall real Chrome from our datacenter while trusting the Googlebot UA from the same IP. That is the unsafe UA-only allowlist pattern: a scraper that sets the right header gets through; a legitimate user from a VPN does not.
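The ratio column is simply bot bytes divided by baseline bytes. A minimal sketch, assuming a hypothetical flagging threshold of 3× (chosen to match the low end of the 3–30× range in the summary, not taken from the audit code):

```python
from typing import Optional

def render_ratio(bot_bytes: int, chrome_bytes: int) -> Optional[float]:
    """Size of the bot response relative to the residential Chrome baseline."""
    if chrome_bytes <= 0:
        return None  # no usable baseline, ratio is undefined
    return bot_bytes / chrome_bytes

def is_dynamic_rendering(bot_bytes: int, chrome_bytes: int, threshold: float = 3.0) -> bool:
    """Flag a site when bots receive at least `threshold` times more HTML."""
    ratio = render_ratio(bot_bytes, chrome_bytes)
    return ratio is not None and ratio >= threshold

# Amazon UK, using the approximate sizes from the table: ~900 KB vs ~30 KB.
print(render_ratio(900_000, 30_000))  # 30.0
```

A size ratio alone cannot separate dynamic rendering from a bot-specific error page that happens to be large, which is why the rendered v3 pass matters for confirmation.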

2.4 The ClaudeBot-only cohort (n = 6)

ebay.com · ebay.de · ebay.co.uk · kleinanzeigen.de · nike.com · canadiantire.ca

Six sites block exactly one AI bot, ClaudeBot, while their HTTP responses to GPTBot, ChatGPT-User, and PerplexityBot are 200 OK. The four eBay-lineage properties behave consistently (kleinanzeigen.de is the former eBay Kleinanzeigen, which eBay sold to Adevinta in 2021). Nike and Canadian Tire are independent.

v3 (the rendered, fifteen-site deep-dive) found that all five of these sites that we re-tested with a real Chromium browser fail to serve rendered content to any bot UA. The ClaudeBot block seen in v2 is the visible part of a wider anti-bot policy. The v2 200 OK was a fast, near-empty response that the v2 detector did not flag as a block. The deeper content layer rejected the UA on a slower path.

This is the most important methodological lesson from the audit: an HTTP-only probe sees the WAF response; a JavaScript-rendering probe sees what the application would actually serve to the bot if it ran. The two layers can disagree.

2.5 The unreachable cohort (n = 13)

amazon.com · wayfair.com · sahibinden.com · very.co.uk · catch.com.au · boots.com · jumia.com.ng · ao.com · bunnings.com.au · kogan.com · etc.

Sites where even the HasData US residential proxy with full JS rendering got a 4xx, an HTTP 202 challenge, or an empty body. For these sites the audit cannot distinguish "blocks AI bots" from "blocks the proxy IP range we used". The findings are explicitly inconclusive.

The most-cited site in this cohort is amazon.com — Amazon's main US property serves an HTTP 202 challenge to our residential proxy, to our datacenter, and to every UA we tried. Amazon UK from the same residential range serves clean content. Amazon's edge policy varies by property and region.


3 · The 62-site WAF cluster — deeper look

Sixty-two sites returning the same fingerprint to the same four UAs suggests a single common rule. The candidates are well-known:

that, when toggled on, returns 403 with the standard Cloudflare challenge body to UAs matching gptbot|chatgpt|claude|perplexity and variations.

produces equivalent behavior on AWS-fronted sites.

We did not fingerprint each site's CDN to attribute the rule precisely, but cross-reference with public CDN data (Cloudflare Radar, BuiltWith) shows the 62-site cluster is heavily Cloudflare-skewed, consistent with the managed-rule hypothesis.

The business effect of the rule, restated:

ingested by AI vendors as training data, and reduces server load from aggressive crawlers.

answer customer queries right now. A ChatGPT user looking for "best pressure washer" will not see Home Depot's products in the answer. A Claude user looking for "kitchen mixer comparison" will not see Best Buy. The retailer is voluntarily invisible at the precise moment a purchase intent is being expressed.

The correct policy is to split the two. OpenAI explicitly publishes three distinct UAs (GPTBot for training, OAI-SearchBot for search indexing, ChatGPT-User for live retrieval). Anthropic publishes three (ClaudeBot, Claude-SearchBot, Claude-User). Perplexity publishes two (PerplexityBot, Perplexity-User). The WAF managed rule treats all of them as one category. A site that wants to opt out of training but participate in AI shopping has to write the rule by hand. Most of the 62 sites did not write the rule by hand.
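The difference between the coarse managed rule and a hand-written split can be sketched in a few lines. The purpose labels follow the vendor taxonomy named above; the purposes assigned to the Anthropic and Perplexity UAs, the function name `decide`, and the shortened UA strings in the usage lines are assumptions for illustration.

```python
import re

# Coarse pattern, as described for the managed rule in this section.
COARSE_RULE = re.compile(r"gptbot|chatgpt|claude|perplexity", re.IGNORECASE)

# Purpose per published UA token. PerplexityBot is labelled "training/index"
# in the UA matrix; treating it as search indexing here is an assumption.
UA_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "retrieval",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "retrieval",
    "PerplexityBot": "search",
    "Perplexity-User": "retrieval",
}

def decide(user_agent: str, blocked_purposes=frozenset({"training"})) -> str:
    """Block only UAs whose purpose is in blocked_purposes; allow the rest."""
    ua = user_agent.lower()
    for token, purpose in UA_PURPOSE.items():
        if token.lower() in ua:
            return "block" if purpose in blocked_purposes else "allow"
    return "allow"  # UA not covered by this policy

live = "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"   # shortened, illustrative
train = "Mozilla/5.0 (compatible; GPTBot/1.1)"        # shortened, illustrative
print(bool(COARSE_RULE.search(live)), decide(live))    # True allow
print(bool(COARSE_RULE.search(train)), decide(train))  # True block
```

The coarse pattern matches both requests, while the split policy blocks only the training crawler. In production this logic would live in WAF rules rather than application code, and it should be paired with source verification because UA strings are trivially spoofed.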


4 · The dynamic rendering finding — examined

Section 2.3 above documents 5 sites still serving disproportionately more content to bot UAs than to real users. The pattern Google formally deprecated in 2024 — if isBot(userAgent): return prerenderedSnapshot else return spaShell — persists at major retailers in 2026.

Amazon UK is the headline case. v3 rendering shows the discrimination is not browser-vs-bot but trusted-bots-vs-AI-bots: Googlebot, Bingbot, and Applebot get the full pre-rendered version; ChatGPT-User, Claude-User, and Perplexity-User get a 200–400 byte stub. This is an internal allowlist of which crawlers receive the indexed version of the page.
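The observed behavior is consistent with a three-way branch on the user-agent. The sketch below is a hypothetical model of what the probes saw from the outside, not Amazon's implementation; the function name `select_variant` and the variant labels are invented for the example.

```python
# Crawlers that received the full pre-rendered page in the v3 run.
TRUSTED_CRAWLERS = ("googlebot", "bingbot", "applebot")
# Live-retrieval AI agents that received the 200-400 byte stub.
AI_RETRIEVAL_AGENTS = ("chatgpt-user", "claude-user", "perplexity-user")

def select_variant(user_agent: str) -> str:
    """Model of the observed allowlist: which page variant a UA would get."""
    ua = user_agent.lower()
    if any(token in ua for token in TRUSTED_CRAWLERS):
        return "prerendered"  # full pre-rendered HTML
    if any(token in ua for token in AI_RETRIEVAL_AGENTS):
        return "stub"         # near-empty response
    return "spa_shell"        # regular users: shell plus client-side JS
```

Compared with the two-way `isBot` pattern, the extra branch is what turns dynamic rendering into an allowlist.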

This is technically not "cloaking" in Google's sense — Google's policy permits showing crawlers a pre-rendered version of the same content that users would eventually see after JS hydration. The pattern is defensible. But it does create a competitive asymmetry: the older crawlers (Google, Bing) get a multi-year head start in indexing quality over the newer AI crawlers, whose users are increasingly the ones expressing purchase intent.

The audit cannot tell us why Amazon UK maintains this allowlist. It can only document that they do.


5 · Recommendations

5.1 For retailers

1. Audit your WAF preset. If "Block AI Bots" is on, write out the list of UAs it covers and decide each one individually. Block training UAs if that's your policy. Allow live-retrieval UAs in nearly every case.
2. Verify the UA via reverse-DNS. The published *.openai.com, *.anthropic.com, *.googlebot.com, etc. hosts let you confirm the request is from the vendor. Don't rely on UA strings alone.
3. Add Product, Offer, and BreadcrumbList JSON-LD to PDPs and category pages. This is the single biggest signal AI shopping queries use to surface a product. The cost is one Next.js plugin.
4. Migrate off dynamic rendering for real this time. If your bot version is 30× the size of your user version, the bots are seeing a different site than your customers, and Google said to stop. SSR for everyone produces a smaller, faster, more correct system.
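Reverse-DNS verification is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it includes the original IP. The sketch below uses only the standard library; the function name `verify_crawler_ip` and the injectable resolver parameters are assumptions for the example, and which suffixes a given vendor actually publishes should be checked against that vendor's documentation.

```python
import socket
from typing import Callable, Iterable, List, Optional

def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Optional[Callable[[str], str]] = None,
    forward: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Forward-confirmed reverse DNS check (IPv4 with the default resolvers)."""
    # Default resolvers use the system DNS; tests can inject fakes instead.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure

    # Suffixes should start with a dot (".googlebot.com") so that
    # "evilgooglebot.com" does not pass.
    if not any(host.endswith(suffix.lower()) for suffix in allowed_suffixes):
        return False

    try:
        return ip in forward(host)  # hostname must resolve back to the same IP
    except OSError:
        return False
```

The forward step matters because anyone who controls the reverse zone for their own IP range can set an arbitrary PTR name; only the vendor controls what its hostnames resolve to.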

5.2 For AI vendors

1. Publish IP ranges and update them with stable APIs. OpenAI's https://openai.com/searchbot.json is exactly the right shape; Anthropic should follow.
2. Distinguish training from live retrieval in your UA strings. The biggest single industry-wide improvement would be unambiguous training-vs-retrieval taxonomy. OpenAI and Anthropic do this. Perplexity, Bytespider, and Meta should follow.
3. Honor the granular robots.txt directives. A site that blocks GPTBot but allows ChatGPT-User is signaling consent for live retrieval; respect that signal.
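A granular robots.txt of the kind described in the last recommendation can be checked with Python's standard `urllib.robotparser`. The file contents, the `shop.example` URL, and the helper name `can_fetch` are illustrative assumptions; the sketch shows how a well-behaved crawler would read the signal, not how any vendor's crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of training crawlers, allow live retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

User-agent: *
Allow: /
"""

def can_fetch(user_agent: str, url: str = "https://shop.example/") -> bool:
    """Evaluate the example robots.txt for one user-agent token."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

for agent in ("GPTBot", "ChatGPT-User", "ClaudeBot", "Claude-User"):
    print(agent, can_fetch(agent))
```

robots.txt is advisory, so this only has an effect for crawlers that choose to honor it, which is the point of the recommendation.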

5.3 For PrerenderProxy (and edge prerendering generally)

The audit creates a reusable testbed of 100 known-state e-commerce homepages across 8 UAs. We can:

1. Re-run monthly and chart drift, as a canary for "Cloudflare shipped a new rule today".
2. Use the 14 AI-ready sites as architectural references for customer builds.
3. Offer a free "is my site in the 62-site WAF cluster?" check as a lead-generation tool. The data is public and the answer is binary.
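Charting drift between monthly runs reduces to comparing per-site verdicts. A minimal sketch, assuming each run has been reduced to a site-to-verdict mapping; the function name `verdict_drift` and the verdict labels in the usage lines are hypothetical, since the per-site JSON schema is not described here.

```python
from typing import Dict, Optional, Tuple

def verdict_drift(
    previous: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {site: (old_verdict, new_verdict)} for sites whose verdict changed.

    Sites that are new in `current` appear with an old verdict of None.
    """
    return {
        site: (previous.get(site), verdict)
        for site, verdict in current.items()
        if previous.get(site) != verdict
    }

may = {"a.example": "ai-ready", "b.example": "blocked"}
june = {"a.example": "blocked", "b.example": "blocked", "c.example": "ai-ready"}
print(verdict_drift(may, june))
# {'a.example': ('ai-ready', 'blocked'), 'c.example': (None, 'ai-ready')}
```

A sudden burst of sites moving into the same verdict in one month would be the canary signal described in the first item.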


6 · Reproducibility

# No git remote yet; the raw files are under /srv/prerenderproxy/audit/2026-05-ecommerce-100/
python3 -m pip install httpx playwright
playwright install chromium
python3 crawl.py            # 100 sites × 8 UAs, ~6 minutes
python3 v3_render.py        # 15 sites × 7 UAs (Playwright), ~2 minutes
python3 report.py           # writes index.html
python3 v3_report.py        # writes render-deepdive.html

Public datasets:

7 · Attribution

curated extension

Perplexity public docs · No Hacks "AI User-Agent Landscape 2026"

May 2026 · PrerenderProxy · joni@xwander.fi