Case study · AI-bot visibility across the world's 100 largest e-commerce properties
May 2026 · PrerenderProxy · public dataset and methodology
Executive summary
We audited the homepages of the 100 largest e-commerce sites in the world against ten different user-agent headers — Chrome desktop and mobile, Googlebot, Bingbot, GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, PerplexityBot, Applebot — first via direct HTTP probes, then for fifteen high-signal cases via a real headless Chromium browser. The objective was to answer one question:
> When an AI shopping assistant looks up a product, does the retailer's front page actually serve content that the AI can read?
Topline:
- 14 / 100 sites are fully AI-ready — every declared AI crawler that
reached them got a meaningful HTML page with parseable content.
- 62 / 100 sites block at least one named AI crawler outright. The
block pattern is strikingly uniform: 62 sites refuse the same four user-agents (GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot) with what appears to be the same managed WAF ruleset.
- 6 / 100 sites surgically block only ClaudeBot. Four are current or former eBay
properties, plus nike.com and canadiantire.ca.
- 5 / 100 sites still practice dynamic rendering — serving bot UAs
3–30× more pre-rendered HTML than they serve a real user. Amazon UK is the most extreme at ~30×.
- 13 / 100 are inconclusive — IP-walled from our origin and from
residential proxies alike.
The aggregate picture: a majority of large e-commerce sites have voluntarily reduced their visibility to AI-driven shopping queries through 2026, while continuing to invest in classical SEO infrastructure that AI bots could read just as easily if they were allowed in.
1 · Methodology
1.1 Sample frame
The site list combines Similarweb's "Ecommerce & Shopping" category ranking (April 2026 snapshot) for ranks 1–50 with a curated second half (51–100) covering the major fashion, electronics, home, department-store, beauty, sports, and regional ecommerce leaders not represented in the first half. Full list and rationale: sites.json in the public repository.
The list is geographically distributed (US, UK, DE, FR, IT, ES, NL, PL, RU, TR, JP, KR, CN, IN, SE Asia, Latin America, Australia, Africa) and verticals include marketplace (45), classifieds (6), fashion (9), electronics (8), home (7), department (6), beauty (4), sports (4), big-box (5), grocery (1), tickets (1), health (1), books-electronics (1).
1.2 User-agent matrix
Three groups:
- Human reference — Chrome Desktop (Windows). For 100 sites the
reference fetch came through a HasData residential US proxy with full JavaScript execution, simulating a real user browsing from a US residential IP. A second Chrome row used a direct HTTP curl from our Hetzner datacenter, controlling for IP-friendliness.
- Traditional search — Googlebot (smartphone profile), Bingbot.
- AI — GPTBot (training), ChatGPT-User (live retrieval), ClaudeBot
(training), PerplexityBot (training/index).
For 15 hand-picked sites we additionally launched a real headless Chromium with each of: Googlebot, Bingbot, Applebot, ChatGPT-User, Claude-User, Perplexity-User. (The training-only bots — GPTBot, ClaudeBot — were excluded from the rendered run because they do not execute JavaScript in production.)
1.3 Per-fetch instrumentation
For every (site, UA) cell we captured:
- HTTP status, final URL, byte length, response time
- `<title>`, `<h1>`, canonical, meta description, robots meta
- All JSON-LD `@type` values present (walking nested structures)
- Boolean: has `Product`, `Offer`, `BreadcrumbList`, `Organization` schema
- Visible text length (HTML stripped, script/style removed, whitespace collapsed)
- SHA-256 of the normalized visible text
- Word-token set, used to compute Jaccard similarity vs the Chrome baseline
- Multi-currency price tokens visible
- `hreflang` count, product-pattern link count
- Block reason (HTTP code, Cloudflare challenge keywords, captcha, rate-limit); explicit `None` if the response looked normal
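The text-level metrics in this list can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code in crawl.py: the function names (`visible_text`, `text_fingerprint`, `token_set`, `jaccard`, `jsonld_types`) and the tokenization rule (lowercased `\w+` runs) are hypothetical choices made for this example.

```python
import hashlib
import re
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping everything inside <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def visible_text(html: str) -> str:
    """Strip markup, drop script/style content, collapse whitespace."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def text_fingerprint(text: str) -> str:
    """SHA-256 hex digest of the normalized visible text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def token_set(text: str) -> set:
    """Assumed tokenization: lowercased runs of word characters."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jsonld_types(node) -> set:
    """Walk a parsed JSON-LD structure and collect every @type value."""
    found = set()
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            found.add(declared)
        elif isinstance(declared, list):
            found.update(t for t in declared if isinstance(t, str))
        for value in node.values():
            found |= jsonld_types(value)
    elif isinstance(node, list):
        for item in node:
            found |= jsonld_types(item)
    return found
```

Comparing `jaccard(token_set(bot_text), token_set(chrome_text))` per cell gives a similarity figure that tolerates small differences such as analytics IDs, while the SHA-256 fingerprint only matches when the normalized text is byte-identical.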
1.4 AI-readiness score 0–5
We rolled the per-cell metrics into a single integer score for each (site, UA):
| condition | adds |
|---|---|
| HTTP 200 with > 1 KB body | +1 |
| visible text > 500 characters | +1 |
| `<title>` present and ≥ 3 chars | +1 |
| any JSON-LD block present | +1 |
| has Product/Offer schema OR visible price OR ≥ 5 product-shaped `<a href>` | +1 |
Per-site verdict from the four AI-bot scores:
- `ai_ready`: min ≥ 4
- `partial`: max in [3, 4]
- `thin`: max ≤ 2 and not blocked
- `ai_blocked`: at least one AI UA hard-blocked
- `unreachable`: Chrome baseline itself failed
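The scoring table and verdict rules above can be written out as a short sketch. Several details are assumptions rather than facts from the audit code: the `Cell` field names are hypothetical, "1 KB" is taken as 1024 bytes, and the rules are applied in the order unreachable, ai_blocked, ai_ready, thin, partial. The stated rules do not cover a site whose minimum score is below 4 while its maximum is 5; this sketch files that case under `partial`.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Metrics for one (site, UA) fetch. Field names are illustrative."""
    status: int
    body_bytes: int
    visible_text_len: int
    title: str
    jsonld_blocks: int
    has_product_or_offer: bool
    has_visible_price: bool
    product_links: int
    blocked: bool = False


def readiness_score(c: Cell) -> int:
    """One point per satisfied row of the scoring table (0 to 5)."""
    score = 0
    if c.status == 200 and c.body_bytes > 1024:  # assumption: 1 KB = 1024 bytes
        score += 1
    if c.visible_text_len > 500:
        score += 1
    if len(c.title.strip()) >= 3:
        score += 1
    if c.jsonld_blocks > 0:
        score += 1
    if c.has_product_or_offer or c.has_visible_price or c.product_links >= 5:
        score += 1
    return score


def site_verdict(ai_cells, chrome_ok: bool) -> str:
    """Combine the AI-bot cells of one site into a verdict (assumed precedence)."""
    if not chrome_ok:
        return "unreachable"
    if any(c.blocked for c in ai_cells):
        return "ai_blocked"
    scores = [readiness_score(c) for c in ai_cells]
    if min(scores) >= 4:
        return "ai_ready"
    if max(scores) <= 2:
        return "thin"
    return "partial"  # includes the uncovered "min < 4, max = 5" case
```

For example, a site whose four AI-bot cells all score 5 comes out as `ai_ready`, while a single hard block on any of the four is enough to label the site `ai_blocked` regardless of the other scores.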
1.5 Asynchrony and concurrency
The harness is asyncio with bounded concurrency: 6 sites in parallel at the outer loop, 16 direct curls in flight at any moment, 3 HasData residential calls in flight (the chokepoint). 100 sites × 8 UAs ≈ 800 fetches completed in 6 minutes wall-clock.
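The three concurrency caps described above map naturally onto `asyncio.Semaphore`. The following is a sketch of that shape only, not the actual harness: `audit`, `fetch`, and `via_proxy` are hypothetical names, and the real fetch logic (curl, HasData calls) is replaced by a caller-supplied coroutine.

```python
import asyncio

SITE_LIMIT = 6     # sites processed in parallel at the outer loop
DIRECT_LIMIT = 16  # direct fetches in flight at any moment
PROXY_LIMIT = 3    # residential-proxy fetches in flight (the chokepoint)


async def audit(sites, uas, fetch, via_proxy=lambda ua: False):
    """Run fetch(site, ua) for every pair while respecting the three caps.

    fetch is a caller-supplied coroutine function; via_proxy(ua) decides
    which semaphore a given UA row is charged against.
    """
    site_sem = asyncio.Semaphore(SITE_LIMIT)
    direct_sem = asyncio.Semaphore(DIRECT_LIMIT)
    proxy_sem = asyncio.Semaphore(PROXY_LIMIT)

    async def one_fetch(site, ua):
        sem = proxy_sem if via_proxy(ua) else direct_sem
        async with sem:
            return (site, ua), await fetch(site, ua)

    async def one_site(site):
        async with site_sem:
            return await asyncio.gather(*(one_fetch(site, ua) for ua in uas))

    per_site = await asyncio.gather(*(one_site(site) for site in sites))
    return dict(pair for cells in per_site for pair in cells)
```

A call such as `asyncio.run(audit(sites, uas, fetch))` returns a dictionary keyed by `(site, ua)`. Keeping the proxy semaphore separate from the direct one means the slow residential calls cannot starve the direct fetches of slots.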
1.6 Public reproducibility
All code (crawl.py, report.py, v3_render.py, v3_report.py), inputs (sites.json, uas.json), and outputs (per-site JSON + saved HTML bodies + screenshots) are public under /srv/prerenderproxy/audit/2026-05-ecommerce-100/ and served at https://prerenderproxy.com/audit/2026-05-ecommerce-100/.
2 · Findings by verdict cohort
2.1 The AI-ready cohort (n = 14)
walmart.com · rakuten.co.jp · target.com · trendyol.com · craigslist.org · alibaba.com · samsung.com · shein.com · apple.com · nordstrom.com · ulta.com · newegg.com · otto.de · decathlon.com
These sites pass every AI-readiness criterion. Common architectural traits:
- One HTML response per request, served to every UA. SHA-256 of
the normalized visible text matches across UAs (or differs only in cache-busting tokens and analytics IDs).
- Server-side rendering as the default, not a special bot mode.
The same Next.js / Nuxt / custom Node stack that ships HTML to browsers is what bots see.
- JSON-LD `Organization` and `WebSite` on the homepage, with Product / Offer schema reserved for PDPs (which we did not test; we infer this with high confidence from sitemap structure).
- No Cloudflare "Block AI Bots" preset. These sites either
consciously evaluated and rejected the rule, use a different CDN, or were never tempted.
Vertical breakdown of AI-ready cohort: marketplace 5 · big-box 1 · department 1 · beauty 1 · electronics 3 · fashion 1 · sports 1 · classifieds 1.
2.2 The blocked cohort (n = 62)
We do not list all 62 here — the heatmap on /audit/2026-05-ecommerce-100/ shows them. The high-traffic names are notable:
- US retail: Best Buy, Costco, Home Depot, Lowe's, Macy's, Kohl's, Sephora, Wayfair (also IP-walled), Nike (technically ClaudeBot-only in v2; broader in v3)
- Fashion: zalando.de, asos.com, zara.com, hm.com, lululemon.com, adidas.com
- Home & furniture: ikea.com, Lowe's, Home Depot, Wayfair
- Latin American and Southeast Asian marketplaces: mercadolivre.com.br, mercadolibre.com.mx, mercadolibre.com.ar, shopee (.com.br, .co.id, .vn, .co.th, .sg), olx.com.br
- European retail: John Lewis, Currys, mediamarkt.de, saturn.de, cdiscount.com, argos.co.uk
- Asia: jd.com (regional issues), tmall.com, pinduoduo.com, gmarket.co.kr, shopping.yahoo.co.jp (also serves bot UAs more content)
The strong recurring response pattern: HTTP 403 with a ~150 KB body, carrying either a cf-mitigated response header or the literal challenge HTML. This is Cloudflare's default response when its "Block AI Bots" managed rule fires. The same body shape comes back from AWS WAF and Akamai when their equivalents fire.
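A block-reason detector for this response pattern might look like the sketch below. It is a heuristic under stated assumptions, not the audit's detector: the function name `block_reason`, the returned labels, and the keyword lists are hypothetical, and body keywords are only checked on non-200 responses so that an ordinary page mentioning a captcha is not misread as a block.

```python
import re

# Assumed keyword lists; the audit's actual patterns are not published here.
CHALLENGE_MARKERS = re.compile(r"just a moment|challenge-platform|cf-chl", re.I)
CAPTCHA_MARKERS = re.compile(r"captcha", re.I)


def block_reason(status: int, headers: dict, body: str):
    """Return a short block label, or None if the response looks normal."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "cf-mitigated" in lowered:
        return "cloudflare_challenge"
    if status == 200:
        return None
    if status == 429:
        return "rate_limit"
    if CAPTCHA_MARKERS.search(body):
        return "captcha"
    if CHALLENGE_MARKERS.search(body):
        return "challenge_html"
    return f"http_{status}"
```

Note the limitation this shares with any status-based check: a fast, near-empty 200 response returns `None` here, so it has to be caught by the visible-text metrics instead.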
Important nuance from v3. Of the 5 v3 sites where v2 flagged "AI blocked", 4 (amazon.co.uk, coupang.com, shopping.yahoo.co.jp, uniqlo.com) showed partial render success when probed with a real Chromium browser: search bots got through (Googlebot/Bingbot/Applebot UAs received content), AI bots did not (ChatGPT-User/Claude-User/Perplexity-User got 403 or empty). The block is therefore a search-vs-AI discrimination, not a universal anti-bot policy.
2.3 The dynamic rendering cohort (n = 5)
Sites where bot UAs receive substantially more pre-rendered content than the residential Chrome baseline:
| site | Chrome (residential) | bot UA | ratio |
|---|---|---|---|
| amazon.co.uk | ~30 KB shell | Googlebot/Bingbot ~900 KB | 30× |
| shopping.yahoo.co.jp | thin | Googlebot/Bingbot ~95 KB | 7–8× |
| canadiantire.ca | small | AI bots ~36 KB | 5.9× |
| coupang.com | shell + 403 to direct Chrome | Googlebot 14 KB | 5.3× |
| uniqlo.com | small | Googlebot ~5 KB | 3× |
All five are practicing the technique Google formally deprecated in 2024. The v3 rendered audit confirms that for amazon.co.uk and shopping.yahoo.co.jp the discrimination persists at the rendered DOM layer — these are not browser-rendering artifacts. Coupang and shopping.yahoo.co.jp additionally IP-wall real Chrome from our datacenter while trusting the Googlebot UA from the same IP. That is the unsafe UA-only allowlist pattern: a scraper that sets the right header gets through; a legitimate user from a VPN does not.
2.4 The ClaudeBot-only cohort (n = 6)
ebay.com · ebay.de · ebay.co.uk · kleinanzeigen.de · nike.com · canadiantire.ca
Six sites block exactly one AI bot, ClaudeBot, while their HTTP responses to GPTBot, ChatGPT-User, and PerplexityBot are 200 OK. The four eBay-lineage properties share an organizational history (kleinanzeigen.de is the former eBay Kleinanzeigen, which eBay sold to Adevinta in 2021). Nike and Canadian Tire are independent.
v3 (the rendered, fifteen-site deep-dive) found that all five sites from this cohort that we re-tested with a real Chromium browser fail to render for any bot UA: the ClaudeBot block seen in v2 is the visible part of a wider anti-bot policy. The v2 200 OK was a fast, near-empty response that the v2 detector did not flag as a block; the deeper content layer rejected the UA on a slower path.
This is the most important methodological lesson from the audit: an HTTP-only probe sees the WAF response; a JavaScript-rendering probe sees what the application would actually serve to the bot if it ran. The two layers can disagree.
2.5 The unreachable cohort (n = 13)
amazon.com · wayfair.com · sahibinden.com · very.co.uk · catch.com.au · boots.com · jumia.com.ng · ao.com · bunnings.com.au · kogan.com · etc.
Sites where even the HasData US residential proxy with full JS rendering got a 4xx, an HTTP 202 challenge, or an empty body. For these sites the audit cannot distinguish "blocks AI bots" from "blocks the proxy IP range we used". The findings are explicitly inconclusive.
The most-cited site in this cohort is amazon.com — Amazon's main US property serves an HTTP 202 challenge to our residential proxy, to our datacenter, and to every UA we tried. Amazon UK from the same residential range serves clean content. Amazon's edge policy varies by property and region.
3 · The 62-site WAF cluster — deeper look
Sixty-two sites returning the same fingerprint to the same four UAs suggests a single common rule. The candidates are well-known:
- Cloudflare ships a managed rule called "Block AI Bots / Scrapers"
that, when toggled on, returns 403 with the standard Cloudflare
challenge body to UAs matching gptbot|chatgpt|claude|perplexity and variations.
- AWS WAF Bot Control has a "Targeted AI bots" rule group that
produces equivalent behavior on AWS-fronted sites.
- Akamai Bot Manager offers an AI category in its policy editor.
We did not fingerprint each site's CDN to attribute the rule precisely, but cross-referencing public CDN data (Cloudflare Radar, BuiltWith) shows the 62-site cluster is heavily Cloudflare-skewed, consistent with the managed-rule hypothesis.
The business effect of the rule, restated:
- What it accomplishes — opts the site out of having its content
ingested by AI vendors as training data, and reduces server load from aggressive crawlers.
- What it costs — the same toggle blocks live retrieval bots that
answer customer queries right now. A ChatGPT user looking for "best pressure washer" will not see Home Depot's products in the answer. A Claude user looking for "kitchen mixer comparison" will not see Best Buy. The retailer is voluntarily invisible at the precise moment a purchase intent is being expressed.
The correct policy is to split the two. OpenAI explicitly publishes three distinct UAs (GPTBot for training, OAI-SearchBot for search indexing, ChatGPT-User for live retrieval). Anthropic publishes three (ClaudeBot, Claude-SearchBot, Claude-User). Perplexity publishes two (PerplexityBot, Perplexity-User). The WAF managed rule treats all of them as one category. A site that wants to opt out of training but participate in AI shopping has to write the rule by hand. Most of the 62 sites did not write the rule by hand.
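The difference between the coarse managed rule and a hand-written split can be illustrated with a small sketch. The purpose labels follow the vendor taxonomy described above; the policy itself (block training, allow search and live retrieval) is one possible choice, and `decide`, `coarse_rule`, and `DEFAULT_POLICY` are hypothetical names. PerplexityBot is labelled training/index in section 1.2 and is treated as search indexing here, which is an assumption.

```python
import re

# Purpose per published crawler token, as named in the text above.
UA_PURPOSE = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "retrieval",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "retrieval",
    "PerplexityBot": "search",      # assumption: treated as search indexing
    "Perplexity-User": "retrieval",
}

# One possible policy: opt out of training, stay visible to live answers.
DEFAULT_POLICY = {"training": "block", "search": "allow", "retrieval": "allow"}

# The substring pattern attributed to the managed rule in this section.
COARSE_PATTERN = re.compile(r"gptbot|chatgpt|claude|perplexity", re.I)


def coarse_rule(user_agent: str) -> str:
    """One category for every AI UA, as the managed rule is described."""
    return "block" if COARSE_PATTERN.search(user_agent) else "allow"


def decide(user_agent: str, policy=DEFAULT_POLICY) -> str:
    """Per-purpose decision; UAs outside the table are left to other rules."""
    lowered = user_agent.lower()
    for token, purpose in UA_PURPOSE.items():
        if token.lower() in lowered:
            return policy[purpose]
    return "allow"
```

With this table, a `ChatGPT-User` request is allowed while a `GPTBot` request is blocked, whereas `coarse_rule` blocks both. UA matching alone is spoofable, so a real deployment would pair it with source verification.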
4 · The dynamic rendering finding — examined
Section 2.3 above documents 5 sites still serving disproportionately more content to bot UAs than to real users. The pattern Google formally deprecated in 2024 — if isBot(userAgent): return prerenderedSnapshot else return spaShell — persists at major retailers in 2026.
Amazon UK is the headline case. v3 rendering shows the discrimination is not browser-vs-bot but trusted-bots-vs-AI-bots: Googlebot, Bingbot, and Applebot get the full pre-rendered version; ChatGPT-User, Claude-User, and Perplexity-User get a 200–400 byte stub. This is an internal allowlist of which crawlers receive the indexed version of the page.
This is technically not "cloaking" in Google's sense — Google's policy permits showing crawlers a pre-rendered version of the same content that users would eventually see after JS hydration. The pattern is defensible. But it does create a competitive asymmetry: the older crawlers (Google, Bing) get a multi-year head start in indexing quality over the newer AI crawlers, whose users are increasingly the ones expressing purchase intent.
The audit cannot tell us why Amazon UK maintains this allowlist. It can only document that they do.
5 · Recommendations
5.1 For retailers
1. Audit your WAF preset. If "Block AI Bots" is on, write out the list of UAs it covers and decide each one individually. Block training UAs if that's your policy. Allow live-retrieval UAs in nearly every case.
2. Verify the UA via reverse-DNS. The published `*.openai.com`, `*.anthropic.com`, `*.googlebot.com`, etc. hosts let you confirm the request is from the vendor. Don't rely on UA strings alone.
3. Add Product, Offer, and BreadcrumbList JSON-LD to PDPs and category pages. This is the single biggest signal AI shopping queries use to surface a product. The cost is one Next.js plugin.
4. Migrate off dynamic rendering for real this time. If your bot version is 30× the size of your user version, the bots are seeing a different site than your customers, and Google said to stop. SSR for everyone produces a smaller, faster, more correct system.
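The reverse-DNS check in recommendation 2 is usually done as a forward-confirmed lookup: resolve the IP to a hostname, check the hostname's suffix, then resolve the hostname back and confirm it includes the original IP. The sketch below assumes this approach; `verify_crawler_ip` and its parameters are hypothetical, and the suffix list for each vendor must be taken from that vendor's own documentation (some vendors publish IP ranges rather than reverse-DNS hosts).

```python
import ipaddress
import socket


def _reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> set:
    """A/AAAA lookup: hostname to the set of addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=_reverse_lookup, forward=_forward_lookup) -> bool:
    """Forward-confirmed reverse DNS.

    allowed_suffixes is caller-supplied, e.g. [".googlebot.com"]. The
    reverse/forward parameters exist so the lookups can be swapped out
    (for caching or for offline testing).
    """
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False
    # Force a leading dot so "evilgooglebot.com" cannot match "googlebot.com".
    suffixes = ["." + s.lower().lstrip(".") for s in allowed_suffixes]
    if not any(host.endswith(s) for s in suffixes):
        return False
    try:
        resolved = {ipaddress.ip_address(addr) for addr in forward(host)}
    except (OSError, ValueError):
        return False
    return ipaddress.ip_address(ip) in resolved
```

The forward step matters because anyone who controls the reverse zone for their own IP range can make a PTR record claim any hostname; only the vendor can make the forward record point back.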
5.2 For AI vendors
1. Publish IP ranges and update them with stable APIs. OpenAI's https://openai.com/searchbot.json is exactly the right shape; Anthropic should follow.
2. Distinguish training from live retrieval in your UA strings. The biggest single industry-wide improvement would be unambiguous training-vs-retrieval taxonomy. OpenAI and Anthropic do this. Perplexity, Bytespider, and Meta should follow.
3. Honor the granular robots.txt directives. A site that blocks GPTBot but allows ChatGPT-User is signaling consent for live retrieval; respect that signal.
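The granular signal in recommendation 3 can be checked with Python's standard `urllib.robotparser`. The robots.txt content and the `shop.example` URL below are made up for illustration, and `allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: opt out of training, allow live retrieval.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt text and ask whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `allowed(ROBOTS_TXT, "GPTBot", "https://shop.example/p/1")` is false and the same call for `"ChatGPT-User"` is true, which is the distinction a crawler operator is being asked to respect.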
5.3 For PrerenderProxy (and edge prerendering generally)
The audit creates a reusable testbed of 100 known-state e-commerce homepages across 8 UAs. We can:
1. Re-run monthly and chart drift, as a canary for "Cloudflare shipped a new rule today".
2. Use the 14 AI-ready sites as architectural references for customer builds.
3. Offer a free "is my site in the 62-site WAF cluster?" check as a lead-generation tool: the data is public, the answer is binary.
6 · Reproducibility
    # No git remote yet; raw files live under /srv/prerenderproxy/audit/2026-05-ecommerce-100/
    python3 -m pip install httpx playwright
    playwright install chromium
    python3 crawl.py        # 100 sites × 8 UAs, ~6 minutes
    python3 v3_render.py    # 15 sites × 7 UAs (Playwright), ~2 minutes
    python3 report.py       # writes index.html
    python3 v3_report.py    # writes render-deepdive.html
Public datasets:
- `summary.json`: full v2 results
- `data/<domain>__summary.json`: per-site v2 detail
- `data/<domain>/<ua>.html`: raw fetched HTML for each (site, UA)
- `v3-data/v3-summary.json`: full v3 results
- `v3-data/shots/<domain>/<ua>.jpg`: per (site, UA) screenshot
7 · Attribution
- Site ranking: Similarweb "Ecommerce & Shopping" April 2026 +
curated extension
- Residential proxy + JS rendering: HasData web scraping API
- Headless browser: Microsoft Playwright + Chromium
- UA reference: OpenAI, Anthropic, Google, Microsoft, Apple,
Perplexity public docs · No Hacks "AI User-Agent Landscape 2026"
May 2026 · PrerenderProxy · joni@xwander.fi