Reverse-DNS bot verification — copy-paste recipes for seven platforms
The User-Agent header is a string a client sends. Anyone can send any string. In ten seconds an attacker can claim to be Googlebot, GPTBot, ClaudeBot or every other crawler simultaneously, and a site that allowlists by UA alone will trust them. The actual gate is at the IP layer.
This post is a working-code reference for the verification you should ship in production. Seven platforms, the three-step protocol, the canonical vendor IP sources. PrerenderProxy ships this natively — every classified-as-bot request is verified before it is served the rendered HTML — and if you operate your own stack the same recipes below are what we'd recommend.
1 · Why UA-only matching is a security hole, not a convention
In our May 2026 audit of the top 100 e-commerce sites, several major properties (Wikipedia, TikTok, eBay, cloud.microsoft, shopping.yahoo.co.jp on certain UAs) correctly returned HTTP 403 to a request that claimed to be Googlebot but came from our Hetzner datacenter IP. That was the expected, healthy behavior. Most sites in the cohort did the opposite: trusted the UA and served a prerender variant to anyone who set the header — including the scrapers, the price-aggregators, and the fake-Googlebot training crawlers that route through residential proxy networks for exactly this reason.
The check has three steps. None of them require third-party infrastructure.
- Reverse-DNS the requesting IP. Resolve the PTR record. You get a hostname like crawl-66-249-66-1.googlebot.com.
- Verify the hostname matches the vendor's published pattern. Googlebot is *.googlebot.com or *.google.com; OpenAI is *.openai.com; Anthropic is *.anthropic.com; Apple is *.applebot.apple.com. The directory page for each bot lists the pattern.
- Forward-DNS the hostname back to an IP. The result must match the original IP. This step is the one that defeats attackers who control a rDNS PTR record but not the forward A record.
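The three steps can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: the function names (verify_bot_ip, system_ptr, system_forward) are hypothetical, the suffix list covers only the Googlebot patterns named above, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket
from typing import Callable, Iterable

# Suffixes taken from the Googlebot patterns listed above; extend per vendor.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")

def system_ptr(ip: str) -> str:
    """Step 1: reverse lookup (PTR) through the system resolver."""
    return socket.gethostbyaddr(ip)[0]

def system_forward(host: str) -> set[str]:
    """Step 3: forward lookup (A and AAAA) through the system resolver."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}

def verify_bot_ip(
    ip: str,
    suffixes: Iterable[str] = ALLOWED_SUFFIXES,
    ptr_lookup: Callable[[str], str] = system_ptr,
    forward_lookup: Callable[[str], set[str]] = system_forward,
) -> bool:
    """Return True only if the PTR name matches a suffix and resolves back to ip."""
    try:
        host = ptr_lookup(ip).rstrip(".").lower()
    except OSError:
        return False  # no PTR record, so nothing to verify
    # Step 2: the leading dot in each suffix rejects names like evilgooglebot.com.
    if not any(host.endswith(s) for s in suffixes):
        return False
    try:
        forward = {ipaddress.ip_address(a) for a in forward_lookup(host)}
    except (OSError, ValueError):
        return False
    # Compare parsed addresses so IPv6 spelling differences do not matter.
    return ipaddress.ip_address(ip) in forward
```

Calling verify_bot_ip("66.249.66.1") with the defaults performs two live DNS queries, so in a request path the result should be cached.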
The faster modern variant: most major vendors now publish their IP ranges as machine-readable JSON. If you load those into an allowlist refreshed daily, you skip the DNS lookups entirely. The IP-range approach and the rDNS approach are complementary; the recommended production pattern is "IP-range first, rDNS as the fallback for vendors that don't publish a JSON".
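As a rough sketch of the "IP-range first, rDNS as the fallback" pattern: the function below checks an in-memory allowlist and only then defers to a slower rDNS check. The name classify_ip, the allowlist shape (vendor slug mapped to CIDR strings) and the rdns_fallback callback are assumptions made for illustration.

```python
import ipaddress
from typing import Callable, Mapping, Sequence

def classify_ip(
    ip: str,
    ranges: Mapping[str, Sequence[str]],
    rdns_fallback: Callable[[str], str] = lambda ip: "",
) -> str:
    """Return the vendor slug for a verified IP, or "" if it is unverified."""
    addr = ipaddress.ip_address(ip)
    # Fast path: published CIDR ranges, no network traffic involved.
    for vendor, cidrs in ranges.items():
        for cidr in cidrs:
            net = ipaddress.ip_network(cidr, strict=False)
            if addr.version == net.version and addr in net:
                return vendor
    # Slow path: vendors without a JSON feed get a live rDNS check.
    return rdns_fallback(ip)
```

The fallback returns a vendor slug too, so callers can compare the result against the UA claim the same way in both cases.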
2 · The vendor IP-range JSON sources (May 2026)
Every URL below was verified HTTP 200 live as of 2026-05-19. They are the canonical sources; build your allowlist refresh against these and nothing else.
| Vendor | JSON URL | rDNS pattern (fallback) |
|---|---|---|
| Googlebot | developers.google.com/static/search/apis/ipranges/googlebot.json | *.googlebot.com · *.google.com |
| Bingbot | www.bing.com/toolbox/bingbot.json | *.search.msn.com |
| GPTBot | openai.com/gptbot.json | *.openai.com |
| OAI-SearchBot | openai.com/searchbot.json | *.openai.com |
| ChatGPT-User | openai.com/chatgpt-user.json | *.openai.com |
| Anthropic (ClaudeBot, Claude-User, Claude-SearchBot) | claude.com/crawling/bots.json | *.anthropic.com |
| Applebot | search.developer.apple.com/applebot.json | *.applebot.apple.com |
| PerplexityBot | www.perplexity.ai/perplexitybot.json | *.perplexity.ai |
| Perplexity-User | www.perplexity.ai/perplexity-user.json | *.perplexity.ai |
| Amazonbot | developer.amazon.com/amazonbot/ip-addresses/ (HTML; parse table) | (none documented) |
Each JSON has the same general shape — a top-level prefixes array containing objects with ipv4Prefix or ipv6Prefix fields. There is some drift between vendors (Apple uses ipv4; older Google docs used a flatter format) but the script in §3.1 normalizes them.
3 · Implementation recipes
3.1 · nginx + a cron-generated allowlist
This is the recommended pattern for the majority of self-hosted setups. A small Python script fetches every vendor JSON daily, generates an nginx geo block, and reloads nginx. The match itself happens at request time as a single O(log n) lookup against a CIDR trie — no DNS, no API calls, no measurable latency.
Save as /usr/local/bin/refresh-bot-ips.py:
#!/usr/bin/env python3
"""Refresh /etc/nginx/conf.d/bot-ips.conf from vendor JSON sources.
Run hourly or daily via systemd timer / cron. Atomic rename + nginx reload.
"""
from __future__ import annotations
import datetime, ipaddress, json, os, sys, urllib.request
SOURCES = {
    "googlebot": "https://developers.google.com/static/search/apis/ipranges/googlebot.json",
    "bingbot": "https://www.bing.com/toolbox/bingbot.json",
    "gptbot": "https://openai.com/gptbot.json",
    "oai_searchbot": "https://openai.com/searchbot.json",
    "chatgpt_user": "https://openai.com/chatgpt-user.json",
    "anthropic": "https://claude.com/crawling/bots.json",
    "applebot": "https://search.developer.apple.com/applebot.json",
    "perplexitybot": "https://www.perplexity.ai/perplexitybot.json",
    "perplexity_user": "https://www.perplexity.ai/perplexity-user.json",
}
OUT = "/etc/nginx/conf.d/bot-ips.conf"
UA = "rdns-allowlist-refresher/1.0 (+https://prerenderproxy.com)"
def fetch_cidrs(url: str) -> list[str]:
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=15) as r:
        data = json.load(r)
    out: list[str] = []
    # Vendor JSONs use a few slightly different keys; normalize them here.
    for p in data.get("prefixes", []) or data.get("ranges", []) or []:
        for k in ("ipv4Prefix", "ipv6Prefix", "ipv4", "ipv6", "cidr"):
            v = p.get(k) if isinstance(p, dict) else None
            if v: out.append(v)
    return out
def previous_cidrs() -> dict[str, list[str]]:
    """Parse the last generated file so a failed fetch can reuse its ranges."""
    prev: dict[str, list[str]] = {}
    try:
        with open(OUT) as f:
            for line in f:
                parts = line.strip().rstrip(";").split()
                # Entry lines look like "<cidr> <vendor>;" and nothing else does.
                if len(parts) == 2 and parts[1] in SOURCES:
                    prev.setdefault(parts[1], []).append(parts[0])
    except FileNotFoundError:
        pass  # first run: nothing to fall back to
    return prev
def main() -> int:
    lines = [
        "# Auto-generated by refresh-bot-ips.py; do not edit by hand.",
        f"# Generated: {datetime.datetime.now(datetime.timezone.utc).isoformat()}",
        "",
        "geo $verified_bot {",
        '    default "";',
    ]
    prev = previous_cidrs()
    for vendor, url in SOURCES.items():
        try:
            cidrs = fetch_cidrs(url)
            if not cidrs:
                # An empty feed (maintenance window) is treated like a failure.
                raise ValueError("empty prefix list")
        except Exception as e:
            print(f"warn: {vendor} fetch failed ({e}); keeping previous entries", file=sys.stderr)
            cidrs = prev.get(vendor, [])
        added = 0
        for c in cidrs:
            try:
                ipaddress.ip_network(c, strict=False)
            except ValueError:
                continue
            lines.append(f"    {c} {vendor};")
            added += 1
        print(f"{vendor}: {added} ranges", file=sys.stderr)
    lines += ["}", ""]
    tmp = OUT + ".tmp"
    with open(tmp, "w") as f: f.write("\n".join(lines))
    os.rename(tmp, OUT)
    os.system("/usr/sbin/nginx -t && /usr/sbin/nginx -s reload")
    return 0
if __name__ == "__main__":
    sys.exit(main())
Systemd timer (/etc/systemd/system/refresh-bot-ips.timer):
[Unit]
Description=Refresh verified-bot IP allowlist daily
[Timer]
OnCalendar=*-*-* 04:00:00
RandomizedDelaySec=30m
Persistent=true
[Install]
WantedBy=timers.target
The service unit just runs the script. Now in your nginx vhost the variable $verified_bot contains the vendor slug if the request IP is in any vendor's allowlist, or an empty string otherwise. Combine it with a UA regex to gate the prerender route:
map $http_user_agent $ua_claim {
default "";
"~*googlebot" googlebot;
"~*bingbot" bingbot;
"~*GPTBot" gptbot;
"~*OAI-SearchBot" oai_searchbot;
"~*ChatGPT-User" chatgpt_user;
"~*ClaudeBot" anthropic;
"~*Claude-User" anthropic;
"~*Claude-SearchBot" anthropic;
"~*PerplexityBot" perplexitybot;
"~*Perplexity-User" perplexity_user;
"~*Applebot" applebot;
}
# Compose a single boolean from the two factors — avoids "if is evil" entirely.
map "$ua_claim:$verified_bot" $is_verified_bot {
default 0;
"~*^googlebot:googlebot$" 1;
"~*^bingbot:bingbot$" 1;
"~*^gptbot:gptbot$" 1;
"~*^oai_searchbot:oai_searchbot$" 1;
"~*^chatgpt_user:chatgpt_user$" 1;
"~*^anthropic:anthropic$" 1;
"~*^perplexitybot:perplexitybot$" 1;
"~*^perplexity_user:perplexity_user$" 1;
"~*^applebot:applebot$" 1;
}
map $is_verified_bot $bot_upstream {
0 app_backend;
1 prerender_backend;
}
server {
# ... ssl/server_name ...
location / {
proxy_pass http://$bot_upstream;
}
}
Two implementation notes. First, the composite-key map pattern keeps the routing decision out of if blocks entirely, which sidesteps the pitfalls behind nginx's "if is evil" warning that bite the older chained-conditional approach in production. Second, vendor JSONs sometimes ship empty during a maintenance window, so the refresh script must fall back to the previously generated entries when a fetch fails or returns no prefixes, rather than wiping the allowlist mid-day.
3.2 · OpenResty (nginx + Lua) — live rDNS for vendors that don't publish IPs
For bots that publish only an rDNS pattern and no JSON (Amazonbot, several niche crawlers), live verification at request time with Lua is the cleanest option. Cache each verdict in lua_shared_dict to keep cost flat: an hour for positive results, and between one minute and an hour for negative ones depending on how the check failed.
http {
lua_shared_dict verified_ips 10m;
init_by_lua_block {
require "resty.dns.resolver"
}
server {
# ...
set $bot_rdns_ok 0;
access_by_lua_block {
local ip = ngx.var.remote_addr
local cache = ngx.shared.verified_ips
local cached = cache:get(ip)
if cached ~= nil then
ngx.var.bot_rdns_ok = cached
return
end
local resolver = require("resty.dns.resolver")
local r, err = resolver:new({ nameservers = {"8.8.8.8", "1.1.1.1"}, timeout = 2000 })
if not r then ngx.log(ngx.ERR, "dns init: ", err); ngx.var.bot_rdns_ok = 0; return end
-- IPv4 only here. For IPv6 (Googlebot/Applebot do use IPv6), branch into ip6.arpa
-- and TYPE_AAAA — the protocol is identical, only the encoding changes.
if ip:find(":", 1, true) then
cache:set(ip, "0", 3600); ngx.var.bot_rdns_ok = 0; return
end
-- PTR lookup
local parts = {}
for octet in ip:gmatch("(%d+)") do parts[#parts+1] = octet end
if #parts ~= 4 then ngx.var.bot_rdns_ok = 0; return end
local ptr_name = parts[4].."."..parts[3].."."..parts[2].."."..parts[1]..".in-addr.arpa"
local ans, rerr = r:query(ptr_name, { qtype = r.TYPE_PTR })
if not ans or not ans[1] or not ans[1].ptrdname then
cache:set(ip, "0", 3600); ngx.var.bot_rdns_ok = 0; return
end
-- resty.dns sometimes returns the PTR with a trailing dot — normalize first.
local host = ans[1].ptrdname:gsub("%.$", "")
-- Match against allowed patterns
local ok_pattern = host:match("%.googlebot%.com$") or host:match("%.google%.com$")
or host:match("%.openai%.com$")
or host:match("%.anthropic%.com$")
or host:match("%.applebot%.apple%.com$")
or host:match("%.perplexity%.ai$")
or host:match("%.search%.msn%.com$")
if not ok_pattern then cache:set(ip, "0", 600); ngx.var.bot_rdns_ok = 0; return end
-- Forward DNS to confirm the PTR is honest
local fwd, ferr = r:query(host, { qtype = r.TYPE_A })
if not fwd then cache:set(ip, "0", 60); ngx.var.bot_rdns_ok = 0; return end
for _, rec in ipairs(fwd) do
if rec.address == ip then
cache:set(ip, "1", 3600)
ngx.var.bot_rdns_ok = 1
return
end
end
cache:set(ip, "0", 600)
ngx.var.bot_rdns_ok = 0
}
}
}
Combine $bot_rdns_ok with the UA-claim regex from §3.1 to get the same two-factor outcome (rDNS-verified IP × UA match).
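The Lua block handles IPv4 only and notes that IPv6 differs just in how the query name is encoded. As a small illustration of that encoding, Python's standard ipaddress module exposes the reverse-lookup name for both families; the helper name ptr_query is hypothetical.

```python
import ipaddress

def ptr_query(ip: str) -> tuple[str, str]:
    """Return (PTR query name, record type to use for the forward lookup)."""
    addr = ipaddress.ip_address(ip)
    # reverse_pointer yields in-addr.arpa for IPv4 and nibble-reversed ip6.arpa for IPv6.
    forward_type = "AAAA" if addr.version == 6 else "A"
    return addr.reverse_pointer, forward_type

print(ptr_query("66.249.66.1"))  # ('1.66.249.66.in-addr.arpa', 'A')
```

For an IPv6 address the name is the 32 hex nibbles of the expanded address in reverse order, dot-separated, under ip6.arpa; the rest of the protocol is unchanged.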
3.3 · Cloudflare — use the built-in Verified Bots feature
Cloudflare maintains the IP and rDNS verification machinery internally as part of the Verified Bots programme and the newer AI Crawl Control product. If your site is fronted by Cloudflare, do not roll your own — use the platform signal.
In a Worker:
export default {
async fetch(request, env) {
// Cloudflare's bot-management signals live under cf.botManagement,
// not at the cf object root. The verifiedBot boolean covers the
// rDNS + IP-range check Cloudflare runs against the bot's
// published infrastructure.
const isVerified = request.cf?.botManagement?.verifiedBot === true;
const ua = request.headers.get("user-agent") || "";
if (isVerified && /(googlebot|bingbot|gptbot|claudebot|perplexitybot|applebot|chatgpt-user|claude-user|oai-searchbot)/i.test(ua)) {
// Route to prerender origin. Build a new Request so the Host header
// matches the rewritten URL — passing request as a second arg to
// fetch(url.toString(), request) keeps the old Host and confuses origins.
const url = new URL(request.url);
url.hostname = "prerender.example.com";
return fetch(new Request(url, request));
}
return fetch(request);
}
}
In a WAF custom rule expression:
(cf.bot_management.verified_bot)
and (http.user_agent contains "GPTBot"
or http.user_agent contains "Googlebot"
or http.user_agent contains "ClaudeBot")
The cf.bot_management.verified_bot field (the WAF-side spelling) and request.cf.botManagement.verifiedBot (the Worker-side spelling) refer to the same signal: Cloudflare's combined rDNS + IP-range check against the bot's published infrastructure. The verified-bot directory itself is maintained by Cloudflare; you don't refresh it. botManagement is only fully populated on Enterprise plans, but the simpler verifiedBot boolean is exposed across plans.
3.4 · Fastly VCL — edge dictionary + ACL
This is the pattern PrerenderProxy itself uses. A refresh job populates one ACL per vendor from the vendor JSONs (Fastly's API supports bulk updates); VCL checks client.ip against those ACLs on every request.
acl googlebot_acl {
# Populated by /usr/local/bin/refresh-fastly-acls.py via Fastly API
# see fastly.com/documentation/reference/api/acls
}
acl openai_acl { /* ... */ }
acl anthropic_acl { /* ... */ }
acl perplexity_acl { /* ... */ }
acl applebot_acl { /* ... */ }
acl bingbot_acl { /* ... */ }
sub vcl_recv {
declare local var.ua_claim STRING;
declare local var.ip_vendor STRING;
set var.ua_claim = "";
set var.ip_vendor = "";
if (req.http.User-Agent ~ "(?i)googlebot") { set var.ua_claim = "google"; }
if (req.http.User-Agent ~ "(?i)bingbot") { set var.ua_claim = "bing"; }
if (req.http.User-Agent ~ "(?i)GPTBot|OAI-SearchBot|ChatGPT-User") { set var.ua_claim = "openai"; }
if (req.http.User-Agent ~ "(?i)ClaudeBot|Claude-User|Claude-SearchBot") { set var.ua_claim = "anthropic"; }
if (req.http.User-Agent ~ "(?i)PerplexityBot|Perplexity-User") { set var.ua_claim = "perplexity"; }
if (req.http.User-Agent ~ "(?i)Applebot") { set var.ua_claim = "applebot"; }
if (client.ip ~ googlebot_acl) { set var.ip_vendor = "google"; }
if (client.ip ~ openai_acl) { set var.ip_vendor = "openai"; }
if (client.ip ~ anthropic_acl) { set var.ip_vendor = "anthropic"; }
if (client.ip ~ perplexity_acl) { set var.ip_vendor = "perplexity"; }
if (client.ip ~ applebot_acl) { set var.ip_vendor = "applebot"; }
# Without an ACL for Bing, the "bing" UA claim above could never be verified.
if (client.ip ~ bingbot_acl) { set var.ip_vendor = "bing"; }
if (var.ua_claim != "" && var.ua_claim == var.ip_vendor) {
set req.http.X-Verified-Bot = var.ip_vendor;
set req.backend = F_prerender_backend;
}
}
The refresh job is identical in spirit to the nginx Python script — only the output target changes. On Fastly, use the ACL Entries bulk-update endpoint (PATCH /service/{id}/acl/{acl_id}/entries) to atomically replace the CIDR set in one API call, then activate the new service version. Avoid per-entry upserts: they churn the version graph and you'll hit per-version quota limits within a week.
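A possible shape for the diffing half of that refresh job is sketched below. It only builds the request body and makes no API call. The payload layout (an "entries" list of operations with "op", "id", "ip" and "subnet" fields) is my understanding of the ACL Entries bulk-update endpoint and should be checked against the current Fastly reference; the function name build_acl_patch is hypothetical.

```python
import ipaddress

def build_acl_patch(current: list[dict], desired_cidrs: list[str]) -> dict:
    """Diff existing ACL entries against the desired CIDR set.

    `current` is assumed to be a list of dicts with "id", "ip" and an
    optional "subnet" key, as returned when listing an ACL's entries.
    """
    want = {str(ipaddress.ip_network(c, strict=False)) for c in desired_cidrs}
    have: dict[str, str] = {}
    for entry in current:
        subnet = entry.get("subnet")
        # An entry without a subnet is treated as a single host.
        spec = entry["ip"] if subnet is None else f'{entry["ip"]}/{subnet}'
        have[str(ipaddress.ip_network(spec, strict=False))] = entry["id"]
    # Delete what the vendor no longer publishes, create what is new.
    ops: list[dict] = [{"op": "delete", "id": have[c]} for c in sorted(set(have) - want)]
    for c in sorted(want - set(have)):
        net = ipaddress.ip_network(c)
        ops.append({"op": "create", "ip": str(net.network_address), "subnet": net.prefixlen})
    return {"entries": ops}
```

Sending only the difference keeps each request small, and an unchanged vendor feed produces an empty operation list that the job can skip entirely.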
3.5 · Vercel — Edge Middleware
Vercel's middleware runs at the edge before your function or static asset is served. Embed the allowlist as a build-time JSON import; refresh by re-deploying nightly (or use an Edge Config for runtime updates).
// middleware.ts
import { NextRequest, NextResponse } from "next/server";
import allowlist from "@/lib/bot-allowlist.json"; // built nightly by CI from vendor JSONs
import CIDR from "ip-cidr"; // proven IPv4 + IPv6 matcher
export function middleware(request: NextRequest) {
const ua = request.headers.get("user-agent") ?? "";
// X-Forwarded-For first — NextRequest.ip can be null on Edge dev/local.
// The leftmost IP in XFF is the actual client; trust your CDN to strip downstream.
const ip = (request.headers.get("x-forwarded-for")?.split(",")[0]?.trim()
?? request.ip ?? "").trim();
if (!ip) return NextResponse.next();
const claimedVendor =
/googlebot/i.test(ua) ? "googlebot" :
/gptbot/i.test(ua) ? "gptbot" :
/chatgpt-user/i.test(ua) ? "chatgpt_user" :
/oai-searchbot/i.test(ua) ? "oai_searchbot" :
/claudebot/i.test(ua) ? "anthropic" :
/claude-user/i.test(ua) ? "anthropic" :
/perplexitybot/i.test(ua) ? "perplexitybot" :
/perplexity-user/i.test(ua) ? "perplexity_user" :
/applebot/i.test(ua) ? "applebot" :
/bingbot/i.test(ua) ? "bingbot" : "";
if (!claimedVendor) return NextResponse.next();
const cidrs = (allowlist as Record<string, string[]>)[claimedVendor] ?? [];
// ip-cidr handles both IPv4 and IPv6 cleanly — don't reinvent bit math here,
// crawlers do use IPv6 and naïve split('.') drops half the traffic silently.
const verified = cidrs.some(c => { try { return new CIDR(c).contains(ip); } catch { return false; } });
if (verified) {
const url = request.nextUrl.clone();
url.pathname = `/__prerender${url.pathname}`; // your rewrite target
return NextResponse.rewrite(url);
}
return NextResponse.next();
}
export const config = { matcher: "/((?!_next/static|api/|favicon).*)" };
3.6 · AWS — CloudFront Function (viewer-request)
CloudFront Functions run synchronously per request and are billed per million invocations rather than per-millisecond, so they're cheaper than Lambda@Edge for this kind of small allowlist check. The function is restricted to no-network operations — the allowlist must be embedded.
// viewer-request CloudFront Function · IPv4-only by design
function handler(event) {
var req = event.request;
var ua = (req.headers["user-agent"] && req.headers["user-agent"].value) || "";
var ip = event.viewer.ip;
// CloudFront Functions don't have 128-bit integers — bypass IPv6 here and
// either run a parallel Lambda@Edge function for IPv6 or accept the
// (small but real) v6 traffic share as unverified. For Googlebot and
// Applebot in particular, this means some legitimate bot traffic falls
// through to the standard path. Tune to your traffic mix.
if (ip.indexOf(":") !== -1) return req;
// Embedded build-time allowlist. Replace at deploy time with a fetched copy
// of the vendor JSONs. Truncated here.
var allow = {
googlebot: ["66.249.64.0/19"],
gptbot: ["20.171.207.0/24"]
// anthropic, perplexity, applebot, bingbot, oai_searchbot, chatgpt_user...
};
var claim = null;
if (/googlebot/i.test(ua)) claim = "googlebot";
else if (/gptbot/i.test(ua)) claim = "gptbot";
else if (/claudebot/i.test(ua)) claim = "anthropic";
// ... extend as needed
if (!claim) return req;
function ipv4ToInt(s) {
var p = s.split(".");
if (p.length !== 4) return -1;
return ((((+p[0]) << 24) | ((+p[1]) << 16) | ((+p[2]) << 8) | (+p[3])) >>> 0);
}
var ipn = ipv4ToInt(ip);
if (ipn < 0) return req;
var cidrs = allow[claim] || [];
for (var i = 0; i < cidrs.length; i++) {
var parts = cidrs[i].split("/");
var bn = ipv4ToInt(parts[0]);
var bits = parseInt(parts[1], 10);
if (bn < 0 || isNaN(bits)) continue;
var mask = bits === 0 ? 0 : ((0xffffffff << (32 - bits)) >>> 0);
if ((ipn & mask) === (bn & mask)) {
req.headers["x-verified-bot"] = { value: claim };
req.uri = "/prerender" + req.uri;
return req;
}
}
return req;
}
For IPv6 support and rDNS lookups, switch to Lambda@Edge — the JS runtime has full BigInt and arbitrary DNS-resolver libraries; the trade-off is per-millisecond billing vs. CloudFront Functions' flat per-invocation rate.
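To make the integer-width point concrete, here is the same shift-and-mask test written in Python, where integers are arbitrary precision and one code path covers both address families. It is an offline illustration of the arithmetic, not CloudFront code, and the function name in_cidr is hypothetical.

```python
import ipaddress

def in_cidr(ip: str, cidr: str) -> bool:
    """Shift-and-mask CIDR membership test for IPv4 and IPv6."""
    addr = ipaddress.ip_address(ip)
    base, _, bits = cidr.partition("/")
    net = ipaddress.ip_address(base)
    if addr.version != net.version:
        return False
    width = 32 if addr.version == 4 else 128
    prefix = int(bits) if bits else width  # a bare address means a single host
    # Top `prefix` bits set, the rest clear; a /0 yields mask 0 and matches everything.
    mask = ((1 << prefix) - 1) << (width - prefix)
    return (int(addr) & mask) == (int(net) & mask)

print(in_cidr("66.249.64.12", "66.249.64.0/19"))  # True
print(in_cidr("2001:db8::1", "2001:db8::/32"))    # True
```

The 128-bit mask is the part that needs BigInt in JavaScript; with it, the IPv4 loop in the function above carries over unchanged.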
3.7 · Apache — IP allowlist (no rDNS without mod_perl)
Apache 2.4's mod_authz_host supports double-reverse DNS natively via Require host googlebot.com. The argument is a domain suffix matched on whole name components, not a glob, so write googlebot.com rather than *.googlebot.com. That single directive performs the rDNS-then-forward-DNS round trip and admits the request only if the forward lookup resolves back to the original IP. Use that for a vanilla rDNS gate. The harder part is combining rDNS with the vendor's published IP-range JSON for a two-factor check (or doing CIDR-based allowlist matching inside mod_rewrite at all): the txt: form of RewriteMap does exact string lookups only, not CIDR matching, so a line like 66.249.64.0/19 googlebot in a txt map will never match a client IP of 66.249.64.12. Two practical options for the CIDR side:
Option A — pre-expand to a flat IP list (small vendors only; impractical for Google's /19s):
RewriteEngine On
RewriteMap botips "txt:/etc/apache2/bot-ips.txt"
# bot-ips.txt content (one line per individual IP — refreshed nightly):
# 66.249.64.1 googlebot
# 66.249.64.2 googlebot
# ...
# 20.171.207.5 gptbot
RewriteCond ${botips:%{REMOTE_ADDR}|none} !^none$
RewriteCond %{HTTP_USER_AGENT} (Googlebot|GPTBot|ClaudeBot|PerplexityBot|Applebot|bingbot) [NC]
RewriteRule ^(.*)$ /__prerender$1 [E=VERIFIED_BOT:%1,L]
Option B — a prg: external daemon (production-grade; the only sane choice at vendor scale). Apache forks a long-running process and pipes lookup requests through stdin/stdout; the daemon does the actual CIDR-trie match against an in-memory representation of all vendor JSONs:
RewriteMap botips "prg:/usr/local/bin/bot-ip-lookup"
# bot-ip-lookup is a small Python/Go program that reads an IP per line on stdin
# and emits "vendor\n" or "NULL\n" per line. Keep it long-running and flush
# stdout after every reply; Apache starts it once at server startup.
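A minimal sketch of what bot-ip-lookup could look like in Python follows. It answers one lookup per line and flushes after each reply, which the prg: protocol requires. Two simplifications are worth naming: a linear scan stands in for the CIDR trie, and the allowlist is passed in as a dict rather than loaded from the vendor JSONs. The function names are hypothetical.

```python
import ipaddress
import sys
from typing import Mapping, Sequence, TextIO

def build_table(allowlist: Mapping[str, Sequence[str]]) -> list:
    """Flatten {"vendor": ["cidr", ...]} into (network, vendor) pairs."""
    return [
        (ipaddress.ip_network(cidr, strict=False), vendor)
        for vendor, cidrs in allowlist.items()
        for cidr in cidrs
    ]

def lookup(ip: str, table: list) -> str:
    """Return the vendor slug for ip, or "NULL" (RewriteMap's no-match token)."""
    try:
        addr = ipaddress.ip_address(ip.strip())
    except ValueError:
        return "NULL"
    # Linear scan for brevity; swap in a trie once the table grows large.
    for net, vendor in table:
        if addr.version == net.version and addr in net:
            return vendor
    return "NULL"

def serve(table: list, stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Answer one lookup per input line until Apache closes the pipe."""
    for line in stdin:
        stdout.write(lookup(line, table) + "\n")
        stdout.flush()  # Apache blocks until it has read the reply
```

A deployed script would end with a call such as serve(build_table(allowlist)) after loading the allowlist from wherever the nightly refresh writes it.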
Apache is the platform where this work is hardest. If you're starting from a blank Apache install, the cleanest answer is "put Cloudflare or a small nginx box in front", which inherits one of the recipes above.
4 · PrerenderProxy ships this natively
The recipes above are what you write yourself if you operate your own edge. PrerenderProxy bundles the equivalent: vendor JSON refresh runs as a background job; the Fastly ACLs are populated from it; every request the system classifies as "verified bot" has gone through the IP-range + UA-claim two-factor check before it hits the Puppeteer prerender service. There is no UA-only mode shipped — if you allowlist a bot in our config, you allowlist its IP infrastructure, not its UA string.
The decision to make rDNS verification mandatory rather than configurable was deliberate. In our top-100 audit, the sites that took a content-quality hit from AI-bot impersonation were the ones with UA-only matching turned on; the sites with rDNS gates were unaffected by the same scraper traffic. Adding "UA only, please" as a checkbox would let users opt themselves into a footgun. We chose to ship without it.
5 · Operational tips that the recipes don't include
- Cache the refresh failure case, not the success case. If a vendor JSON returns a 5xx for an hour, you don't want your allowlist to evaporate. Keep the previous file in place and emit a metric so you notice the staleness later.
- Update on a randomized offset. The RandomizedDelaySec=30m in the systemd timer above is not a stylistic choice: it spreads the load on the vendor's static-JSON endpoint and avoids becoming a thundering herd if every customer's refresher fires at 04:00:00 UTC exactly.
- Log the verification outcome with a single tag per request. An X-Verified-Bot: googlebot response header (stripped before delivery, but kept in your internal logs) makes "what bots actually hit us today?" a single Loki / Splunk query.
- Don't 403 a failed verification by default. The right default is "treat as untrusted; serve the regular SPA, just don't route to the prerender variant". 403 by default means a real Googlebot from an IP that hasn't been added to the published list yet (rare but it happens) gets a content-loss spike.
- Test the failure mode. A simple curl from your own machine claiming to be Googlebot should land on the non-bot path. Add this as a synthetic check; alert if it ever flips.
6 · Summary
UA-only matching is a security mistake that has cost real production sites real content-loss incidents in 2024–2026. The fix is well-defined: rDNS + forward-DNS + vendor-published IP-range allowlist, refreshed daily, applied as a two-factor gate alongside the UA claim. The recipes above are working starting points for the seven most common platforms; PrerenderProxy ships the equivalent out of the box if you'd rather not run the refresh job yourself.
Related: Bot Directory · Brand robots.txt strategy · Publisher decision framework