CCBot — User-agent, IPs, robots.txt · PrerenderProxy Bot Directory

A nonprofit-operated crawl that ingests the public web monthly and releases the resulting ~3 PB corpus on AWS S3. Most large language models trained through 2024 cite Common Crawl as a major data source.

Specs

Vendor	Common Crawl Foundation
Category	MEMORY
robots.txt token	`CCBot`
Renders JavaScript	no (HTTP fetch only)
Honors robots.txt	yes
Reverse-DNS pattern	`*.commoncrawl.org`
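
The reverse-DNS pattern above can be used for a forward-confirmed reverse DNS check on a client IP. The following is a minimal sketch, assuming the `*.commoncrawl.org` pattern listed here holds for the crawler's addresses. The function name `is_verified_ccbot` and the injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical and exist only so the logic can be exercised without network access.

```python
import socket

# Suffix taken from the reverse-DNS pattern in the specs table (assumed accurate).
CC_SUFFIX = ".commoncrawl.org"


def is_verified_ccbot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a forward-confirmed reverse DNS check.

    Step 1: the PTR record for the IP must end in .commoncrawl.org.
    Step 2: that hostname must resolve back to the same IP.
    """
    if reverse_lookup is None:
        # gethostbyaddr returns (hostname, aliases, addresses)
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses); IPv4 only
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    if not hostname.endswith(CC_SUFFIX):
        return False

    try:
        return ip in forward_lookup(hostname)
    except OSError:
        return False
```

Calling `is_verified_ccbot(client_ip)` with the defaults performs live DNS lookups, so results are worth caching per IP. The forward step matters because anyone who controls the reverse zone for their own address space can publish a PTR record that merely claims a `commoncrawl.org` name.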

User-Agent string

CCBot/2.0 (https://commoncrawl.org/faq/)
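
For log analysis, a request can be flagged as claiming to be CCBot by matching the product token in the User-Agent header. This is a rough sketch: the helper name `claims_to_be_ccbot` is hypothetical, and the pattern assumes the `CCBot/<version>` form shown above. A match only shows what the client claims, since the header is trivially spoofed.

```python
import re

# Matches the "CCBot/<version>" product token, e.g. "CCBot/2.0".
CCBOT_UA = re.compile(r"\bCCBot/\d+(?:\.\d+)*", re.IGNORECASE)


def claims_to_be_ccbot(user_agent):
    """Return True if the User-Agent header contains a CCBot product token."""
    return bool(CCBOT_UA.search(user_agent or ""))
```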

Considerations

Blocking CCBot is the furthest-upstream way to opt out of LLM training broadly, because so many training sets draw on Common Crawl. Content already in past snapshots stays in those datasets.
Volume is moderate: one full crawl per month, not continuous.
Many sites that adopted 'Block AI Bots' WAF presets in 2024 did not also block CCBot, leaving this upstream feed open.

robots.txt recipe

User-agent: CCBot
Disallow: /
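
To sanity-check the recipe before deploying it, the rules can be evaluated with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the helper `is_allowed`, the constant `RECIPE`, and the `example.com` URL are illustrative names, and the parser's matching may differ in edge cases from what CCBot itself implements.

```python
from urllib.robotparser import RobotFileParser

# The recipe from this page, as it would appear in robots.txt.
RECIPE = """User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt, url, agent="CCBot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(RECIPE, "https://example.com/page"))               # CCBot
    print(is_allowed(RECIPE, "https://example.com/page", "Googlebot"))  # other crawler
```

With only this group present, the rule applies to CCBot alone; crawlers that match no group remain allowed.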

Sources: Common Crawl · About CCBot
