CCBot
A nonprofit-operated crawler that fetches the public web roughly monthly and releases the resulting ~3 PB corpus on AWS S3. Most large language models with documented training data through 2024 cite Common Crawl as a major source.
Specs
| Vendor | Common Crawl Foundation |
| Category | MEMORY |
| robots.txt token | CCBot |
| Renders JavaScript | no (HTTP fetch only) |
| Honors robots.txt | yes |
| Reverse-DNS pattern | *.commoncrawl.org |
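The reverse-DNS pattern in the table can be used to check whether a request claiming to be CCBot really comes from Common Crawl. The following is a minimal sketch, assuming the `*.commoncrawl.org` pattern listed above and IPv4 addresses only; the function names `hostname_matches` and `verify_ccbot_ip` are hypothetical, chosen for illustration. It does a reverse lookup, checks the hostname suffix, then confirms the hostname resolves back to the same IP.

```python
import socket

# Assumption: suffix taken from the reverse-DNS pattern in the spec table.
CCBOT_SUFFIX = ".commoncrawl.org"


def hostname_matches(hostname: str, suffix: str = CCBOT_SUFFIX) -> bool:
    """Return True if hostname ends with the expected suffix (case-insensitive)."""
    return hostname.rstrip(".").lower().endswith(suffix)


def verify_ccbot_ip(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for an IPv4 address (needs network)."""
    try:
        # Reverse lookup: IP -> hostname
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname_matches(hostname):
        return False
    try:
        # Forward lookup: hostname -> list of IPv4 addresses
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    # The claimed IP must be among the addresses the hostname resolves to.
    return ip in addresses
```

The suffix check alone is not enough, since anyone controlling reverse DNS for their own IP range can set an arbitrary hostname; the forward lookup is what ties the hostname back to the connecting address.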
User-Agent string
CCBot/2.0 (https://commoncrawl.org/faq/)
Considerations
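- Blocking CCBot is the most upstream way to opt out of LLM training broadly, because many training datasets are derived from Common Crawl snapshots. The block is not retroactive: content already in past snapshots stays in those datasets.
- Volume is moderate: one full crawl per month rather than continuous crawling.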
- Many sites that adopted 'Block AI Bots' WAF presets in 2024 did not also block CCBot, leaving the most upstream feed open.
robots.txt recipe
User-agent: CCBot
Disallow: /
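
Before deploying the recipe, it can be sanity-checked locally. The sketch below uses Python's standard `urllib.robotparser` to confirm that the two lines above disallow CCBot while leaving other agents unaffected; the helper name `is_allowed` and the `example.com` URL are placeholders for illustration, and this only shows how Python's parser reads the rules, not how CCBot itself parses them.

```python
from urllib.robotparser import RobotFileParser

# The recipe from this page, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "CCBot/2.0 (https://commoncrawl.org/faq/)", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

Because the recipe names only CCBot and has no `User-agent: *` group, agents other than CCBot remain allowed by default.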
Sources: Common Crawl · About CCBot