CCBot (Common Crawl)
| Vendor | Common Crawl Foundation (501(c)(3) nonprofit) |
| Type | Open web crawler; produces a freely available web corpus used by most AI training datasets |
| robots.txt token | CCBot |
| JavaScript rendering | No |
| Honors robots.txt | Yes |
| Vendor docs | commoncrawl.org/ccbot |
User-Agent string
CCBot/2.0 (https://commoncrawl.org/faq/)
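To spot this crawler in access logs, matching on the `CCBot/<version>` product token is usually enough. The following is a minimal sketch under that assumption; the helper name `is_ccbot` is hypothetical. User-Agent headers can be spoofed, so this identifies traffic that claims to be CCBot, not verified CCBot traffic.

```python
import re

# Matches the "CCBot/<version>" product token anywhere in a User-Agent header.
# The pattern is an assumption based on the documented string "CCBot/2.0 (...)".
_CCBOT_TOKEN = re.compile(r"\bCCBot/\d+(?:\.\d+)*", re.IGNORECASE)


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be CCBot."""
    return bool(_CCBOT_TOKEN.search(user_agent or ""))
```

Calling `is_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)")` returns `True`, while an ordinary browser User-Agent returns `False`.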
Purpose
Common Crawl is a nonprofit that crawls the public web roughly monthly and releases the resulting corpus as a freely downloadable dataset on AWS S3. Each snapshot holds billions of pages and is on the order of hundreds of terabytes; the full archive spans petabytes. The dataset, usually in filtered form, is a foundational training source for large language models trained through 2024: GPT-3, the original LLaMA, and many open LLMs explicitly list Common Crawl as a major source, and models whose training data is not fully disclosed (GPT-4, Claude 1/2, later LLaMA releases) are widely believed to draw on it as well.
Blocking CCBot is therefore the most upstream way a site can opt out of LLM training broadly. The opt-out only affects future crawls and future training runs, however: content already in past Common Crawl snapshots remains in those datasets.
Quirks
- Volume is moderate — one full crawl per month, not continuous.
- Honors robots.txt strictly.
- Many sites that adopted "Block AI Bots" in 2024 did NOT also block
CCBot, leaving the most upstream training feed open.
How to allow / block
To opt out of LLM training upstream:
User-agent: CCBot
Disallow: /
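The rule above can be sanity-checked before deployment. This is a minimal sketch using Python's standard-library `urllib.robotparser`. The `ALLOW_RULES` variant, the helper name `can_fetch`, and the `example.com` URL are illustrative assumptions, and CCBot's own parser may differ in edge cases from the standard-library one.

```python
from urllib.robotparser import RobotFileParser

# The blocking rule from this section.
BLOCK_RULES = """\
User-agent: CCBot
Disallow: /
"""

# Assumed counterpart for explicitly allowing CCBot (not from the original text).
ALLOW_RULES = """\
User-agent: CCBot
Allow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt content and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


url = "https://example.com/any/page"  # placeholder URL
ccbot_blocked = not can_fetch(BLOCK_RULES, "CCBot", url)
others_unaffected = can_fetch(BLOCK_RULES, "Googlebot", url)
ccbot_allowed = can_fetch(ALLOW_RULES, "CCBot", url)
```

With these rules, `ccbot_blocked`, `others_unaffected`, and `ccbot_allowed` should all be `True`. A `User-agent: CCBot` group applies only to that token, so other crawlers need their own groups.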