MEMORY · Common Crawl Foundation

CCBot

A nonprofit-operated crawl that ingests the public web monthly and releases the resulting ~3 PB corpus on AWS S3. Most large language models trained through 2024 that disclose their training data list Common Crawl as a major source.

Specs

Vendor: Common Crawl Foundation
Category: MEMORY
robots.txt token: CCBot
Renders JavaScript: no (HTTP only)
Honors robots.txt: yes
Reverse-DNS pattern: *.commoncrawl.org

User-Agent string

CCBot/2.0 (https://commoncrawl.org/faq/)
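
Any client can send this User-Agent string, so the reverse-DNS pattern from the specs is the stronger signal that a request really comes from CCBot. The sketch below is one possible way to combine the two checks in Python. The function names (`claims_ccbot`, `hostname_matches`, `verify_ip`) are hypothetical. The forward-confirmation step is an assumption borrowed from common crawler-verification practice, not something this page says Common Crawl requires.

```python
import socket

CCBOT_TOKEN = "CCBot"
# Suffix derived from the reverse-DNS pattern *.commoncrawl.org listed in the specs.
RDNS_SUFFIX = ".commoncrawl.org"


def claims_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header contains the CCBot token."""
    return CCBOT_TOKEN.lower() in user_agent.lower()


def hostname_matches(hostname: str) -> bool:
    """Return True if hostname is a subdomain of commoncrawl.org."""
    host = hostname.rstrip(".").lower()
    return host.endswith(RDNS_SUFFIX) and len(host) > len(RDNS_SUFFIX)


def verify_ip(ip: str) -> bool:
    """Forward-confirmed reverse DNS (assumed procedure, IPv4 only).

    1. Reverse lookup of the IP to get a hostname.
    2. Check the hostname against the *.commoncrawl.org pattern.
    3. Forward lookup of that hostname and confirm it maps back to the IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname_matches(hostname):
        return False
    try:
        _, _, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip in addresses
```

A request would be treated as genuine CCBot traffic only when `claims_ccbot(header)` and `verify_ip(remote_ip)` are both true. The suffix check includes the leading dot, so lookalike hosts such as `commoncrawl.org.evil.example` do not match.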

Considerations

  • Blocking CCBot is the single most upstream way to opt out of LLM training broadly. Content already in past snapshots stays in those datasets.
  • Volume is moderate — one full crawl per month, not continuous.
  • Many sites that adopted 'Block AI Bots' WAF presets in 2024 did not also block CCBot, leaving the most upstream feed open.

robots.txt recipe

User-agent: CCBot
Disallow: /
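
To sanity-check the recipe before deploying it, the rule can be run through Python's standard-library `urllib.robotparser`. This is a minimal sketch: the helper `is_blocked` and the `example.com` URL are hypothetical. Python's parser is only an approximation of how CCBot itself interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The recipe from this page, as it would appear in robots.txt.
RECIPE = """\
User-agent: CCBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    ccbot_ua = "CCBot/2.0 (https://commoncrawl.org/faq/)"
    print("CCBot blocked:", is_blocked(RECIPE, ccbot_ua, "https://example.com/page"))
    print("Other bot blocked:", is_blocked(RECIPE, "Googlebot", "https://example.com/page"))
```

The recipe targets only the CCBot token, so other crawlers are unaffected unless they have their own rules.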

Sources: Common Crawl · About CCBot
