CCBot (Common Crawl)

Vendor: Common Crawl
Type: AI training
JavaScript rendering: no
Honors robots.txt: yes


Vendor	Common Crawl Foundation (501(c)(3) nonprofit)
Type	Open web crawler — produces freely-available web corpus used by most AI training datasets
robots.txt token	`CCBot`
JavaScript rendering	No
Honors robots.txt	Yes
Vendor docs	commoncrawl.org/ccbot

User-Agent string

CCBot/2.0 (https://commoncrawl.org/faq/)
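To spot CCBot in server logs, matching on the `CCBot/` product token at the start of the header is usually enough. The following is a minimal Python sketch, assuming the User-Agent format shown above. The function name `is_ccbot` is hypothetical, and because a User-Agent header can be spoofed, this is a heuristic rather than a verification of the crawler's identity.

```python
import re

# Matches the product token "CCBot/<version>" at the start of the header,
# e.g. "CCBot/2.0 (https://commoncrawl.org/faq/)".
# Assumption: the crawler keeps the "CCBot/<digits>" prefix across versions.
_CCBOT_RE = re.compile(r"^CCBot/\d+(\.\d+)*\b")


def is_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be CCBot."""
    return bool(_CCBOT_RE.match(user_agent.strip()))
```

A log-processing script could call `is_ccbot(line_user_agent)` on each request to estimate how much traffic comes from the crawler.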

Purpose

Common Crawl is a nonprofit that crawls the public web roughly monthly and releases each resulting snapshot (billions of pages, hundreds of terabytes uncompressed) as a freely downloadable dataset on AWS S3. The dataset is a foundational training corpus behind most large language models trained through 2024 (GPT-3, GPT-4, Claude 1/2, LLaMA 1/2/3, and most open LLMs are reported to draw on Common Crawl as a major source).

Therefore: blocking CCBot is the most upstream way a site can opt out of LLM training broadly — but the opt-out only affects future crawls and future training runs. Content already in past Common Crawl snapshots remains in those datasets.

Quirks

Volume is moderate — one full crawl per month, not continuous.
Honors robots.txt strictly.
Many sites that adopted "Block AI Bots" in 2024 did NOT also block CCBot, leaving the most upstream training feed open.

How to allow / block

To opt out of LLM training upstream:

User-agent: CCBot
Disallow: /
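One way to confirm that the rule behaves as intended is to parse it with Python's standard-library `urllib.robotparser`. This is only a sketch: the helper name `can_fetch` and the `example.com` URL are placeholders, and the standard-library parser may not match CCBot's own robots.txt handling in every edge case.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from this section.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# CCBot is blocked everywhere; agents with no matching group stay allowed.
ccbot_allowed = can_fetch(ROBOTS_TXT, "CCBot", "https://example.com/page")
other_allowed = can_fetch(ROBOTS_TXT, "Googlebot", "https://example.com/page")
print(ccbot_allowed, other_allowed)
```

To allow CCBot explicitly instead, the same group can use an empty `Disallow:` line, which places no restriction on that user agent.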
