Bot reference catalog

One-page summaries of 30 common search, AI/LLM, and social-preview crawlers.


CCBot (Common Crawl)

Vendor: Common Crawl Foundation (501(c)(3) nonprofit)
Type: Open web crawler (AI training); produces a freely available web corpus used by most AI training datasets
robots.txt token: CCBot
JavaScript rendering: No
Honors robots.txt: Yes
Vendor docs: commoncrawl.org/ccbot

User-Agent string

CCBot/2.0 (https://commoncrawl.org/faq/)
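
As a rough illustration of how the string above might be recognized in access logs, the following Python sketch matches the CCBot token case-insensitively. The function name looks_like_ccbot is hypothetical and not part of any Common Crawl tooling. Because the User-Agent header is supplied by the client, a match only indicates that a request claims to be CCBot.

```python
import re

# Match the "CCBot" product token anywhere in the header, ignoring case.
CCBOT_TOKEN = re.compile(r"\bCCBot\b", re.IGNORECASE)


def looks_like_ccbot(user_agent: str) -> bool:
    """Return True if the User-Agent header claims to be CCBot.

    Hypothetical helper for illustration: the header is client-supplied
    and can be spoofed, so this is a hint, not verification.
    """
    return bool(CCBOT_TOKEN.search(user_agent))


if __name__ == "__main__":
    print(looks_like_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))
    print(looks_like_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```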

Purpose

Common Crawl is a nonprofit that crawls the public web monthly and releases each resulting snapshot as a freely downloadable dataset on AWS S3. A single snapshot covers billions of pages and runs to hundreds of terabytes; the cumulative archive spans petabytes. The dataset is the foundational training corpus behind essentially every large language model trained through 2024: GPT-3, GPT-4, Claude 1/2, LLaMA 1/2/3, and most open LLMs are reported to use Common Crawl as a major source.

Blocking CCBot is therefore the most upstream way a site can opt out of LLM training broadly. The opt-out only affects future crawls and future training runs, however: content already captured in past Common Crawl snapshots remains in those datasets.

Quirks

A robots.txt that does not name CCBot leaves the most upstream training feed open.

How to allow / block

To opt out of LLM training upstream:

User-agent: CCBot
Disallow: /
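
One way to sanity-check the rule above before deploying it is to feed it to Python's standard urllib.robotparser module. This is a minimal sketch: the helper name is_allowed and the example.com URL are assumptions made for illustration, and Python's parser only approximates how CCBot itself interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The rule from this section, inlined so the check needs no network access.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: it reflects the standard-library parser's
    reading of the file, not necessarily the crawler's own.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/page"  # placeholder URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", url))
    print("Googlebot allowed:", is_allowed(ROBOTS_TXT, "Googlebot", url))
```

With only the CCBot group present, other crawlers remain unaffected, which is why the second check uses a different user agent.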

Sources