What to log when you serve bots — PrerenderProxy schema + Combot.ai integration

An nginx access log is a single line of who-asked-what. That's enough to debug a 500. It is not enough to operate a prerender layer: when something goes wrong — a vendor IP allowlist drifts, an AI bot starts pulling 10× the normal volume, the cache hit rate quietly collapses overnight, the bot HTML stops matching the user HTML — the access log shows you the symptom (a 500, a slow response, a backend error) but not the cause. The cause lives in the bot-classification path, the verification result, the cache-key dimension, the render-step duration, and the SHA of the served body. Those fields don't exist in the default log format. You have to emit them.

This is the schema PrerenderProxy emits, the pipeline that gets it into Combot.ai as another data source alongside Search Console and GA, the five anomaly classes that actually warrant pager-grade alerts, and the alert thresholds we ship with. Code samples are pasteable starting points.

1 · Why the access log isn't enough

A request that ended in HTTP 200 can still be a quiet failure: a bot was served the SPA shell instead of the prerendered HTML because the rDNS lookup timed out and the system fell through to the default route. The access log shows 200, the body length looks plausible, the response time is fine. The bot just got a useless empty <div id="root"> instead of the indexed page. You only find out about that failure mode when GSC starts showing "indexed but no content" three weeks later, or when ChatGPT's recommendations stop including you.

The fix is structured logging at the prerender layer specifically. Every classification decision, every verification outcome, every render-or-cache choice, every served-body fingerprint gets a field. The log line becomes a single JSON object that fully describes what happened and why.

2 · The PrerenderProxy event schema

Every served request emits one JSON line on the internal prerenderproxy.events stream. The schema is intentionally flat — nested objects are hostile to most log indexers, and the few fields that would benefit from nesting (the verification chain, the cache details) are encoded as dotted keys or short enums.

{
  "ts":              "2026-05-19T08:42:13.471Z",
  "site":            "example.com",
  "request_id":      "01HXY5W3K4M7BRDZ9T0Q3Z4V8R",
  "url":             "/products/widget-a",
  "method":          "GET",

  "client_ip":       "66.249.66.1",
  "client_ip_class": "datacenter",
  "geo_country":     "US",

  "ua_raw":          "Mozilla/5.0 ... Googlebot/2.1 ...",
  "ua_claim":        "googlebot",

  "bot_class":       "verified_bot",
  "bot_vendor":      "googlebot",
  "verification":    "both",
  "verify_ms":       2,

  "route":           "prerender",
  "cache":           "hit",
  "cache_key":       "v1:googlebot:/products/widget-a",
  "cache_age_s":     1843,

  "render_ms":       null,
  "render_origin":   null,
  "render_attempts": 0,
  "render_queue_ms": 0,
  "concurrent_renders": 4,
  "error_code":      null,
  "error_reason":    null,

  "status":          200,
  "body_bytes":      52341,
  "body_sha256":     "a4f7b...",
  "drift_vs_user":   false,

  "served_by":       "edge",
  "duration_ms":     34
}

Enum values: client_ip_class ∈ {datacenter, residential, mobile, tor, unknown}; bot_class ∈ {verified_bot, spoofed_ua, user, unknown}; verification ∈ {ip_range, rdns, both, none}; route ∈ {spa, prerender, legacy_passthrough, blocked}; cache ∈ {hit, miss, stale, bypass, error}; served_by ∈ {edge, shield, origin}. error_code follows Puppeteer/Chrome conventions (ERR_TIMEOUT, ERR_NAVIGATION, ERR_CHROME_CRASH, …).
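The enum sets above are easy to check mechanically before an event reaches the index. The following is a minimal sketch, assuming events arrive as one JSON object per line; the helper name enum_violations is hypothetical and not part of PrerenderProxy.

```python
import json
from typing import List

# Enum sets copied from the schema description above.
ENUMS = {
    "client_ip_class": {"datacenter", "residential", "mobile", "tor", "unknown"},
    "bot_class": {"verified_bot", "spoofed_ua", "user", "unknown"},
    "verification": {"ip_range", "rdns", "both", "none"},
    "route": {"spa", "prerender", "legacy_passthrough", "blocked"},
    "cache": {"hit", "miss", "stale", "bypass", "error"},
    "served_by": {"edge", "shield", "origin"},
}


def enum_violations(line: str) -> List[str]:
    """Return one message per enum field whose value is outside its allowed set."""
    try:
        event = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not JSON: {exc.msg}"]
    if not isinstance(event, dict):
        return ["not a JSON object"]
    problems = []
    for field, allowed in ENUMS.items():
        value = event.get(field)
        # A missing field shows up as None and is reported like any other bad value.
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}={value!r} not in {sorted(allowed)}")
    return problems
```

An empty list means every enum field carried a known value; anything else is a candidate for a dead-letter stream rather than the main index.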

The fields that don't exist in a vanilla nginx log and that you should not skip:

bot_class + verification: was this a real bot, and how did we decide? spoofed_ua is the catch for "claimed to be Googlebot but the IP didn't match"; once your traffic has settled into a normal pattern, that bucket is the most informative anomaly signal you have.
route: which decision we took. prerender means the bot was served prerendered HTML, either from the prerender cache or from a fresh render on the Puppeteer fleet; spa means we passed through unchanged; legacy_passthrough means the origin SPA was served because the prerender backend was unhealthy; blocked means we 4xx'd at the edge.
cache + cache_key + cache_age_s: the cache state and the actual key used. The key encodes the prerender-cache axes (vendor, mobile/desktop, locale) so debugging "why did this bot get an old snapshot" becomes a single SELECT.
render_ms + render_queue_ms + concurrent_renders: the Puppeteer step's wall time, the time it spent waiting for a free worker, and the fleet occupancy at the time of render. render_ms creep on its own is ambiguous (memory leak vs. load saturation); the three fields together disambiguate cleanly.
error_code + error_reason: Puppeteer / Chrome failure specifics on the prerender path. ERR_TIMEOUT ≠ ERR_CHROME_CRASH ≠ ERR_PROXY_BLOCK ≠ ERR_UNHANDLED_REJECTION; without these fields the incident channel is a guessing game.
body_sha256 + drift_vs_user: the SHA-256 of the served body, and whether it diverges from the most recent SHA the system observed for the same URL on the SPA path. drift_vs_user = true with no recent deploy is the alarm that says "your bot version and your user version are diverging".
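How the two verification checks could map onto bot_class and verification can be sketched in a few lines. This is an illustration under stated assumptions, not PrerenderProxy's actual decision logic: the function name classify is hypothetical, and treating an rDNS timeout as "unknown" rather than "spoofed_ua" is a design choice made here so that a slow resolver does not inflate the spoof rate.

```python
from typing import Optional, Tuple


def classify(ua_claim: Optional[str], ip_in_range: bool,
             rdns_ok: Optional[bool]) -> Tuple[str, str]:
    """Return (bot_class, verification) for one request.

    ua_claim    -- vendor the User-Agent claims to be, or None for no bot claim
    ip_in_range -- client IP is inside the vendor's published ranges
    rdns_ok     -- True/False for a completed reverse-DNS check, None on timeout
    """
    if ua_claim is None:
        return ("user", "none")
    if ip_in_range and rdns_ok is True:
        return ("verified_bot", "both")
    if ip_in_range:
        return ("verified_bot", "ip_range")
    if rdns_ok is True:
        return ("verified_bot", "rdns")
    if rdns_ok is None:
        # Assumption: an inconclusive lookup is logged as unknown, not spoofed.
        return ("unknown", "none")
    return ("spoofed_ua", "none")
```

Logging the inconclusive case as its own class is what makes the "rDNS timed out and we fell through to the default route" failure from section 1 visible in the events instead of hiding inside a 200.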

3 · The pipeline — file → Vector → Elasticsearch → Combot.ai

Where the events flow:

┌───────────────┐       ┌─────────┐       ┌──────────────────┐
│ Fastly VCL    │──┐    │ Vector  │──┐    │ Elasticsearch    │
│ vcl_log       │  │    │ pipeline│  │    │ combot-prp-*     │
└───────────────┘  ├──→ │ (parse, │  ├──→ │  (90d retention) │──┐
┌───────────────┐  │    │ enrich, │  │    └──────────────────┘  │
│ nginx custom  │──┤    │ route)  │  │                          │
│ log_format    │  │    └─────────┘  │    ┌──────────────────┐  ├──→ Combot.ai
└───────────────┘  │                  ├──→ │ Combot.ai ingest │──┘    analytics
┌───────────────┐  │                  │    │ /webhook/prp     │       layer
│ Puppeteer     │──┘                  │    └──────────────────┘
│ stderr JSON   │                     │
└───────────────┘                     │    ┌──────────────────┐
                                      └──→ │ S3 cold archive  │
                                           │ (ndjson+gz, 2y)  │
                                           └──────────────────┘

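Three emitters (the Fastly VCL log directive, nginx log_format, and the Puppeteer service's stderr) all produce the same JSON shape. The Vector pipeline is where that shape is normalized: it stamps a schema_version and fills in a default bot_class. Note that the snippet below covers only the Fastly and nginx sources, and that its parse step passes lines that fail JSON parsing through unparsed rather than dropping them, so add an explicit filter if malformed events must be kept out of the index.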

Vector's config is the only piece that actually deserves to be in version control. The relevant snippet:

# /etc/vector/prerenderproxy.toml
[sources.fastly_log]
  type    = "syslog"
  address = "0.0.0.0:5140"
  mode    = "tcp"

[sources.nginx_log]
  type    = "file"
  include = ["/var/log/nginx/prerenderproxy-*.json"]
  read_from = "end"

[transforms.parse]
  type    = "remap"
  inputs  = ["fastly_log", "nginx_log"]
  # parse_json! would abort on non-JSON input. The fallible form with ??
  # keeps a malformed line as the original event instead of aborting on it.
  source  = '''
    . = parse_json(.message) ?? .
    .schema_version = "1"
    if !exists(.bot_class) { .bot_class = "user" }
    if .bot_class == "spoofed_ua" {
      # ?? handles errors, not nulls: array() errors on a missing or
      # non-array field, which is what lets ?? supply the empty default.
      hints = array(.anomaly_hints) ?? []
      .anomaly_hints = push(hints, "ua_no_ip_match")
    }
  '''

# Privacy/GDPR: mask client_ip and redact known PII query params before cold archive.
[transforms.privacy_mask]
  type    = "remap"
  inputs  = ["parse"]
  source  = '''
    if is_string(.client_ip) {
      ip = string!(.client_ip)
      # match() returns a boolean, so use it only as the IPv4 guard and let
      # replace() do the truncation: 66.249.66.1 becomes 66.249.66.0/24.
      # IPv6 addresses pass through unchanged here.
      if match(ip, r'^\d+\.\d+\.\d+\.\d+$') {
        .client_ip = replace(ip, r'\.\d+$', ".0/24")
      }
    }
    if is_string(.url) {
      # .url is a path plus query string, not an absolute URL, so parse_url
      # would reject it. Redact the listed params with two regex replaces.
      url = string!(.url)
      url = replace(url, r'\?(token|session|auth|key|password|email)=[^&#]*', "?redacted=1")
      .url = replace(url, r'&(token|session|auth|key|password|email)=[^&#]*', "&redacted=1")
    }
  '''

[sinks.elasticsearch]
  type     = "elasticsearch"
  inputs   = ["parse"]
  endpoints = ["https://es-internal:9200"]
  bulk.index = "combot-prp-%Y.%m.%d"

[sinks.combot_webhook]
  type     = "http"
  inputs   = ["parse"]
  uri      = "https://combot.ai/api/v1/ingest/prerenderproxy"
  encoding.codec = "json"
  framing.method = "newline_delimited"
  # At >10k req/sec a 500-event batch fires ≥20 POSTs/sec — too chatty.
  # 10–50k batches keep the destination API happy. For sustained >50k req/sec,
  # switch this sink to Kafka / Kinesis / Pub-Sub instead of an HTTP webhook.
  batch.max_events   = 25000
  batch.timeout_secs = 5
  auth.strategy = "bearer"
  auth.token    = "${COMBOT_INGEST_TOKEN}"

[sinks.s3_cold]
  type    = "aws_s3"
  inputs  = ["privacy_mask"]   # PII-masked before cold storage
  bucket  = "combot-prp-archive"
  encoding.codec = "json"
  framing.method = "newline_delimited"
  compression    = "gzip"
  batch.max_bytes  = 268435456
  batch.timeout_secs = 600

The fan-out matters operationally. Elasticsearch is the hot path for the last 90 days, used by the on-call dashboard and incident queries. Combot.ai gets the same stream as a webhook ingest so it can compose bot-traffic events with GSC, GA, and AI brand-mention data inside its own analytics. S3 is the cold archive: newline-delimited JSON compressed with gzip, two-year retention, for compliance and the rare deep historical query. The privacy_mask transform sits in front of the cold sink only, so client-IP /24 truncation and PII query-param scrubbing happen before anything lands in long-term storage; the Elasticsearch and webhook sinks read from parse and therefore receive unmasked events. Masking before the archive is a GDPR / CCPA data-minimization requirement, not a nice-to-have.
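The two masking rules are easy to get subtly wrong, so it can help to pin them down outside Vector with a reference implementation that fixtures can be compared against. The sketch below is an assumption-laden illustration, not the pipeline's code: mask_ip and strip_pii_params are hypothetical names, IPv6 is deliberately left untouched, and this variant removes the listed parameters entirely.

```python
import ipaddress
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Same key list as the privacy_mask transform.
PII_PARAMS = {"token", "session", "auth", "key", "password", "email"}


def mask_ip(client_ip: str) -> str:
    """Truncate an IPv4 address to its /24 network; leave anything else as-is."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return client_ip
    if addr.version != 4:
        return client_ip  # assumption: IPv6 masking is handled elsewhere
    return str(ipaddress.ip_network(f"{addr}/24", strict=False))


def strip_pii_params(url: str) -> str:
    """Drop the listed query parameters from a path-plus-query URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in PII_PARAMS
    ]
    # urlencode re-encodes the surviving pairs, so escaping may be normalized.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Running a handful of real log lines through both functions and through the Vector transform, then diffing the results, is a cheap regression test for the masking step.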

4 · What Combot.ai computes on top

Combot's job is not to be another Elasticsearch — it consumes the PrerenderProxy event stream alongside Search Console, GA4, and its own AI-visibility tracking layer, and produces a small number of high-value composite metrics that none of those data sources can produce alone:

Render-to-citation ratio. For each AI vendor: the number of fetches that vendor's bot made, divided by how often the brand was cited in that vendor's chat answers (Combot's brand-visibility index). It is the crawl-to-referral ratio for the AI era, sourced from the only two places you can actually measure it.
Bot share-of-voice attribution. When a brand-mention spike shows up in ChatGPT this week, Combot can correlate it backwards to specific PrerenderProxy events: "we served GPTBot 1.4× more pages this week, and ChatGPT-User picked up the corresponding pages 11 days later". The latency between training-time fetch and parametric appearance is one of the metrics Combot can produce.
Drift × ranking correlation. If drift_vs_user = true spikes for a URL set, did the same URL set drop a position in GSC two weeks later? The hypothesis is testable; Combot is where the join lives.
UA-spoof clustering. bot_class = "spoofed_ua" events grouped by source ASN — when a scraping operator pretends to be Googlebot from a residential proxy pool, the spike in this metric is visible inside an hour.
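As a rough illustration of the first metric in this list, the render-to-citation ratio reduces to a per-vendor count divided by a per-vendor citation figure. The sketch assumes the citation counts arrive as a plain vendor-to-count mapping, which is a hypothetical stand-in for whatever Combot's brand-visibility index actually exposes.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional


def render_to_citation(events: Iterable[Mapping],
                       citations: Mapping[str, int]) -> Dict[str, Optional[float]]:
    """Verified fetches per citation for each vendor seen in the events.

    A vendor with no recorded citations maps to None rather than infinity,
    so a dashboard can show it as "no citations yet".
    """
    fetches = Counter(
        event["bot_vendor"]
        for event in events
        if event.get("bot_class") == "verified_bot" and event.get("bot_vendor")
    )
    return {
        vendor: (count / citations[vendor] if citations.get(vendor) else None)
        for vendor, count in fetches.items()
    }
```

Only verified_bot events are counted, so spoofed traffic claiming to be an AI crawler does not distort the ratio.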

5 · The five anomalies that warrant pager-grade alerts

Most "AI is interesting" dashboards collect hundreds of metrics that no one watches. These are the five that matter; ship them with thresholds, alert on them, ignore the rest until you have time.

5.1 · Spoofed-UA rate spike

Definition: bot_class == "spoofed_ua" as a fraction of total bot-claim requests, per 15-minute window.

Why: a real attacker is testing your verification, or a scraping operator just turned on a new proxy pool. The metric is normally near-zero; a 10× spike is structurally informative.

# Elasticsearch DSL fragment. Requires an index template mapping
# bot_class as { "type": "keyword" } (dynamic mapping makes it text by default).
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range":  { "ts": { "gte": "now-15m" } } },
        { "exists": { "field": "ua_claim" } }
      ]
    }
  },
  "aggs": {
    "by_class": { "terms": { "field": "bot_class" } }
  }
}
# Alert: spoofed_ua/(verified_bot+spoofed_ua) > 0.10 for 2 consecutive windows
# Numerator/denominator: this query buckets both; compute the ratio in your alert rule.
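The ratio itself has to be computed from the buckets the query returns. A minimal sketch of that step, assuming the standard terms-aggregation response shape (aggregations.by_class.buckets with key and doc_count); the function name spoofed_ratio is hypothetical.

```python
from typing import Optional


def spoofed_ratio(es_response: dict) -> Optional[float]:
    """spoofed_ua / (verified_bot + spoofed_ua) from the by_class buckets."""
    buckets = es_response["aggregations"]["by_class"]["buckets"]
    counts = {bucket["key"]: bucket["doc_count"] for bucket in buckets}
    spoofed = counts.get("spoofed_ua", 0)
    verified = counts.get("verified_bot", 0)
    total = spoofed + verified
    # None signals "no bot-claim traffic in the window", which is not a spike.
    return spoofed / total if total else None
```

Returning None for an empty window keeps a quiet period from being read as either 0% or 100% spoofed.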

5.2 · Cache hit rate collapse

Definition: cache == "hit" share over rolling 1h.

Why: a regression in the cache-key logic, a TTL misconfiguration, or a sudden surge of long-tail URLs from a new crawler. Steady-state should be ≥85%; below 60% the system is essentially rendering everything live and the origin load curve will reflect that within minutes.

5.3 · Render-step p95 creep

Definition: p95 of render_ms where render_origin != null, per 5-minute window.

Why: Puppeteer memory leaks present as a slow walk upward in p95 render time, hours before they present as OOM kills. Alert at +30% over the rolling 24h baseline.
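The creep check can be sketched as a comparison of two p95 values. The code below uses a nearest-rank percentile, which is one reasonable choice and not necessarily what your metrics backend computes; the function names are hypothetical.

```python
from typing import Iterable, List, Mapping, Sequence


def render_samples(events: Iterable[Mapping]) -> List[float]:
    """render_ms values for events that actually rendered (render_origin set)."""
    return [
        event["render_ms"]
        for event in events
        if event.get("render_origin") is not None
        and event.get("render_ms") is not None
    ]


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def render_creep(window_ms: Sequence[float], baseline_ms: Sequence[float]) -> float:
    """Fractional change of the window p95 against the baseline p95."""
    return p95(window_ms) / p95(baseline_ms) - 1.0


def should_alert(window_ms: Sequence[float], baseline_ms: Sequence[float],
                 limit: float = 0.30) -> bool:
    """True when the 5-minute window sits more than `limit` above baseline."""
    return render_creep(window_ms, baseline_ms) > limit
```

Feeding it the last five minutes as the window and the trailing 24 hours as the baseline matches the definition above; cache hits drop out automatically because their render_origin is null.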

5.4 · Drift vs user version

Definition: count of drift_vs_user == true per URL set per 1h.

Why: the single failure mode that gets dynamic-rendering setups penalized in search. Page-level granularity matters; an alert needs to fire on "this specific URL drifted", not just "drift count went up".
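Keeping the count keyed by URL is what preserves that page-level granularity. A small sketch, assuming the events for one hour are already in memory; drifted_urls is a hypothetical helper and the default threshold mirrors the "> 5 per URL per 1h" rule shipped in section 6.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def drifted_urls(events: Iterable[Mapping], threshold: int = 5) -> Dict[str, int]:
    """URLs whose drift_vs_user == true count exceeds the threshold."""
    counts = Counter(
        event["url"] for event in events if event.get("drift_vs_user") is True
    )
    return {url: n for url, n in counts.items() if n > threshold}
```

The return value names the offending URLs directly, so the ticket can link to the pages instead of to an aggregate chart.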

5.5 · AI-bot zero-fetch (per vendor, per day)

Definition: count of verified_bot events where bot_vendor == 'gptbot' (or claudebot, perplexitybot, etc.) per day.

Why: if GPTBot has not fetched a single URL in 72 hours and you did not intentionally change robots.txt, something upstream broke. The robots.txt got mis-edited, OpenAI rotated its published IP ranges and your allowlist hasn't refreshed, or the bot vendor changed its UA format and your verification regex didn't catch up.
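Zero-fetch detection is the one check here that alerts on absence, so it needs an explicit list of vendors to look for. A minimal sketch under that assumption; silent_vendors and the vendor list passed to it are hypothetical, and the 72-hour default comes from the threshold in section 6.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Iterable, List, Mapping, Optional


def parse_ts(ts: str) -> datetime:
    # The schema's ts ends in "Z"; older fromisoformat versions need an offset.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def silent_vendors(events: Iterable[Mapping], vendors: Iterable[str],
                   now: Optional[datetime] = None,
                   max_silence: timedelta = timedelta(hours=72)) -> List[str]:
    """Vendors with no verified_bot event inside the silence window."""
    now = now or datetime.now(timezone.utc)
    last_seen: Dict[str, datetime] = {}
    for event in events:
        if event.get("bot_class") != "verified_bot":
            continue
        vendor = event.get("bot_vendor")
        seen_at = parse_ts(event["ts"])
        if vendor not in last_seen or seen_at > last_seen[vendor]:
            last_seen[vendor] = seen_at
    return sorted(
        vendor for vendor in vendors
        if vendor not in last_seen or now - last_seen[vendor] > max_silence
    )
```

A vendor that only appears as spoofed_ua still counts as silent, which is the signature of a stale allowlist: the bot is arriving, but verification no longer recognizes it.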

6 · Alert thresholds we ship with

spoofed_ua_rate     > 10%   for 2x15min windows  AND total_bot_claims > 500   → page
cache_hit_rate      < 60%   for 1h               AND total_requests   > 5000  → page
render_p95_creep    > 30%   over 24h baseline    AND queue_ms_p95 stable      → ticket (true leak)
render_p95_creep    > 30%   over 24h baseline    AND queue_ms_p95 also up     → ticket (saturation, scale fleet)
drift_vs_user_count > 5     per URL per 1h                                    → ticket
ai_bot_zero_fetch   72h     per vendor                                        → ticket
gptbot_volume       > 3x    baseline daily                                    → ticket (informational)
total_event_rate    < 50%   of expected baseline                              → page (logging pipeline dead)

The throughput floors on the first two thresholds are not cosmetic — they're what stops the on-call from getting woken up at 03:00 because a single long-tail URL got crawled with no other traffic in the 1h window. The render-creep alert is doubled because the same symptom (rising render_ms p95) has two unrelated causes — a Chromium memory leak versus a saturated fleet — and the render_queue_ms p95 cleanly tells them apart. The last threshold is the meta-alert: if the events stop coming, the other alerts can't fire. Heartbeat the pipeline itself.
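Two of these rules are worth seeing as code: the throughput floor and the split of render creep into leak versus saturation. The sketch below is illustrative only; the function names are hypothetical, and the 10% tolerance used to decide that queue_ms_p95 is "stable" is an assumption, since the table does not define it.

```python
from typing import Optional


def cache_hit_page(hits: int, total: int,
                   min_rate: float = 0.60, floor: int = 5000) -> bool:
    """Page only when the window has enough traffic for the rate to mean anything."""
    return total > floor and hits / total < min_rate


def render_creep_ticket(render_creep: float, queue_creep: float,
                        limit: float = 0.30,
                        queue_tolerance: float = 0.10) -> Optional[str]:
    """Classify a render p95 creep as 'leak', 'saturation', or no ticket (None).

    Both arguments are fractional changes against the 24h baseline.
    queue_tolerance is an assumed definition of "queue_ms_p95 stable".
    """
    if render_creep <= limit:
        return None
    return "saturation" if queue_creep > queue_tolerance else "leak"
```

With the floor in place, a window containing a single uncached long-tail request evaluates to False instead of paging on a 0% hit rate.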

7 · Where Combot.ai picks up the thread

Once the events are flowing into Combot, the analytics layer composes the bot stream with everything else Combot already tracks. A representative query someone might actually run, expressed as the Combot natural-language layer would interpret it:

"In the last 30 days, which of our product URLs saw the largest delta between
PrerenderProxy fetch count (Googlebot + GPTBot + ClaudeBot + Perplexity-User)
and GSC impressions? Limit to URLs where drift_vs_user has been false the
entire window. Plot fetch count vs impressions, weekly."

The Combot side answers this; the PrerenderProxy side made it answerable by emitting the events with the right fields in the first place. That's the whole loop.

8 · Closing

The work that earns "we know what our bots are doing" status is not the dashboard — it's the moment-of-truth structured log. Once every served request emits the schema above, the dashboard writes itself in a weekend; the anomaly detection is a few queries; the AI brand-visibility correlation falls out of the join with Combot. Skip this step and you are operating a prerender layer with the same instruments as a static-file webserver, and the failure modes that matter for bot operations are invisible to you.