PRACTICE

Stuck with legacy — when the CMS can't ship Product schema, the edge can

Most engineering blog posts about modernising a site for AI and search assume you can change the site. You can deploy a Next.js migration; you can rewrite the templates; you can ask the CMS team to add a Product schema field. In real enterprises, that conversation goes "scheduled for next year's release cycle", and the next year's cycle is somebody else's problem. Meanwhile the bots are visiting today.

This post is the operational answer for sites where the CMS or the legacy platform genuinely will not change in the planning horizon you have. The prerender layer can transform the response on its way out — inject Product / Offer schema the CMS won't ship, fix canonicals it gets wrong, turn its soft 404s into real 404s, serve different HTML to JavaScript-rendering vs HTTP-only bots, add hreflang it never learned. Per-bot, per-URL, all at the edge, none of it requiring the origin to know anything new.

1 · Five legacy pains the prerender layer can fix

The pattern of customer conversations that lead to this post:

  • The CMS templates predate Product schema. Adobe Experience Manager, SDL Tridion, and most pre-2020 Sitecore installations ship without modern JSON-LD. Adding it requires a template-change cycle that takes quarters.
  • Canonicals are wrong and the CMS won't fix them. This hits URL-parameter sites, faceted-navigation sites, and any site with a long tail of duplicate-content URLs. The CMS generates a canonical that points to the canonical of the canonical, and the SEO team has been asking for a fix since Q3 2024.
  • Soft 404s everywhere. Out-of-stock products return HTTP 200 with "this product is unavailable" body text. Old article URLs return 200 with the homepage. The crawl budget bleeds into pages that should be returning real 404s.
  • One bot can render JS, another can't, and the CMS only knows one mode. The site is a 2019 React SPA. Googlebot renders it; GPTBot and ClaudeBot don't. The CMS team can't ship the static fallback the AI bots need.
  • The site is international and lacks hreflang. Different language variants exist as separate URLs but there's no <link rel="alternate" hreflang="..."> set; the CMS doesn't have the concept.

Each of these is a problem the origin must, in theory, solve. In practice the prerender layer can solve all of them at the edge, faster than the origin team can ship a migration, while the migration is being planned.

2 · The pattern — transform at the prerender step, not at the origin

The architecture is the one PrerenderProxy already runs for the SSR-for-bots use case, with one extra step: between "Puppeteer rendered the page" and "respond to the bot", we run a transformation pipeline that mutates the rendered DOM to add/fix/strip whatever the legacy origin couldn't produce. The transformation rules live in version-controlled config alongside the site's prerender settings; they don't deploy to the origin.

┌────────────┐    ┌──────────────┐    ┌──────────────┐    ┌─────┐
│ Verified   │ →  │ Puppeteer    │ →  │ Transform    │ →  │ Bot │
│ bot request│    │ renders the  │    │ pipeline     │    └─────┘
│ (rDNS-ok)  │    │ legacy SPA   │    │  - inject    │
└────────────┘    │ as the user  │    │    JSON-LD   │
                  │ would see it │    │  - fix canon │
                  └──────────────┘    │  - hard 404  │
                                      │  - hreflang  │
                                      │  - strip     │
                                      │    noindex   │
                                      └──────────────┘

Critical constraint: the transformation must not change the user-visible content. Adding a JSON-LD <script> tag whose values the user's browser already displays? Fine: that's not cloaking, that's content the CMS should have shipped. Inventing a different price for the bot than the user gets? That's cloaking and we don't ship it. Section 5 addresses the dividing line directly.

3 · Six concrete transformations

3.1 · Inject Product / Offer / BreadcrumbList JSON-LD

The CMS rendered the product page with the price visible in a <span class="price">, the SKU in a data-attribute, and breadcrumbs in a <nav class="crumbs">. The values exist in the DOM — they just aren't expressed as schema. Read them, emit them:

// transform-pipeline/inject-product-schema.js
// Runs after Puppeteer rendering, before the response is sent to the bot.
// Site-specific selectors are configurable; the structure here is defensive
// because legacy DOMs vary and an emitted-wrong schema is worse than no schema.
module.exports = function injectProductSchema($, ctx) {
  if (!ctx.url.match(/^\/products\//)) return;

  const sel = ctx.selectors?.product || {
    name:  "h1.product-title",
    price: ".price",
    sku:   "[data-sku]",
    image: ".product-gallery img",
    stock: ".stock"
  };

  const name = $(sel.name).text().trim();
  const sku  = $(sel.sku).attr("data-sku") || $(sel.sku).text().trim();
  if (!name || !sku) return;  // no fabrication: emit only when sourced

  // Price: extract a normalized decimal. Reject ambiguous strings rather than guess.
  const priceText = $(sel.price).first().text();
  const priceNum  = extractPrice(priceText);  // returns { value, currency } | null

  // Availability: only set when the DOM is unambiguous. "Out of stock", "Not in stock",
  // "More on the way" → OutOfStock. "In stock", "Available now" → InStock.
  // Anything else → OMIT the field. Inferring availability from absence creates
  // exactly the InStock/OutOfStock mismatch Google penalizes.
  const stockText = $(sel.stock).text().toLowerCase();
  let availability = null;
  if (/\b(out of stock|not in stock|sold out|unavailable|more on the way)\b/.test(stockText)) {
    availability = "OutOfStock";
  } else if (/\b(in stock|available now)\b/.test(stockText)) {
    availability = "InStock";
  }

  // Image: only emit if the resolved URL is plain http(s). Skip data:/javascript:/etc.
  const imgRaw = $(sel.image).first().attr("src");
  let imageUrl = null;
  if (imgRaw) {
    try {
      const u = new URL(imgRaw, ctx.origin);
      if (u.protocol === "http:" || u.protocol === "https:") imageUrl = u.toString();
    } catch (_) { /* ignore */ }
  }

  const offer = { "@type": "Offer", "url": ctx.origin + ctx.url };
  if (priceNum) {
    offer.price = priceNum.value;
    offer.priceCurrency = priceNum.currency;  // sourced from the page, not defaulted
  }
  if (availability) offer.availability = "https://schema.org/" + availability;

  const jsonld = { "@context": "https://schema.org", "@type": "Product", name, sku };
  if (imageUrl) jsonld.image = imageUrl;
  // Only attach offers if we actually have price OR availability — empty offer is noise.
  if (offer.price || offer.availability) jsonld.offers = offer;

  // Escape so a body containing </script> can't break out of the tag,
  // and U+2028/U+2029 can't break inline-JS string literals.
  const json = JSON.stringify(jsonld)
    .replace(/<\/script/gi, "<\\/script")
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");

  $("head").append('<script type="application/ld+json">' + json + '</script>');
};

Four safety rails the production code holds to. The availability field is omitted rather than defaulted to InStock or OutOfStock when the DOM is ambiguous: defaults invent product state the user page doesn't have, which is the exact mismatch Google's Merchant systems penalize. The currency is read from the page (via extractPrice), never defaulted to a hard-coded ISO code. The image URL is allow-listed to http(s) only, because a legacy template that emits data: or javascript: URLs in its image attribute is a real and observed failure mode. And the JSON is escaped to handle the </script> injection vector; the U+2028 / U+2029 separators are valid inside application/ld+json per spec, but they break JavaScript string literals in engines that predate ES2019 when the same JSON ends up embedded in inline JS, so escaping them defensively keeps both rendering paths safe.
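
The listing calls extractPrice without showing it. The following is a minimal sketch of one possible shape, not the production helper: the only thing taken from the post is the return contract ({ value, currency } | null) and the "reject rather than guess" rule. The ISO_CODES list, the symbolMap parameter, and normalizeDecimal are assumptions made for illustration.

```javascript
// Hypothetical sketch of extractPrice: the post only fixes its return shape,
// { value, currency } | null. The code list and symbol map are assumptions.
const ISO_CODES = ["USD", "EUR", "GBP", "JPY", "CAD", "AUD", "CHF"];

function normalizeDecimal(token) {
  const lastDot = token.lastIndexOf(".");
  const lastComma = token.lastIndexOf(",");
  let intPart = token;
  let fracPart = "";

  if (lastDot !== -1 && lastComma !== -1) {
    // Both separators present: the later one is the decimal mark.
    const decPos = Math.max(lastDot, lastComma);
    intPart = token.slice(0, decPos);
    fracPart = token.slice(decPos + 1);
    if (intPart.includes(token[decPos])) return null; // e.g. "1,299.000,50"
  } else if (lastDot !== -1 || lastComma !== -1) {
    const parts = token.split(lastDot !== -1 ? "." : ",");
    if (parts.length === 2) {
      // One separator: "12,50" is a decimal, "1,299" could be either. Reject.
      if (parts[1].length > 2) return null;
      [intPart, fracPart] = parts;
    }
    // A repeated separator ("1,299,000") is treated as thousands grouping.
  }

  if (!/^\d{0,2}$/.test(fracPart)) return null;
  const groups = intPart.split(/[.,]/);
  const grouped = groups.length > 1;
  if (grouped && !/^\d{1,3}$/.test(groups[0])) return null;
  if (grouped && !groups.slice(1).every(g => /^\d{3}$/.test(g))) return null;
  if (!grouped && !/^\d+$/.test(groups[0])) return null;

  const digits = groups.join("");
  return fracPart ? digits + "." + fracPart : digits;
}

function extractPrice(text, symbolMap = { "€": "EUR" }) {
  if (typeof text !== "string") return null;

  // Exactly one numeric token, or the string is ambiguous ("$10 - $20").
  const nums = text.match(/\d[\d.,]*\d|\d/g);
  if (!nums || nums.length !== 1) return null;

  // Currency must be on the page: an ISO code, or a symbol the site has mapped.
  const codes = text.match(new RegExp("\\b(" + ISO_CODES.join("|") + ")\\b", "g")) || [];
  const symbols = Object.keys(symbolMap).filter(sym => text.includes(sym));
  let currency = null;
  if (codes.length === 1) currency = codes[0];
  else if (codes.length === 0 && symbols.length === 1) currency = symbolMap[symbols[0]];
  if (!currency) return null;

  const value = normalizeDecimal(nums[0]);
  return value === null ? null : { value, currency };
}
```

The sketch is deliberately strict. A bare "$" is not mapped by default because it could mean several currencies, and "1,299" with no decimals is rejected because the separator could be either a thousands mark or a decimal comma. A site that knows its own locale could pass a wider symbolMap or relax the single-separator rule.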

3.2 · Fix canonicals at the edge

// transform-pipeline/fix-canonical.js
module.exports = function fixCanonical($, ctx) {
  // The CMS sometimes emits a canonical pointing at a tracker-laden URL.
  // Strip known tracking params and self-reference.
  const url = new URL(ctx.origin + ctx.url);
  ["utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
   "gclid", "fbclid", "msclkid", "ref"].forEach(p => url.searchParams.delete(p));

  // Also: collapse trailing slash style to match the canonical you actually want.
  let canonical = url.toString();
  if (canonical.endsWith("/") && canonical !== ctx.origin + "/") {
    canonical = canonical.replace(/\/$/, "");
  }

  $('link[rel="canonical"]').remove();
  $("head").append(`<link rel="canonical" href="${canonical}">`);
};
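
To see what the rule produces on concrete URLs, the same logic can be pulled into a pure function. This is a sketch under assumptions: canonicalFor is a hypothetical name, example.com is a placeholder origin, and the trailing-slash rule is applied to the pathname rather than the whole string so that it still fires when a non-tracking query parameter survives.

```javascript
// Hypothetical pure-function variant of the canonical rule, for illustration.
const TRACKING_PARAMS = ["utm_source", "utm_medium", "utm_campaign", "utm_term",
                         "utm_content", "gclid", "fbclid", "msclkid", "ref"];

function canonicalFor(origin, pathAndQuery) {
  const url = new URL(pathAndQuery, origin);

  // Drop tracking parameters; keep everything else (e.g. ?color=red).
  for (const p of TRACKING_PARAMS) url.searchParams.delete(p);

  // Collapse a trailing slash on the path, but leave the site root as "/".
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

console.log(canonicalFor("https://example.com", "/products/widget/?utm_source=x&color=red"));
console.log(canonicalFor("https://example.com", "/?gclid=abc"));
```

Whether the no-trailing-slash form is the right canonical is a per-site decision; the point of the sketch is only that the choice is made in one place and applied consistently.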

3.3 · Soft 404 → hard 404 for crawlers

Detect the soft-404 signature (status 200 with the legacy "this product is unavailable" body) and rewrite the response status before delivering to the bot. Users keep their friendly message; crawlers get the right signal.

// transform-pipeline/hard-404.js
// A naive "body contains 'page not found'" match catches legitimate articles
// about 404 errors. Use multiple converging signals before rewriting status,
// and only on path patterns where you have audited the false-positive surface.
module.exports = function hard404($, ctx, response) {
  if (response.status !== 200) return;
  if (!ctx.url.match(/^\/products\//)) return;  // scoped: only product paths

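  // Extract content text without scripts/styles, preferring the main content area.
  // A combined selector returns matches in document order, so <body> would always
  // come first; try each candidate in priority order instead.
  const root = ["main", "[role=main]", ".product-detail", "body"]
    .map(s => $(s).first()).find(el => el.length > 0);
  if (!root) return;
  const mainText = root.clone().find("script, style, noscript").remove().end()
    .text().toLowerCase();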

  // Signal 1: explicit unavailability phrasing
  const phraseHit = /\b(this product is no longer (offered|available)|item no longer offered|product has been discontinued)\b/.test(mainText);

  // Signal 2: no product schema-relevant DOM (the legacy page collapsed to a stub)
  const noProductBody = $(".product-detail, .product-buy, [data-sku]").length === 0;

  // Signal 3: title or H1 explicitly says "not found"
  const titleHit = /(not found|no longer available)/i.test(($("h1").first().text() + " " + $("title").text()));

  // Require at least TWO of the three signals before mutating status.
  const score = [phraseHit, noProductBody, titleHit].filter(Boolean).length;
  if (score >= 2) {
    response.status = 404;
    // Replace the verbose body with a minimal 404 so the crawler doesn't
    // index thin soft-404 content as a real page either.
    $("body").empty().append("<h1>Not Found</h1><p>This product is no longer available.</p>");
  }
};

The user path is untouched — a real user hitting a discontinued-product page still sees the friendly message. Only the bot path gets the 404 status. Important distinction: this transform is for permanently discontinued products (the product is gone, the URL is no longer valid). For temporarily out-of-stock products — where the URL still represents a real product that may return — the right answer is HTTP 200 with Product/Offer JSON-LD carrying availability: OutOfStock, which the schema-injection transform in §3.1 already handles. Conflating the two is the failure mode that turns a routine inventory dip into a deindexing event. The two-signal threshold + scoped path filter + "discontinued" / "no longer offered" phrase requirements above are what keep the rule narrowly aimed at the permanent case.

3.4 · JS-rendering vs HTTP-only — per-bot response shaping

Googlebot, Bingbot, Applebot all render JavaScript. GPTBot, ClaudeBot, PerplexityBot do not. For a CMS that ships a 100% client-side SPA, the JS-renderers can hydrate it; the HTTP-only bots see an empty shell.

The fix is the prerender layer rendering once (Puppeteer-style), then serving:

  • HTTP-only bots: the rendered HTML, fully populated, no JS bundle reference (or a stripped one).
  • JS-rendering bots: the same rendered HTML plus the JS bundle, so they can hydrate it and confirm the result matches the prerendered DOM.
  • Real users: untouched. They get the original SPA.

// transform-pipeline/per-bot-shape.js
const HTTP_ONLY_BOTS = ["gptbot", "claudebot", "ccbot", "bytespider",
                       "meta-externalagent", "oai_searchbot", "claude-searchbot",
                       "perplexitybot"];

module.exports = function perBotShape($, ctx) {
  if (HTTP_ONLY_BOTS.includes(ctx.bot_vendor)) {
    // Strip the SPA JS bundle from the response — the rendered HTML already
    // has everything they will consume. Saves bandwidth on both sides.
    $('script[src*="bundle"]').remove();
    $('script[type="module"]').remove();
  }
  // JS-renderers and users: leave the response alone. Hydration verifies parity.
};

3.5 · Inject hreflang the CMS never learned

If you have language-variant URLs but no <link rel="alternate" hreflang="...">, ship them from the edge based on a mapping:

// transform-pipeline/inject-hreflang.js
// Two corrections that matter: (1) parse the locale segment properly, since /en/
// and /en-us/ are both valid and slice(3) silently breaks the second; (2) verify
// each alternate URL exists in your manifest before emitting, because fabricated
// hreflang alternates that resolve to 404s or the wrong page are penalised
// as "spammy hreflang architecture".
const LOCALES = ["en", "en-US", "de-DE", "fr-FR", "es-ES"];   // configurable
const SEG_RX  = /^\/([a-z]{2}(?:-[a-z]{2})?)\//i;             // matches /en/ AND /en-us/

module.exports = function injectHreflang($, ctx) {
  const m = ctx.url.match(SEG_RX);
  const currentLocale = m ? m[1].toLowerCase() : null;
  const restOfPath    = m ? ctx.url.slice(m[0].length - 1) : ctx.url;

  // ctx.localeManifest is a lookup of {locale: [allowed paths]} built nightly
  // from sitemap.xml. Without it, fall back to emitting only the current locale.
  const manifest = ctx.localeManifest || null;

  $('link[hreflang]').remove();
  for (const code of LOCALES) {
    const prefix = "/" + code.toLowerCase();
    const href = ctx.origin + prefix + restOfPath;
    // With a manifest: skip locales where the page is not confirmed to exist.
    // Without one: emit only the current locale, as described above.
    const confirmed = manifest
      ? Boolean(manifest[code]?.includes(restOfPath))
      : code.toLowerCase() === currentLocale;
    if (!confirmed) continue;
    $("head").append(`<link rel="alternate" hreflang="${code}" href="${href}">`);
  }
  // x-default: point at the canonical locale (typically en or root).
  const xDefaultHref = ctx.origin + restOfPath;
  $("head").append(`<link rel="alternate" hreflang="x-default" href="${xDefaultHref}">`);
};
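
The listing assumes ctx.localeManifest already exists. The following sketches how such a lookup might be built from the URLs listed in sitemap.xml. It is an illustration only: buildLocaleManifest is a hypothetical name, the input is assumed to be an array of absolute URLs already extracted from the sitemap, and the output shape ({locale: [paths]}, keyed by the hreflang code, with paths starting at the slash after the locale segment) is inferred from how the transform reads it.

```javascript
// Hypothetical builder for the {locale: [paths]} manifest the transform reads.
// Input: absolute URLs from sitemap.xml (parsing the XML is assumed done elsewhere).
const LOCALE_SEGMENT = /^\/([a-z]{2}(?:-[a-z]{2})?)(\/.*)?$/i;

function buildLocaleManifest(sitemapUrls, locales) {
  // Map lowercase path segments ("de-de") back to configured codes ("de-DE").
  const byLower = new Map(locales.map(code => [code.toLowerCase(), code]));
  const manifest = Object.fromEntries(locales.map(code => [code, []]));

  for (const raw of sitemapUrls) {
    let pathname;
    try {
      pathname = new URL(raw).pathname;
    } catch (_) {
      continue; // skip malformed entries instead of failing the nightly build
    }
    const m = pathname.match(LOCALE_SEGMENT);
    if (!m) continue;                      // no locale segment, e.g. /blog/post
    const code = byLower.get(m[1].toLowerCase());
    if (!code) continue;                   // two-letter segment that is not a configured locale
    const rest = m[2] || "/";
    if (!manifest[code].includes(rest)) manifest[code].push(rest);
  }
  return manifest;
}

const manifest = buildLocaleManifest(
  ["https://example.com/en/products/widget",
   "https://example.com/de-de/products/widget",
   "https://example.com/blog/post"],
  ["en", "en-US", "de-DE"]
);
console.log(manifest);
```

For large sitemaps a Set per locale would make the lookup cheaper than an array scan; arrays are kept here only because the transform calls includes on them.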

3.6 · Strip stale or incorrect noindex / nofollow tags

Common after a SEO migration where the new templates correctly omit noindex but the old SPA still emits it from a hard-coded place no one wants to find. Strip it at the edge while the underlying issue gets ticketed:

// transform-pipeline/strip-stale-noindex.js
module.exports = function stripStaleNoindex($, ctx) {
  // Only run on URL patterns where you've confirmed the noindex is wrong.
  if (!ctx.url.match(/^\/articles\/(?!archive\/)/)) return;

  $('meta[name="robots"][content*="noindex"]').remove();
  $('meta[name="googlebot"][content*="noindex"]').remove();
};

4 · The transformation pipeline runs in a deterministic order

The order of transformations matters: running fix-canonical after inject-hreflang would create canonicals that conflict with the alternates; running hard-404 after inject-product-schema would emit Product JSON-LD for a page about to return 404. The pipeline is staged:

1. hard-404                  // mutate status first; bail early on 404
2. strip-stale-noindex       // remove false noindex from the rendered DOM
3. inject-product-schema     // emit machine-readable equivalent of user-visible state
4. fix-canonical             // canonicalise first; inject-hreflang runs next
5. inject-hreflang           // alternate links (must run after canonical)
6. per-bot-shape             // last step: shape the final response per bot UA

Each stage is idempotent — running it twice produces the same DOM — and the pipeline is run inside a single Puppeteer session per request, so transformations cannot race against each other.
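
A staged runner of this kind can be very small. The sketch below is an assumption about shape, not PrerenderProxy's actual runner: runPipeline is a hypothetical name, each stage is assumed to take ($, ctx, response) as hard404 does above, and stub stages stand in for the real transforms so the control flow can be run on its own.

```javascript
// Hypothetical runner: executes stages in a fixed order and stops once a stage
// has turned the response into a 404, so nothing is injected into a dead page.
function runPipeline(stages, $, ctx, response) {
  const ran = [];
  for (const [name, stage] of stages) {
    stage($, ctx, response);
    ran.push(name);
    if (response.status === 404) break;
  }
  return ran; // names of the stages that actually ran, useful for logging
}

// Stub stages for illustration; the real ones are the transforms shown earlier.
const stubStages = [
  ["hard-404",              (_$, ctx, res) => { if (ctx.discontinued) res.status = 404; }],
  ["strip-stale-noindex",   () => {}],
  ["inject-product-schema", () => {}],
  ["fix-canonical",         () => {}],
  ["inject-hreflang",       () => {}],
  ["per-bot-shape",         () => {}],
];

console.log(runPipeline(stubStages, null, { discontinued: true },  { status: 200 }));
console.log(runPipeline(stubStages, null, { discontinued: false }, { status: 200 }));
```

Keeping the order in a single array means the sequencing constraints live in one reviewable place rather than being implied by file names or registration order.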

5 · The cloaking question — answered directly

The natural reaction reading the patterns above: "isn't this cloaking? Won't Google penalize it?" The honest answer is: only if you make it cloaking. The line is drawn around content, not markup.

Three rules we ship with and don't bend:

  1. The values you emit must be values the user can already see. The Product schema's price comes from the user's price tag. The hreflang URLs point at pages the user can also visit. The canonical points at a URL the user-path also serves. Don't invent prices, don't fabricate features, don't claim availability that isn't there.
  2. The user-path content is the floor. Anything we emit to a bot that the user wouldn't see in their hydrated DOM is forbidden. Adding a JSON-LD <script> with values already on the page? Fine — that script renders in the user's browser too. Inserting a paragraph of keyword-stuffed prose that the user never reads? Cloaking. Don't ship the rules engine without ratchet checks for the second case.
  3. The status-code rewrites are narrowly scoped. Converting a soft-404 to a hard-404 for a bot is fine and recommended: Google's soft-404 guidance advises returning a 404 (or 410) for pages that no longer exist. Converting a 200 to a 404 for the user-path would break the site. The rule that catches both: only mutate the status code on the bot-path of the transformation pipeline.
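
One way to make rules 1 and 2 mechanical is a parity check that runs before a response ships: every non-URL value in the emitted JSON-LD has to be findable in the text the user sees. The sketch below is a crude floor, not the production ratchet. findUnsourcedValues, leafValues, and the exemptPaths parameter are hypothetical names, and the substring comparison can pass by accident on short values.

```javascript
// Hypothetical parity check: list JSON-LD fields whose values do not appear
// in the user-visible text. An empty result means nothing was invented.
function leafValues(node, path = "") {
  if (node === null || typeof node !== "object") return [[path, String(node)]];
  return Object.entries(node).flatMap(([key, value]) =>
    key.startsWith("@") ? [] : leafValues(value, path ? path + "." + key : key));
}

function findUnsourcedValues(jsonld, visibleText, exemptPaths = []) {
  // Ignore case, whitespace and number punctuation so "1,299.00" matches "1299.00".
  const squash = s => s.toLowerCase().replace(/[\s.,]/g, "");
  const page = squash(visibleText);

  return leafValues(jsonld)
    .filter(([path]) => !exemptPaths.includes(path))
    .filter(([, value]) => !/^https?:\/\//i.test(value)) // URLs are checked elsewhere
    .filter(([, value]) => !page.includes(squash(value)))
    .map(([path]) => path);
}

const visible = "Acme Widget 3000 SKU: AW-3000 EUR 1,299.00 In stock";
const emitted = {
  "@context": "https://schema.org", "@type": "Product",
  name: "Acme Widget 3000", sku: "AW-3000",
  offers: { "@type": "Offer", price: "1299.00", priceCurrency: "EUR",
            availability: "https://schema.org/InStock" }
};
console.log(findUnsourcedValues(emitted, visible));
```

Fields that are derived rather than copied, such as a currency code mapped from a symbol, would need an entry in exemptPaths or a dedicated check; the sketch does not attempt that.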

Google's own guidance on dynamic rendering, even after the 2024 deprecation, explicitly permits "the same content in a different form" — which is exactly what every transformation above does. The patterns are not a workaround to a rule, they are the rule applied. The historical context in our dynamic-rendering essay covers the policy in more detail.

6 · When NOT to use this — migrate instead

Three conditions where transformation-at-the-edge is the wrong choice and migration is what you should actually do:

  • You have the engineering capacity to migrate. If the CMS team can ship Product schema in a quarter, just ship it. The edge transformation should be a bridge, not a destination.
  • The content itself is wrong, not the form. If the CMS lists incorrect product titles or stale descriptions, no edge transformation fixes that — you'll be propagating wrong content to bots faster than to users. Fix the source.
  • The legacy platform is being decommissioned in the next 12 months. Edge transformations require operational maintenance. If the platform is going away, the right move is to not invest in the legacy stack's bot-readability at all and put that engineering time into the migration.

For the cases that remain — older enterprise CMS, vendor platforms with rigid templates, sites mid-migration where the legacy property still needs to keep its SEO and AI visibility for another 18 months — the edge transformation pattern is the high-leverage move. The work fits in a sprint, not a quarter, and the result is indexable HTML that no one had to ship from the origin to produce.

7 · Closing

The legacy stack is not a permanent loss. The CMS can't ship Product schema; the prerender layer can. The CMS won't fix the canonical; the prerender layer will. The CMS treats soft-404s as 200s; the prerender layer makes them 404s for crawlers. The architectural pattern that makes this safe is "transform the user-visible content into the form bots need, never invent content the user wouldn't see." The cloaking risk evaporates when the rule is held; the SEO and AI-visibility gains are immediate.

Related: Verifying that the bot really is the bot · Dynamic rendering history · Brand robots.txt strategy · Bot Directory