Automated Regression Testing for Custom LLMs in 2026

Notes from the Release Cycle — Part 7

Friday at 4:47 PM you shipped a one-character prompt tweak. The aggregate eval score moved from 0.873 to 0.871 — well inside the noise floor. Monday morning your support queue is on fire over a class of queries you stopped looking at six months ago because they had been stable.

Nothing in the model regressed. The model is the same model. The eval drifted out from under you. Six months of slow growth in one customer segment never made it into the golden dataset, the judge prompt was last calibrated against humans in October, and the retrieval index quietly rebuilt itself last Wednesday on a refreshed embedding model.

This is what post 6 called out: the model is the right answer for roughly one alert in seven, which means your regression suite has to detect drift in itself, not just in the model. This post is the suite.

What is regression testing for a custom LLM, actually?

Software regression tests assert output == expected for fixed inputs. They work because the function is deterministic.

A language model is not a function in the same sense. The same prompt at temperature > 0 produces a distribution of valid completions, and “valid” is multi-dimensional: did it answer the question, is the answer grounded in the retrieved context, did it stay inside the safety envelope, did it come back inside the latency budget. So regression testing a custom LLM means measuring the distribution of behaviour against a frozen baseline distribution — across slices that matter to you, with judges that have been calibrated against humans, on inputs that look like your production traffic.

Three things have to be in place before any of this is meaningful:

A golden dataset that resembles production at the slice level, not in aggregate.
A calibrated judge — not “we use GPT-5 as judge,” but “we measured Spearman ρ ≥ 0.7 against three human raters, last refreshed last week.”
A baseline manifest — the exact model weights, prompt template, retrieval index, and judge version that scored what they scored. Without this you cannot tell whether the score moved because the model changed or because the ruler changed.

Divinci runs all three as first-class objects, hash-linked, scored on every commit. The rest of this post is how to assemble them.

Why most LLM regression suites fail to catch real regressions

The dominant 2026 failure mode for custom LLMs is what Tianpan’s Sigma Inference team named the Semver Lie in their April 2026 postmortem^[1]: an aggregate metric stays flat or improves, while one or two production slices silently regress. The slice was below 5% of traffic when the test was designed, so it never made it into the golden dataset; six months later it is 12% of traffic, the model degraded on it, and the aggregate number was never going to notice.

We have looked at every public LLM-release postmortem from the past eighteen months and the pattern repeats: the suite scored green because it scored the wrong thing. Specifically:

The golden dataset was hand-written by the team at launch and never re-stratified against shifted traffic distributions.
The LLM-as-judge prompt was set once and never re-calibrated against human labels. Judge agreement decayed silently^[2].
The baseline scores were stored as raw numbers, not as (model_sha, prompt_sha, judge_sha, dataset_sha, score) tuples — so when something regressed, no one could tell which of the four had moved.

A regression suite that does not solve all three of these is just a CI step that turns green at deploy time and gives you false confidence. The fix is not “more cases.” The fix is slice-aware, version-anchored, judge-calibrated measurement, on each release.
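The version-anchored part of that fix can be sketched in a few lines. This is a minimal illustration, assuming each scored run is stored as the tuple named above; `ScoredRun` and `moved_components` are hypothetical names, not part of any Divinci API.

```python
from typing import NamedTuple


class ScoredRun(NamedTuple):
    """One stored score together with the hashes of everything that produced it."""
    model_sha: str
    prompt_sha: str
    judge_sha: str
    dataset_sha: str
    score: float


def moved_components(baseline: ScoredRun, candidate: ScoredRun) -> list[str]:
    """Return the names of the pinned inputs whose hashes differ between two runs."""
    pinned = ("model_sha", "prompt_sha", "judge_sha", "dataset_sha")
    return [name for name in pinned if getattr(baseline, name) != getattr(candidate, name)]


# Illustrative values only: a prompt tweak with everything else held fixed.
baseline = ScoredRun("m1", "p1", "j1", "d1", 0.873)
candidate = ScoredRun("m1", "p2", "j1", "d1", 0.871)
print(moved_components(baseline, candidate))
```

If more than one name comes back, the two scores are not directly comparable, because both the model side and the ruler may have moved at once.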

Build a golden dataset that survives slice-aware analysis

The four-bucket composition we ship by default — production samples 60%, adversarial 15%, expert-curated edge cases 15%, failure replays 10% — is a reasonable starting point. What makes it actually catch regressions is the slice metadata attached to every case.

Every entry in the dataset carries: input, expected behaviour (rubric, not exact string), retrieval context (if any), and a slice tag — domain, user segment, query intent, language, length bucket, whichever decompositions matter for your product. The suite scores per slice, and any slice that drops past its threshold blocks the release, even if the aggregate score went up.

[Figure: structural diagram. Stratification axes and per-slice thresholds are configured per product in the Divinci release manifest. Source: internal, defined in our own deployments.]
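The per-slice blocking rule described above can be sketched as follows. The slice names, scores, traffic shares, and thresholds are all made up for illustration, and `blocking_slices` is a hypothetical helper rather than the shipped gate.

```python
def blocking_slices(baseline, candidate, max_drop):
    """Return {slice: drop} for slices whose score fell by more than max_drop[slice]."""
    blocked = {}
    for slice_tag, base_score in baseline.items():
        drop = base_score - candidate[slice_tag]
        if drop > max_drop[slice_tag]:
            blocked[slice_tag] = round(drop, 4)
    return blocked


def weighted_aggregate(scores, traffic_share):
    """Traffic-weighted mean score, i.e. the aggregate number a dashboard would show."""
    return sum(scores[tag] * traffic_share[tag] for tag in scores)


# Hypothetical slices and numbers.
traffic_share = {"billing": 0.6, "onboarding": 0.3, "medical": 0.1}
baseline = {"billing": 0.90, "onboarding": 0.88, "medical": 0.85}
candidate = {"billing": 0.93, "onboarding": 0.90, "medical": 0.79}
max_drop = {"billing": 0.02, "onboarding": 0.02, "medical": 0.02}

aggregate_improved = weighted_aggregate(candidate, traffic_share) > weighted_aggregate(baseline, traffic_share)
print("aggregate improved:", aggregate_improved)
print("blocked slices:", blocking_slices(baseline, candidate, max_drop))
```

In this toy data the aggregate rises while the low-volume slice falls past its threshold, which is exactly the case an aggregate-only gate would wave through.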

Two operational rules we have learned to enforce:

Resample quarterly. Production traffic distributions shift faster than most teams measure. We re-stratify the production-sample bucket against the last 90 days of traffic every quarter; if any slice grew past 5% of traffic and was under 2% of the golden dataset, it gets backfilled before the next release ships.

Every postmortem adds a case. A regression that reached production and was not caught is a case that was missing from the dataset. We add it to the replays bucket inside 48 hours of the postmortem and tag it with the slice that surfaced it.
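The quarterly backfill rule can be expressed directly. This is a sketch under the assumption that both production traces and golden cases carry a single slice tag; the function name and the example slices are hypothetical, and the 5% and 2% defaults come from the rule above.

```python
from collections import Counter


def slices_to_backfill(traffic_slices, golden_slices, traffic_floor=0.05, golden_ceiling=0.02):
    """Flag slices above traffic_floor of traffic but below golden_ceiling of the golden dataset."""
    if not traffic_slices or not golden_slices:
        raise ValueError("both traffic and golden dataset must be non-empty")
    traffic = Counter(traffic_slices)
    golden = Counter(golden_slices)
    n_traffic = sum(traffic.values())
    n_golden = sum(golden.values())
    flagged = []
    for slice_tag, count in traffic.items():
        traffic_share = count / n_traffic
        golden_share = golden.get(slice_tag, 0) / n_golden
        if traffic_share > traffic_floor and golden_share < golden_ceiling:
            flagged.append(slice_tag)
    return sorted(flagged)


# Hypothetical 90-day traffic sample and golden dataset, one slice tag per item.
traffic = ["billing"] * 70 + ["onboarding"] * 18 + ["spanish_support"] * 12
golden = ["billing"] * 60 + ["onboarding"] * 39 + ["spanish_support"] * 1
print(slices_to_backfill(traffic, golden))
```

A slice that appears in traffic but not at all in the golden dataset counts as a 0% share, so it is flagged as soon as it crosses the traffic floor.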

How do you detect drift before users do?

There are four distinct kinds of drift, and a regression suite that watches only the first one is a regression suite that misses most regressions.

Drift type	What moves	Detection signal	Action
Quality drift	The judge’s score for a fixed slice	Per-slice Spearman ρ vs baseline drops	Block release; diagnose per post 6’s tree
Coverage drift	Production traffic distribution vs golden dataset distribution	KL-divergence between slice proportions	Resample golden dataset
Judge drift	Judge model agreement with humans	Spearman ρ vs a frozen human-labelled audit set	Recalibrate judge prompt or replace judge
Production drift	Live production scores vs offline scores on the same model	Production-trace replay score gap	Investigate retrieval / preprocessing / runtime

Quality drift is the one most suites measure; the other three are where Friday-afternoon regressions usually hide. Divinci tracks all four against the baseline manifest, with the per-slice score breakdown surfaced on every PR and a weekly judge-calibration job that flags drift before it accumulates.

[Figure: stylised reconstruction of the Tianpan Sigma postmortem pattern^[1] using internal Divinci slice nomenclature. Specific values are illustrative.]
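The coverage-drift signal in the table, KL-divergence between slice proportions, can be computed with the standard library. This is a sketch: the direction KL(traffic || golden), the smoothing constant, and the example proportions are assumptions chosen for illustration, and any alerting threshold would have to be tuned per product.

```python
import math


def slice_kl_divergence(traffic_share, golden_share, epsilon=1e-6):
    """KL(traffic || golden) in nats over the union of slice tags.

    A small epsilon is added to every slice so that a slice missing from one
    side does not produce a division by zero or log(0).
    """
    tags = sorted(set(traffic_share) | set(golden_share))
    p = [traffic_share.get(tag, 0.0) + epsilon for tag in tags]
    q = [golden_share.get(tag, 0.0) + epsilon for tag in tags]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


# Hypothetical slice proportions.
golden = {"billing": 0.60, "onboarding": 0.39, "spanish_support": 0.01}
traffic_now = {"billing": 0.70, "onboarding": 0.18, "spanish_support": 0.12}

print(slice_kl_divergence(golden, golden))
print(slice_kl_divergence(traffic_now, golden))
```

Measuring from traffic to golden penalises slices that are common in production but rare in the dataset, which is the direction that matters for coverage.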

Multi-dimensional evaluation — score four things at once, per slice

A single composite score is a worse signal than four scalar scores. We gate on four dimensions:

Task completion — did the response actually answer the question, scored by a calibrated judge against a rubric. Slice-aware.
Faithfulness — for any response that referenced retrieved context, is every claim grounded in that context. Hallucination shows up here first.
Safety — refusal correctness, jailbreak resistance, PII / policy exposure. Almost always gates at ≥ 0.99 pass-rate; safety is a hard wall, not a soft trade-off.
Latency budget — p95 within the slice’s SLA. A prompt change that doubled tokens-per-response is a regression even if quality went up.

Each dimension has its own per-slice baseline and its own per-slice threshold. We never combine them into a single weighted scalar at gate time; we surface them as four scores per slice and block on whichever moved past its threshold first. A model that gained 4 points of task completion at the cost of 1 point of faithfulness on the medical slice is still a regression.
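A sketch of that gate, assuming scores are held per slice and per dimension. The `Budget` type, the dimension keys, and every number below are hypothetical; the point is only that each dimension is compared against its own budget and nothing is averaged together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """How far one dimension may regress on one slice before the gate blocks."""
    max_regression: float
    higher_is_better: bool = True  # latency is the usual exception


def regressions(baseline, candidate, budgets):
    """List (slice, dimension, regression) for every dimension past its budget."""
    found = []
    for (slice_tag, dim), budget in budgets.items():
        delta = candidate[slice_tag][dim] - baseline[slice_tag][dim]
        regression = -delta if budget.higher_is_better else delta
        if regression > budget.max_regression:
            found.append((slice_tag, dim, round(regression, 4)))
    return found


# Hypothetical medical slice: task completion up 4 points, faithfulness down 1.
baseline = {"medical": {"task_completion": 0.82, "faithfulness": 0.96, "safety": 0.995, "p95_latency_s": 2.1}}
candidate = {"medical": {"task_completion": 0.86, "faithfulness": 0.95, "safety": 0.995, "p95_latency_s": 2.0}}
budgets = {
    ("medical", "task_completion"): Budget(0.02),
    ("medical", "faithfulness"): Budget(0.005),
    ("medical", "safety"): Budget(0.0),
    ("medical", "p95_latency_s"): Budget(0.3, higher_is_better=False),
}
print(regressions(baseline, candidate, budgets))
```

The gain in task completion never offsets the faithfulness drop, because the two numbers are never added.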

What gates should block a custom LLM deployment?

We run a three-layer architecture, each layer gating a different stage of the pipeline (see post 1 for the stage taxonomy).

Layer 1 — Smoke (every commit, ~90 seconds). Twenty to thirty critical cases drawn from the highest-impact slices. Catches catastrophic regressions before the full suite spends compute. If smoke fails, the rest does not run.

Layer 2 — Full suite (every PR, ~12 minutes). The complete golden dataset, scored per slice on all four dimensions. Slice-aware Spearman ρ against the baseline manifest. Threshold breach blocks merge. The PR comment lists exactly which slice on which dimension moved by how much, with five example failing cases.

Layer 3 — Baseline comparison (release candidates, ~25 minutes). The candidate model is replayed against the last 14 days of production traces — the closed-loop production-trace replay we shipped in post 1. The same calibrated judge that scores the golden dataset also scores the replay outputs. Any slice whose replayed scores diverge from the offline scores by more than its threshold blocks the release. This layer is what catches drift the golden dataset does not yet know about.

Wall-clock numbers are internal — measured on Divinci's production CI runners for a representative customer with ~500 golden-dataset cases and ~14 days of production traces.
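The short-circuit behaviour of the layers can be sketched as an ordered list of checks. This is a simplification: in the setup described above the layers run at different pipeline stages, and the check functions here are hypothetical stand-ins that only return a pass or fail flag.

```python
from typing import Callable


def run_gates(layers: list[tuple[str, Callable[[], bool]]]) -> tuple[bool, list[str]]:
    """Run layers in order and stop at the first failure.

    Returns (passed, executed_layer_names) so the caller can see which
    layers actually spent compute.
    """
    executed = []
    for name, check in layers:
        executed.append(name)
        if not check():
            return False, executed
    return True, executed


# Stand-in checks; a real layer would run the suite and compare against thresholds.
layers = [
    ("smoke", lambda: False),
    ("full_suite", lambda: True),
    ("baseline_comparison", lambda: True),
]
print(run_gates(layers))
```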

Calibrate your judge before you trust a single score it produces

LLM-as-judge is what makes any of this scale past a few hundred cases. It is also where a regression suite quietly stops working, because the judge has no obligation to remain calibrated as it gets updated or as your data distribution moves.

We calibrate every judge prompt against a frozen human-labelled audit set of at least 100 cases stratified across the same slices as the golden dataset, and we re-run the calibration weekly. The bar we ship at is Spearman ρ ≥ 0.7 against the human-rater median, with Cohen’s κ ≥ 0.6 on binary safety judgments. Both of these are above the threshold where MT-Bench-style judges have been shown to track human raters at the level of inter-human agreement^[2].

When the weekly calibration drops below threshold, the judge is automatically retired and the on-call eval engineer is paged. The release pipeline holds open candidates rather than gating them on a judge that is no longer measuring what it used to measure.

# Run the weekly judge calibration job
curl -X POST https://api.divinci.ai/v1/regression/judges/calibrate \
  -H "Authorization: Bearer $DIVINCI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "judge_id":     "rubric-v7",
    "audit_set":    "human-labels-2026-04",
    "min_spearman": 0.70,
    "min_kappa":    0.60,
    "on_fail":      "retire_judge_and_page"
  }'
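What a calibration check of this kind computes can be approximated locally. The sketch below assumes judge scores and human-rater medians arrive as two equal-length numeric lists; the function names are hypothetical and this is not a description of what the Divinci endpoint does internally. Spearman ρ is computed as the Pearson correlation of average ranks, so ties are handled.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when one side is constant")
    return cov / (var_x * var_y) ** 0.5


def judge_passes(judge_scores, human_medians, min_spearman=0.70):
    """True when the judge's ranking tracks the human median at or above the bar."""
    return spearman_rho(judge_scores, human_medians) >= min_spearman
```

A usage example with made-up scores: `judge_passes([4, 3, 5, 1, 2], [5, 3, 4, 1, 2])` compares a judge that swaps the top two cases against the human ordering.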

The Divinci differentiator — closed-loop production-trace replay

The Layer 3 gate is the part most regression suites do not have. The flow is the same flow we shipped in post 1, with one specialisation for regression testing: every release candidate has its score on the offline golden dataset compared, slice by slice, to its score on a 14-day window of replayed production traces. The golden dataset measures what we expected the model to do. The replay measures what the model would actually have done last week.

When those two scores diverge by more than the per-slice gap budget, the release is blocked. The mismatch is the signal: either the golden dataset is no longer representative (coverage drift), or the candidate behaves differently on traces shaped by production preprocessing and retrieval (production drift). Either way, you find out before users do.
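A minimal sketch of the gap check, assuming the replay produces one judge score per trace and each trace carries a slice tag. The function names, slice names, scores, and the 0.03 gap budget are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def per_slice_mean(scored_traces):
    """Average judge score per slice from (slice_tag, judge_score) pairs."""
    buckets = defaultdict(list)
    for slice_tag, score in scored_traces:
        buckets[slice_tag].append(score)
    return {tag: mean(scores) for tag, scores in buckets.items()}


def replay_gaps(offline_by_slice, scored_traces, gap_budget):
    """Compare offline golden-dataset scores with replayed-trace scores, slice by slice."""
    replay_by_slice = per_slice_mean(scored_traces)
    report = {}
    for tag, offline_score in offline_by_slice.items():
        if tag not in replay_by_slice:
            continue  # no replayed traffic for this slice in the window
        gap = abs(offline_score - replay_by_slice[tag])
        report[tag] = {"gap": round(gap, 4), "blocked": gap > gap_budget[tag]}
    return report


# Hypothetical offline scores and replayed traces.
offline = {"billing": 0.90, "medical": 0.86}
traces = [("billing", 0.90), ("billing", 0.88), ("medical", 0.70), ("medical", 0.80)]
budget = {"billing": 0.03, "medical": 0.03}
report = replay_gaps(offline, traces, budget)
print(report)
```

The report keeps the gap for every slice, not just the blocked ones, since the size of the gap is the diagnostic signal that gets handed on.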

The judge that scores the offline run is the same judge that scores the replay run. The audit log records both score sets, both judge versions, the trace IDs that were replayed, and the gap that fired the block. The gap itself is the most useful diagnostic signal we have, and it is what gets handed to whoever picks up the post 6 diagnostic tree next.

Anchor the golden dataset with a vIndex receipt

Every score in the suite is meaningless if you cannot reproduce it later. We hash the golden dataset on each release and chain that hash into a vIndex receipt alongside the model SHA, prompt SHA, judge SHA, and the calibration record. The receipt is externally anchorable — auditors can replay our exact regression run six months later and verify the scores we claimed.

{
  "release_id": "rel_3f1a-2026-05-26",
  "model": { "sha": "0c1f9…", "weights_uri": "r2://models/custom-v7.2", "open_weights": true },
  "prompt": { "sha": "c4a8e…", "template_id": "support-v3.4" },
  "retrieval": { "index_sha": "b21f0…", "embedder": "e5-mistral-7b-instruct" },
  "judge": { "sha": "d8e21…", "rubric_id": "rubric-v7", "spearman_vs_humans": 0.74 },
  "dataset": { "sha": "a90b1…", "n": 512, "slices": 17, "stratified_at": "2026-04-30" },
  "scores": { "aggregate": 0.872, "by_slice": { "/* … */": "/* per-slice scalars */" } },
  "replay": { "trace_window_days": 14, "n_traces": 8430, "max_gap": 0.018 },
  "vindex_anchor": "sha256:f0bfd2…",
  "verifiable_at": "https://vIndex.divinci.ai/rel_3f1a-2026-05-26"
}

Open-weights caveat. The receipt above carries weight provenance only when the model is open-weights — vIndex anchors the actual weight bytes. For closed-API model backings (OpenAI / Anthropic / Google managed models), the receipt still carries the decision chain — every gate score, every judge result, the calibration record — but the weight field is empty, and you cannot independently verify the model artefact. We say this in the receipt and in the compliance documentation so auditors do not get a false impression. The releases that benefit most from a full vIndex chain are the ones where you control the weights.
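Hashing the golden dataset and chaining it with the other pinned hashes can be illustrated with the standard library. This is a sketch of the general idea only: the canonical-JSON encoding, the order-independent dataset digest, and the field names are assumptions, and nothing here describes the actual vIndex receipt format.

```python
import hashlib
import json


def canonical_sha256(obj) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, no extra whitespace)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def dataset_sha(cases) -> str:
    """Hash a list of cases so that reordering the cases does not change the digest."""
    return canonical_sha256(sorted(canonical_sha256(case) for case in cases))


def receipt_anchor(model_sha, prompt_sha, judge_sha, data_sha, calibration) -> str:
    """Chain the component hashes and the calibration record into one digest."""
    chained = {
        "model": model_sha,
        "prompt": prompt_sha,
        "judge": judge_sha,
        "dataset": data_sha,
        "calibration": calibration,
    }
    return "sha256:" + canonical_sha256(chained)


# Hypothetical cases and placeholder hashes.
cases = [
    {"input": "How do I update my card?", "slice": "billing"},
    {"input": "Is this dose safe?", "slice": "medical"},
]
anchor = receipt_anchor("model-hash", "prompt-hash", "judge-hash", dataset_sha(cases), {"spearman_vs_humans": 0.74})
print(anchor)
```

Changing any single case, or any one of the component hashes, changes the final anchor, which is what lets a later reader detect that the ruler moved.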

A four-phase implementation timeline that we have actually shipped

Teams that try to ship the full architecture in week one stall on tooling. The order below is the order that works.

Phase 1 — Baseline (week 1). Pull a stratified sample of the last 30 days of production traces. Have two engineers independently hand-label task completion on the same 100 cases, since agreement can only be measured on cases both of them labelled. Calculate the inter-rater agreement (target Cohen’s κ ≥ 0.6). The number you get is your starting human baseline; everything else gets calibrated against this.

Phase 2 — Harness (weeks 2–3). Stand up the evaluation harness on the 100-case dataset. Add a judge and calibrate it against your human labels. Verify the harness tracks the human scores at Spearman ρ ≥ 0.7. Most teams discover their first judge prompt fails this and rewrite it twice; this is normal.

Phase 3 — Gates (weeks 3–4). Wire the harness into CI as a warning, not a block. Watch it for two weeks. The thresholds you discover by watching false-positive rates are the only thresholds that survive. Promote to blocking only when the false-positive rate is below 5%.

Phase 4 — Replay loop (ongoing). Once gates are blocking reliably, enable the production-trace replay layer. This is where the slice-coverage gap surfaces, and where every postmortem starts adding cases back into the golden dataset.
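The inter-rater agreement from Phase 1 can be computed without any dependencies. This sketch uses the standard two-rater Cohen's κ formula; the labels below are invented, and `cohens_kappa` is simply a local helper name.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = task completed, 0 = not completed, same ten cases for both raters.
engineer_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
engineer_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(engineer_a, engineer_b)
print(round(kappa, 3), kappa >= 0.6)
```

Raw agreement in this toy data is 80%, but κ is lower because part of that agreement would be expected by chance.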

What this does not solve

Three honest limitations, framed the same way as in every post in this series.

Suite drift is endless work. Regression testing is infrastructure, not a project. The golden dataset has to be re-stratified every quarter, the judge re-calibrated every week, the threshold budgets re-tuned every postmortem. There is no version of this where you ship a suite and walk away.
A perfectly calibrated judge is still a model. Spearman ρ = 0.74 against human raters is a rank correlation, not an agreement rate: the judge orders cases broadly the way the human median does, but a meaningful share of individual calls still land differently. That residual disagreement is the noise floor on every score. We surface it explicitly in every release report; teams that forget it is there will be surprised by it eventually.
Closed-API backings cap how much you can verify. With a closed-API model, the regression suite measures behaviour but cannot verify weight provenance. If you need full reproducibility — regulated industries, audited deployments — the trade-off is on the model choice, not the suite.

Up next

Post 8, the last in this series, finishes the loop on the inside of CI. Where this post and post 5 were about what runs at the gates, the next one is about the CI layer that produces the candidates the gates score in the first place — pre-merge evaluation, contract tests for prompt templates, and how to size the CI fleet for a 12-minute eval suite without bankrupting the budget. It is the engineering layer underneath everything we have written about so far.

FAQ

What is the difference between LLM evaluation and LLM regression testing?

Evaluation measures whether a model meets a quality bar at a point in time, against an absolute rubric. Regression testing measures whether a candidate behaves the same as a frozen baseline, per slice, across multiple dimensions. The baseline is what makes it regression testing — Divinci ships both, and the regression mode pins (model_sha, prompt_sha, judge_sha, dataset_sha) so a moved score identifies which input moved.

How many cases should a golden dataset have?

Fewer than you think, stratified better than you think. We have shipped useful regression coverage with 200 cases on five well-defined slices and seen 5,000-case datasets that missed everything that mattered because they were unstratified. Start at 200, stratified, then grow the replay bucket case-by-case from postmortems.

Should I use human reviewers or LLM-as-judge?

Both, with humans calibrating the judge. Humans cannot keep up with the volume that a release-cycle CI gate needs to score. The judge fills the volume, the humans calibrate the judge — measured weekly with Spearman ρ ≥ 0.7. Either alone is a failure mode.

How do I test for non-deterministic outputs?

Score the distribution, not the string. Score with a rubric the judge can apply across phrasings, and run each input three to five times at temperature > 0 so the slice-aware score is over a distribution of completions rather than a single sample. Tighten temperature only for cases that genuinely need deterministic output (structured-output tool calls, classification).
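A sketch of repeated sampling, assuming the model and the judge are both available as plain callables. `fake_generate` and `fake_judge` are toy stand-ins so the example runs offline; a real harness would call the model at temperature > 0 and apply the calibrated rubric.

```python
import random
from statistics import mean, pstdev


def score_case(prompt, generate, judge, n_samples=5):
    """Sample several completions for one input and summarise the judge scores."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scores = [judge(prompt, generate(prompt)) for _ in range(n_samples)]
    return {"mean": mean(scores), "spread": pstdev(scores), "n": n_samples}


rng = random.Random(7)  # seeded so the toy run is repeatable


def fake_generate(prompt):
    """Stand-in for a sampled model call: picks one of several phrasings."""
    return rng.choice([
        "Refunds take 5 days.",
        "You will be refunded within 5 days.",
        "Please contact support.",
    ])


def fake_judge(prompt, completion):
    """Stand-in rubric: passes any phrasing that actually addresses the refund."""
    return 1.0 if "refund" in completion.lower() else 0.0


result = score_case("How long do refunds take?", fake_generate, fake_judge)
print(result)
```

Keeping the spread next to the mean shows which cases are unstable across samples, which a single-sample score cannot reveal.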

What metrics should I prioritise for the first CI quality gate?

Task completion and one safety gate. Both per-slice. Adding more dimensions before the first two are calibrated produces noise; teams that ship more usually end up gating on the noise. Add faithfulness next when you turn on retrieval; add latency once the first two are stable.

References

[1] Pan, Tianpan. "The Semver Lie: how a minor LLM update broke production." 29 April 2026. The named 2026 failure mode for slice-aware regression analysis; aggregate scores hold flat while a low-volume slice silently regresses.
[2] Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685. Empirical evidence that strong LLM judges agree with human raters at roughly inter-human-agreement levels (≈ 80%) on open-ended tasks, with reported failure modes that calibrate-against-humans audits are designed to detect.
[3] Kirkpatrick et al. "Overcoming catastrophic forgetting in neural networks." PNAS / arXiv:1612.00796. The foundational result on catastrophic forgetting in neural networks, and the reason a fine-tuned custom LLM has to be regression-tested for general capability loss, not just gain on the target task.
[4] Amazon Web Services. "SageMaker Deployment Guardrails — blue/green deployments and canary monitoring." The closed-API contrast: gates on infrastructure metrics (latency, errors, CPU) rather than on per-slice semantic quality.
[5] Spearman, C. "The proof and measurement of association between two things." American Journal of Psychology, 15(1):72–101, 1904. The rank-correlation coefficient that anchors the slice-aware gate; it is robust to scoring-scale drift in the judge, which is the property we needed.
[6] DORA / Google Cloud. "Accelerate State of DevOps — change-failure-rate and time-to-restore-service metrics." The cross-industry baseline for "how often deploys cause incidents" and "how fast you recover." Regression suites that block at the gate move the first metric down; instant rollback ([post 5](/blog/automated-llm-ci-cd-pipelines-with-instant-rollback/)) moves the second.
