CI Testing for Custom Language Models in 2026

Notes from the Release Cycle — Part 8 (final)

You ship the regression suite from post 7. It works. The slice-aware gates catch real bugs. The calibrated judge holds.

Then your engineering lead asks how much it costs to run on every PR. You do the multiplication: ~12 minutes of judge inference per PR, 60 PRs a day, four dimensions × seventeen slices, and the bill is real money. Worse, every developer is now waiting 12 minutes for a green check on a one-line prompt typo. Velocity drops^[1], the team grumbles, someone proposes “just run the gates nightly” — which is precisely how you give up everything the gates were supposed to do.

The fix is not less testing. The fix is testing in layers, with most of the signal arriving in the first ninety seconds. This post is what runs underneath the gate suite: sub-second contract tests, a tight smoke layer, a cost-aware fleet, and a two-week shadow window before any new gate blocks anyone.

This is post 8, the last of this series. By the end you will have the full picture — from the four-stage pipeline down to the contract-test fixture that runs on every commit.

What does CI mean for a custom language model?

CI for a custom LLM is the work the gate suite does not have to repeat. The gate scores semantic quality; CI catches everything that would make the gate’s score meaningless before the gate spends a single judge token.

Contract tests run in milliseconds and verify that prompt templates still render, that tool-call schemas still parse, that retrieval indices still respond, that the manifest still references hashes that actually exist. They are deterministic, free, and the only reason the rest of the pipeline can afford to exist. A pull request that breaks the prompt template should fail in 200 ms, not after 12 minutes of judge inference scoring nonsense.

The contract layer is the difference between a CI bill that scales linearly with PR volume and one that does not. Divinci’s CI runner spends more than 90% of its judge budget on real semantic evaluation, not on PRs that would have failed a schema check. That ratio is the headline number.

Why traditional CI breaks for LLMs — through the cost lens

Posts 1 and 7 covered why deterministic CI fails for a generative model. This post retells that story through cost: what matters here is what each of the four properties below costs to test, not whether it exists.

Property of LLMs	Traditional-CI failure	Cost shape
Non-deterministic outputs	Exact-match assertions flake	Re-runs amplify cost linearly with flake rate
Multi-dimensional quality	Single boolean is uninformative	Each dimension is a separate (paid) judge call
Provider drift	Pinned `gpt-4-2024-01-01` quietly retires	Recalibration burst when a provider sunsets a checkpoint
Non-local prompt effects	Local unit test cannot catch the effect	Distribution-shape changes between PRs, not within them — needs whole-suite re-run, not delta

The CI architecture has to make each of these affordable. Contract tests handle properties 1 and 3 cheaply. Smoke tests handle property 4 partially. Only the full suite handles property 2, and only on the PRs that actually need it.

The CI layer cake — sub-second to twenty-five minutes

The architecture we ship is four layers, each one earning its compute by catching what the cheaper layers below cannot. The slice-aware framing of every layer follows the same lesson the Tianpan Semver Lie postmortem made explicit^[4]: aggregate signals lie; per-slice signals catch what aggregates hide.

Layer wall-clock, per-layer cost, and funnel ratios are internal — measured on Divinci production CI for a representative customer (~500 golden-dataset cases, 17 slices, ~60 PRs/day).

The cost shape is the design. ~74% of PRs never spend a judge token — contract or smoke is enough. The PRs that do reach the full suite are the ones that touched a prompt, a model config, a retrieval index, or evaluation code — exactly the changes where the gate suite is the only signal worth trusting. Release candidates are the small share that reaches Layer 4.

Contract tests — the unfair advantage

Contract tests are the first line, the cheapest line, and the line most teams skip because they feel beneath the dignity of an “AI evaluation pipeline.” They are also where 30–40% of would-be regressions actually fail in our customers’ suites, before a single judge has been called.

The contract layer asserts five things and nothing else:

Prompt-template render. Every template renders against a canonical fixture without unbound variables, runaway loops, or broken Jinja-style includes.
Tool-call schema. Every declared tool’s argument schema parses, the JSONSchema is valid, and the rendered prompt actually references all required slots.
Manifest integrity. Every SHA in the release manifest — model, prompt, retrieval index, judge, dataset — corresponds to an artifact that exists in the registry. Dangling pointers fail here, not three layers in.
Index liveness. The retrieval index responds to a known query within budget. A rebuilt index that quietly broke retrieval surfaces here, not in production.
Denylist & token-budget. Any prompt template that introduced a forbidden token, blew the per-call token budget, or rendered past the context window fails here. Heuristic semantic-similarity scoring^[6] is also cheap enough to run at the contract layer for fuzzy-match denylist coverage where literal-string matching is insufficient.

# A representative contract test invocation — runs in roughly 600 ms
divinci ci contract \
  --manifest release/staging.yaml \
  --check schema,template,manifest,index,denylist \
  --fail-fast \
  --json-out /tmp/contract-report.json

None of these calls a judge. None of them is non-deterministic. None of them costs measurable money. And every one of them rules out an entire class of “the gate suite said the medical slice regressed” alerts that would have wasted a full 12 minutes of judge inference scoring output the model never could have produced correctly in the first place.
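Three of the five checks above (template render, manifest integrity, denylist and token budget) are simple enough to sketch in a few lines. The following is a minimal illustration, not the Divinci implementation: every function name is hypothetical, `string.Template` stands in for a Jinja-style engine, and whitespace splitting stands in for a real tokenizer.

```python
import string


def unbound_variables(template_text, fixture):
    """Return placeholders in the template that the fixture does not supply."""
    names = set()
    # string.Template.pattern exposes the named groups "named" and "braced".
    for match in string.Template.pattern.finditer(template_text):
        name = match.group("named") or match.group("braced")
        if name:
            names.add(name)
    return sorted(names - set(fixture))


def dangling_shas(manifest, registry_shas):
    """Return manifest components whose SHA is missing from the registry."""
    return sorted(
        name for name, entry in manifest.items()
        if entry["sha"] not in registry_shas
    )


def budget_violations(rendered, denylist, max_tokens):
    """Return literal denylist hits and a token-budget overrun, if any."""
    problems = []
    lowered = rendered.lower()
    for term in denylist:
        if term.lower() in lowered:
            problems.append("denylisted term: " + term)
    # Assumption: whitespace tokens approximate the model's tokenizer.
    approx_tokens = len(rendered.split())
    if approx_tokens > max_tokens:
        problems.append(
            "token budget exceeded: %d > %d" % (approx_tokens, max_tokens)
        )
    return problems


template = "Hello $customer, your plan is ${plan}. Ticket: $ticket_id"
fixture = {"customer": "Ada", "plan": "pro"}
print(unbound_variables(template, fixture))  # ['ticket_id']
```

Each function returns a list of problems instead of raising, so a runner can collect every failure in one pass or stop at the first non-empty result when it wants fail-fast behaviour.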

The smoke layer — 90 seconds, ~$0.05 per PR

If the contract layer is the cheap unfair advantage, the smoke layer is the one that actually catches regressions for less than the price of a coffee. It is twenty to thirty cases drawn from the highest-volume slices, scored on task completion and safety only: no faithfulness, no latency, no retrieval-grounded checks. Every PR runs this. It takes about 90 seconds because the cases are batched into a single judge call with a structured-output schema, and because the judge is the cheap calibrated judge, not the full-quality one used for release candidates.
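One way to draw a smoke set from the highest-volume slices is a deterministic round-robin, sketched below. The function name, the `slice` field on each case, and the default of five slices and 24 cases are illustrative assumptions; the post only states that the set is twenty to thirty cases from the highest-volume slices.

```python
from collections import defaultdict


def pick_smoke_cases(cases, slice_volume, top_slices=5, total=24):
    """Pick up to `total` cases, round-robin across the highest-volume slices."""
    ranked = sorted(slice_volume, key=slice_volume.get, reverse=True)[:top_slices]
    by_slice = defaultdict(list)
    for case in cases:
        if case["slice"] in ranked:
            by_slice[case["slice"]].append(case)

    picked = []
    depth = 0
    while len(picked) < total:
        added = False
        # Take the depth-th case from each slice, busiest slice first.
        for name in ranked:
            bucket = by_slice[name]
            if depth < len(bucket) and len(picked) < total:
                picked.append(bucket[depth])
                added = True
        if not added:
            break  # every selected slice is exhausted
        depth += 1
    return picked
```

Keeping the selection deterministic means the same PR always sees the same smoke cases, so a score change between two runs cannot come from resampling.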

We track which layer caught each shipped fix in a regression log, and the histogram has been consistent over the last six months in customer deployments:

Rolling-six-month aggregate across active Divinci CI deployments. Reported as the % of confirmed regressions where the layer named was the first to fail. Internal — measured by us.

The 3% that escape are why post 5’s instant rollback exists. The gates do not promise zero escapes; they promise a tight upper bound and a fast recovery for what gets through.

CI fleet sizing — how the 12-minute suite stays cheap

The full-suite layer is where the math has to work. A naïve implementation calls the judge once per case-per-dimension, runs them sequentially, and the bill scales linearly with case count. Three optimisations do most of the work to keep it tractable:

Embedding cache. The retrieval-context fingerprint for each golden-dataset case is hashed; if the case has not changed and the retrieval index has not changed, the cached embedding stands and the retrieval step is skipped. Hit rate after the first stable week is consistently above 90% in our customer deployments.

Judge batching. The calibrated judge is called with structured output, batching 8–16 cases per call. The judge’s per-token cost stays the same; the per-case overhead drops because the system prompt is amortised across the batch. The threshold for safe batching is set by the judge’s own calibrated agreement at that batch size^[2]; we measure this during the weekly judge-calibration pass (post 7).

KV-cache reuse across cases. For models where the same system prompt and tool definitions head every call, the KV cache for that prefix is computed once per suite run, not once per case^[3]. On open-weights deployments this is straightforward; on closed-API models it depends on the provider’s prefix-caching support.
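The first two optimisations can be sketched in plain Python. This is an illustration under stated assumptions, not the runner's code: `cache_key`, `EmbeddingCache`, `batches`, and `prompt_tokens_per_case` are hypothetical names, the embedding function is injected by the caller, and the token counts in the usage lines are made-up inputs chosen only to show the amortisation.

```python
import hashlib
import json


def cache_key(case, index_sha):
    """Fingerprint a case together with the retrieval index version."""
    payload = json.dumps({"case": case, "index": index_sha}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EmbeddingCache:
    """Recompute an embedding only when the case or the index has changed."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, case, index_sha):
        key = cache_key(case, index_sha)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(case["query"])
        return self._store[key]

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def batches(cases, batch_size):
    """Yield consecutive groups of at most `batch_size` cases."""
    for start in range(0, len(cases), batch_size):
        yield cases[start:start + batch_size]


def prompt_tokens_per_case(system_tokens, case_tokens, batch_size):
    """Input tokens attributable to one case when the system prompt is shared."""
    return system_tokens / batch_size + case_tokens


# Hypothetical sizes: a 1200-token system prompt and 300-token cases.
print(prompt_tokens_per_case(1200, 300, 1))   # 1500.0
print(prompt_tokens_per_case(1200, 300, 12))  # 400.0
```

Because the key covers both the case content and the index SHA, rebuilding the index invalidates every entry at once, which matches the condition in the text that the cached embedding stands only if neither has changed.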

The combined effect lands the full suite at roughly the cost numbers shown in the layer-cake diagram above. The exact figures are internal, but the ratio is the public claim: ~74% of PRs spend zero judge dollars; ~22% spend pennies; the remaining 4% spend a couple of dollars for the highest-confidence pre-rollout signal we know how to produce.
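The ratio claim turns into an expected cost per PR with one weighted sum. In the sketch below the shares (74% / 22% / 4%) come from the paragraph above, while the dollar values are placeholders: $0.05 stands in for "pennies" and $2.00 for "a couple of dollars". Substitute your own measured figures; the function name is hypothetical.

```python
def expected_cost_per_pr(tiers):
    """tiers: list of (share_of_prs, judge_dollars_per_pr) pairs."""
    total_share = sum(share for share, _ in tiers)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share * cost for share, cost in tiers)


# Shares from the text; dollar amounts are illustrative placeholders.
tiers = [(0.74, 0.00), (0.22, 0.05), (0.04, 2.00)]
per_pr = expected_cost_per_pr(tiers)
print(round(per_pr, 3))       # 0.091
print(round(per_pr * 60, 2))  # 5.46 per day at 60 PRs/day
```

With these placeholder prices the 4% tier accounts for most of the spend, which is why the path filter on the full suite matters more than shaving the smoke layer further.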

Shadow CI — turn it on without breaking the team

The single mistake we have watched teams make most often is flipping a new gate from “off” to “blocking” on day one. The thresholds were tuned on yesterday’s data, the false-positive rate is unknown, and the first time the gate fires the team has no calibration for whether it is real or a false alarm. The on-call eval engineer gets paged, the gate gets disabled, trust is gone, the project is dead.

The fix is shadow CI: run the new gate non-blocking for two weeks, post the result as a bot comment on every PR, and review the false-positive rate weekly before flipping it to blocking. The Divinci CI runner has a --shadow flag for exactly this. The PR comment looks the same as the eventual blocking version — same diff display, same per-slice breakdown — except it does not gate merge.

divinci ci run --layer=full --shadow --duration=14d --report-as=bot-comment

When the false-positive rate is below 5% sustained across the window, we flip it. When it is not, we tighten the per-slice thresholds, recalibrate the judge, and shadow again. Either way the team has not been ambushed by a new gate that fires on day one.
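The flip decision can be written down as a small rule. The sketch below assumes the false-positive rate is measured per week as the fraction of shadow-gate firings that a reviewer marked as false alarms, and reads "sustained" as every week in the window staying under tolerance. Both readings are assumptions, as are the function names.

```python
def weekly_false_positive_rates(firings):
    """firings: list of (week_number, was_false_alarm), one per gate firing."""
    by_week = {}
    for week, was_false_alarm in firings:
        fired, false_alarms = by_week.get(week, (0, 0))
        by_week[week] = (fired + 1, false_alarms + int(was_false_alarm))
    return {week: fa / fired for week, (fired, fa) in by_week.items()}


def ready_to_block(firings, tolerance=0.05, weeks_required=2):
    """True only if every week of the shadow window stayed under tolerance."""
    rates = weekly_false_positive_rates(firings)
    if len(rates) < weeks_required:
        return False  # not enough shadow data yet
    return all(rate < tolerance for rate in rates.values())
```

Checking each week separately, instead of pooling the whole window, stops one quiet week from masking a noisy one.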

A GitHub Actions workflow that actually composes

The piece that ties the layer cake into your existing CI runs in .github/workflows/llm-ci.yaml. The layers are wired so the cheap ones fail fast and the expensive ones only run when they need to — needs: chains and path-filtered triggers do the work^[5].

name: LLM CI
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'config/models.yaml'
      - 'eval/**'
      - 'retrieval/**'
      - 'manifests/**'
jobs:
  contract:
    runs-on: ubuntu-latest
    timeout-minutes: 2
    steps:
      - uses: actions/checkout@v4
      - run: divinci ci contract --manifest manifests/staging.yaml --fail-fast
  smoke:
    needs: contract
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - run: divinci ci run --layer=smoke --post-pr-comment
        env:
          DIVINCI_API_KEY: ${{ secrets.DIVINCI_API_KEY }}
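  # A job-level `if:` cannot read `steps.*`, so a small job publishes the path-filter result.
  # dorny/paths-filter is a third-party action, used here as one way to do this.
  changes:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: read
    outputs:
      gated: ${{ steps.filter.outputs.gated }}
    steps:
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            gated:
              - 'prompts/**'
              - 'config/models.yaml'
  full:
    needs: [smoke, changes]
    if: needs.changes.outputs.gated == 'true'
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - run: divinci ci run --layer=full --post-pr-comment --gate
        env:
          DIVINCI_API_KEY: ${{ secrets.DIVINCI_API_KEY }}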

Three things to notice. Layers chain via needs:, so smoke does not run on a broken contract and full does not run on broken smoke. The full job is path-filtered to the changes that actually warrant a 12-minute run — a typo fix in the README does not trigger the gate suite. The --post-pr-comment flag is what makes the per-slice diff visible without leaving GitHub.

The failed-PR debug loop

The other half of “the gate fired” is “show me why.” A regression-suite output of “medical slice task-completion dropped 0.04” is unactionable without the cases that caused it. We surface the five worst per-slice diffs in the PR comment, with the original input, the baseline output, the candidate output, and the judge’s reasoning trace. The debug loop is meant to take seconds, not minutes:

# Pull the 5 worst cases that fired the medical-slice gate on this PR
divinci ci diffs --pr 1247 --slice medical --dimension task_completion --top 5

This is the same diagnostic surface as post 6’s seven-step tree, wired into the CI feedback loop. The engineer who opened the PR sees the case-level evidence on the PR itself; they do not have to go open a separate eval dashboard.
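Selecting the worst cases is a filter and a sort over per-case scores. A minimal sketch, assuming each result row carries a slice, a dimension, and baseline and candidate scores; the field names and the function are hypothetical and do not describe the CLI's internals.

```python
def worst_diffs(results, slice_name, dimension, top=5):
    """Return the `top` cases whose score dropped most against the baseline."""
    rows = [
        r for r in results
        if r["slice"] == slice_name and r["dimension"] == dimension
    ]
    # Most negative delta first: the largest regressions lead the list.
    rows.sort(key=lambda r: r["candidate_score"] - r["baseline_score"])
    return rows[:top]
```

Sorting by the signed delta, not its absolute value, keeps improvements out of the list, so the PR comment shows only the cases that pulled the slice score down.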

Version-control discipline — prompts, datasets, judges as code

Prompt templates, golden datasets, and judge prompts all live in the repo, hash-pinned in the release manifest. The manifest is the single object that ties the suite to a specific reproducible state:

# manifests/staging.yaml — every CI run hashes this
release_id: rel-staging
model:     { sha: 0c1f9…, weights: r2://models/custom-v7.2,  open_weights: true }
prompt:    { sha: c4a8e…, template: prompts/support/v3.4.j2 }
retrieval: { sha: b21f0…, index: r2://indices/kb-2026-04 }
judge:     { sha: d8e21…, rubric: eval/rubrics/v7.yaml }
dataset:   { sha: a90b1…, file:   eval/datasets/golden-2026-04.jsonl }

When a CI run posts a score, the score is tagged with that manifest hash. When a score moves, the question “which input moved” has a direct answer: diff the manifest, and the layer that fired tells you which dimension to look at first. This is the loop the post 1 four-stage pipeline and the vIndex receipt from post 4 close together: the manifest is the audit primitive that all eight of these posts have, in different framings, been building toward.
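"Diff the manifest" reduces to comparing the SHA of each component between two runs. A minimal sketch, assuming the manifest has been loaded into a dict of component name to a mapping with a `sha` key; the function name and the placeholder SHAs are illustrative.

```python
def changed_components(old_manifest, new_manifest):
    """Return component names whose SHA differs, or that exist on one side only."""
    names = sorted(set(old_manifest) | set(new_manifest))
    return [
        name for name in names
        if old_manifest.get(name, {}).get("sha")
        != new_manifest.get(name, {}).get("sha")
    ]


# Placeholder SHAs, not real artifacts.
previous = {"model": {"sha": "aaa111"}, "prompt": {"sha": "bbb222"}, "judge": {"sha": "ccc333"}}
current = {"model": {"sha": "aaa111"}, "prompt": {"sha": "bbb999"}, "judge": {"sha": "ccc333"}}
print(changed_components(previous, current))  # ['prompt']
```

If the list comes back empty while the score still moved, nothing pinned in the manifest changed, which points toward something outside it such as the provider-side drift discussed in the limitations.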

What this does not solve

The same three honest limitations we have been writing into every post in this series.

CI does not test what is not in the suite. No matter how clever the layer cake is, the only regressions it catches are the ones some case in the golden dataset would have flagged. The replay layer mitigates this for behaviour drift, but novel queries that have never been seen still escape until they show up in production. The system has to be paired with production monitoring.
Cost numbers shift with model pricing. Every cost figure in this post depends on judge token rates, embedding rates, and inference rates that drift quarterly. The ratios — 74% / 22% / 4%, 31% / 27% / 28% / 11% / 3% — are the load-bearing claims; the dollar figures are illustrative for a moment in time.
Provider-side checkpoint changes are still hard. When a closed-API provider quietly updates the model behind a stable name, the contract layer cannot catch it; only the gate suite can, and only after the fact. We mitigate by pinning explicit checkpoint identifiers wherever the provider supports them, and by treating the day a checkpoint is announced as a triggering event for a full-suite re-baseline. We cannot prevent the underlying problem.

Wrapping the series

This is post 8 of 8. The full arc:

1. How to Build an LLM CI/CD Pipeline With Divinci AI: the four-stage pipeline (Register / Gate / Roll / Observe) that everything since has lived inside.
2. 10 CI/CD Release Failures in Custom Language Models: the named 2026 failure modes, each mapped to the stage that should have caught it.
3. 12 QA and Release Management Capabilities for LLMs: the capability matrix and the three-camps Venn that places Divinci against the alternatives.
4. Validating and Releasing Custom LMs in Regulated Fields: the compliance deep-dive, regulator-to-stage mapping, vIndex receipts.
5. Automated LLM CI/CD Pipelines With Instant Rollback: the operational layer, automation spectrum, auto-rollback receipt.
6. How to Diagnose Custom LLM QA Failures in 7 Steps: the diagnostic decision tree; the model is the right answer roughly one alert in seven.
7. Automated Regression Testing for Custom LLMs in 2026: slice-aware Spearman gates, calibrated judges, closed-loop production-trace replay.
8. This post. The CI infrastructure that makes all of the above tractable on every PR.

The pieces compose: the manifest is the audit primitive, the gates are the safety layer, the diagnostic tree is the recovery loop, the vIndex receipt is the external anchor, and the layer cake is what makes the whole thing affordable to run on every commit. If your custom-LLM release process does not have these five together, the gap is what these eight posts have been about.

FAQ

What is the cheapest test I can run on every commit?

A prompt-template render check. It runs in milliseconds, requires no judge, catches a surprising fraction of breakages, and never costs a measurable cent. If you are not running it yet, it is the single highest-ROI piece of CI we know how to recommend.

How much should I expect a custom-LLM CI pipeline to cost?

Cents per typical PR, low single dollars per release-candidate PR. The ratio depends on judge pricing and on what fraction of your PRs touch prompts or model config. The 4% release-candidate share above is typical; for products with frequent prompt iteration the share rises and the average climbs accordingly.

Should I run the full suite on every commit?

No. Path-filter to PRs that touch prompts, model config, retrieval, or eval code. For all other changes, contract + smoke is sufficient and a 12-minute wait on a README typo will lose you the team’s trust within a sprint. The full suite is precious; spend it where the change can plausibly move a quality dimension.

How do I introduce a new gate without breaking everyone?

Two-week shadow window, non-blocking. Tune thresholds on the false-positive rate observed during the shadow. Flip to blocking only when sustained false-positive rate is below your tolerance (we use 5%). Anything else is how you get a gate everyone has learned to ignore.

What is the single number I should track if I track only one?

The fraction of confirmed regressions caught before production. The histogram in this post puts that at ~97% in mature Divinci deployments. The 3% that escape are why instant rollback exists. The 97% is what the suite is for.

References

DORA / Google Cloud. "Accelerate State of DevOps — CI velocity, change-failure-rate and time-to-restore-service." The cross-industry baselines that make "12 minutes per PR is too slow" a defensible claim and not an opinion.
Zheng et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685. The empirical evidence that batched LLM-as-judge calls can preserve calibration at the batch sizes used in the smoke and full layers — the reason the cost numbers in this post are achievable.
Pope et al. "Efficiently Scaling Transformer Inference." arXiv:2211.05102. The KV-cache reuse and prefix-sharing techniques cited in the CI-fleet-sizing section.
Pan, Tianpan. "The Semver Lie: how a minor LLM update broke production." 29 April 2026. The 2026 named failure mode for aggregate-only regression suites; the reason the CI layer cake is slice-aware all the way through.
GitHub. "GitHub Actions — chaining jobs with `needs:` and conditional execution." The primitive the .yaml in this post composes against.
Zhang et al. "BERTScore: Evaluating Text Generation with BERT." arXiv:1904.09675. The heuristic semantic-similarity metric referenced as an alternative to LLM-as-judge for the cheaper layers; not what we run at gate time, but useful in the contract layer for forbidden-phrase detection at scale.
