How to Diagnose Custom LLM QA Failures in 7 Steps

Notes from the Release Cycle — Part VI

A scored-QA suite started flagging a customer’s medical-Q&A model. The headline number — aggregate quality across all slices — dropped 6 points overnight. The team spent two days debugging the model. They re-ran fine-tunes. They rolled back to the prior release. The numbers didn’t move.

On the morning of day three, somebody noticed the eval suite had been updated the same night the regression started. Three new pediatric-dosage prompts had been added to the test set, and the model had never seen pediatric dosage in training. The “QA failure” wasn’t a model regression. It was a slice-coverage event: the eval started asking about something the model was never supposed to know.

Across our customer rollouts, this is the dominant pattern. A “QA failure” alert is the symptom. The cause is the model roughly one time in seven. The other six times, the bug is somewhere upstream: in the eval design, in the judge calibration, in the prompt SHA, in the preprocessing pipeline, in the dataset version, or in the retrieval index. Each of those classes of bug looks identical from the alert — a number went down — but has a completely different fix.

This post is the diagnostic tree we walk in order when an alert fires. Six steps that rule out non-model causes, before the seventh step considers the model itself. Each step has a concrete API call or query that answers it. By the time you’ve completed the six, you either know exactly what to fix, or you’ve earned the right to look at the model.

The decision tree

The tree is sequential because the steps are cheap-to-expensive. Step 1 is a git diff of the eval suite; step 7 is a fine-tune cycle. You want to spend ten minutes on each of the six cheap checks before spending a week on the expensive one.
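The cheap-to-expensive ordering can be sketched as a list of checks that is walked until one of them reports a finding. This is a minimal illustration, not the pipeline's actual implementation: `Step`, `walk`, and the lambda checks are hypothetical names standing in for the real API calls and queries described in each step.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    # A check returns a short finding string, or None if the step is clean.
    check: Callable[[], Optional[str]]


def walk(steps: list[Step]) -> tuple[str, str]:
    """Run checks in order and stop at the first one that reports a finding."""
    for step in steps:
        finding = step.check()
        if finding is not None:
            return step.name, finding
    # Only reached when every upstream check came back clean.
    return "model", "all upstream checks clean; evaluate the model itself"


# Hypothetical checks: each lambda stands in for a real diff or query.
steps = [
    Step("eval coverage", lambda: None),
    Step("judge calibration", lambda: None),
    Step("prompt template", lambda: "prompt_template_ref differs between releases"),
    Step("preprocessing", lambda: None),
    Step("dataset", lambda: None),
    Step("retrieval index", lambda: None),
]

culprit, finding = walk(steps)
print(f"{culprit}: {finding}")
```

Stopping at the first finding keeps the expensive later steps from running until the cheap ones are clean. After a fix, the walk is repeated from the top in case a second cause is hiding behind the first.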

Step 1 — Did the eval cover this slice?

The symptom. Aggregate quality drops, but the per-slice breakdown shows one slice cratering while the others are flat. Or — more confusingly — every slice drops slightly, all by similar amounts.

The diagnostic. Diff the eval suite manifest SHA against the prior release’s. If the eval suite changed and you didn’t change the model, the regression is in the eval, not the model.

# Compare the eval-suite manifest SHA between releases
curl https://api.divinci.ai/v1/releases/rel_a01c66 | jq '.eval_suite_sha256'
curl https://api.divinci.ai/v1/releases/rel_8f72b1 | jq '.eval_suite_sha256'
# Different? Your eval changed. Audit what was added.

The fix. Either revert the eval-suite change (if it was unintentional), or expand training coverage to match the new eval (if the new slice is a real production concern). Don’t ship a model regression fix for an eval coverage problem — you’ll make the model worse on what it actually used to do well.

Where this hides in our pipeline. Stage 1 — Register binds the eval-suite SHA into the release manifest. The diagnostic above is just diffing two manifests. The reason the bug took the medical-Q&A team two days is that they had no manifest-level diff — they were comparing model checkpoints, not release manifests.
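A manifest-level diff can be as small as the sketch below. It assumes both release manifests have already been fetched and parsed into dictionaries. The field names are the ones queried with `jq` in this post; `diff_manifests` and the sample values are illustrative placeholders, not real SHAs.

```python
# Manifest fields this post treats as pinned inputs to a release.
PINNED_FIELDS = [
    "eval_suite_sha256",
    "prompt_template_ref",
    "preprocessing_ref",
    "dataset_version",
    "retrieval_index_ref",
]


def diff_manifests(prior: dict, failing: dict, fields=PINNED_FIELDS) -> dict:
    """Return {field: (prior_value, failing_value)} for each pinned field that differs."""
    return {
        field: (prior.get(field), failing.get(field))
        for field in fields
        if prior.get(field) != failing.get(field)
    }


# Made-up manifests: only the eval suite changed between the two releases.
prior = {
    "model_ref": "m-1",
    "eval_suite_sha256": "aaa111",
    "prompt_template_ref": "p-7",
    "preprocessing_ref": "pre-3",
    "dataset_version": "d-9",
    "retrieval_index_ref": "idx-4",
}
failing = dict(prior, eval_suite_sha256="bbb222")

changed = diff_manifests(prior, failing)
print(changed)
```

If the result contains `eval_suite_sha256` while `model_ref` is unchanged, the alert points at the eval suite and not at the model.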

Step 2 — Is the judge calibrated to humans on this slice?

The symptom. A slice that’s new to the eval suite scores poorly, but human review of the model’s outputs on that slice rates them as fine. The judge thinks the model is failing; humans don’t.

The diagnostic. Compute Spearman ρ between the LLM judge’s ratings and a small human-rated sample (50 items) on the failing slice. If ρ < 0.4, the judge is not measuring what humans measure on this slice.

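# Replace <judge_id> with your judge's ID. The URL is quoted so the shell
# does not treat < and > as redirection.
curl -X POST 'https://api.divinci.ai/v1/judges/<judge_id>/calibrate' \
  -d '{ "slice": "pediatric-oncology-dosing", "human_ratings_csv": "..." }'
# → { "spearman_rho": 0.31, "interpretation": "judge_uncalibrated_for_slice" }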
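To sanity-check the calibration number offline, Spearman ρ can be computed with the standard library alone: rank both score lists (tied values share their average rank) and take the Pearson correlation of the ranks. The sketch below is an illustration under those assumptions, not the calibrate endpoint's implementation. The function names and the eight-item sample are made up; a real check would use the roughly 50 human-rated items described above.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5


RHO_THRESHOLD = 0.4  # cutoff used in this step


def judge_is_calibrated(judge_scores, human_scores, threshold=RHO_THRESHOLD):
    """True when the judge's ranking tracks the human ranking closely enough."""
    return spearman_rho(judge_scores, human_scores) >= threshold


# Made-up ratings for one slice (1-5 scale), far fewer than a real sample.
judge = [5, 4, 4, 2, 1, 3, 5, 2]
human = [2, 5, 1, 4, 3, 3, 1, 5]
print(round(spearman_rho(judge, human), 2), judge_is_calibrated(judge, human))
```

Running the check per slice, not once over the whole eval set, is what surfaces a judge that is fine on average but wrong on the slice that triggered the alert.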

The fix. Either select a different judge model for this slice, or use a chain-of-judges with an arbiter. MT-Bench^[1] shows GPT-4-as-judge agrees with humans more than 80% of the time on average, but per-category agreement ranges from 86% (coding) down to 36–44% (writing/humanities). That spread is the operative number; “good on average” hides slices where the judge is wrong.

Where this hides in our pipeline. Stage 2 — Gate demands a calibrated judge per slice. The Calibrating the AI Judge post documents the procedure. If the slice was added to the eval without a calibration step, you have a structurally untrustworthy gate.

Step 3 — Does the prompt template SHA match production?

The symptom. Quality drops but the model_ref and dataset_ref are unchanged. Nothing about training changed. The model is the same model. And yet.

The diagnostic. Compare the prompt_template_ref SHA in the failing release manifest against the prior release’s. A 38-character edit to a system prompt that “improves brevity” can change downstream behavior more than a full retrain.

curl https://api.divinci.ai/v1/releases/rel_a01c66 | jq '.prompt_template_ref'
curl https://api.divinci.ai/v1/releases/rel_8f72b1 | jq '.prompt_template_ref'
# Different? Pull the diff. Look at it carefully.
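Once the two refs differ, the next move is to read the actual text change. A minimal sketch using Python's standard `difflib` and `hashlib` follows. It assumes you can fetch both prompt templates as plain strings; the two prompts and the helper names are invented for illustration.

```python
import difflib
import hashlib


def sha256_text(text: str) -> str:
    """Content hash of a prompt template."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_diff(prior: str, failing: str) -> list[str]:
    """Unified diff of two prompt templates, one list entry per diff line."""
    return list(
        difflib.unified_diff(
            prior.splitlines(),
            failing.splitlines(),
            fromfile="prior",
            tofile="failing",
            lineterm="",
        )
    )


# Invented templates: a small "brevity" edit that drops a citation instruction.
prior_prompt = "You are a clinical assistant.\nCite the dosing guideline you used.\n"
failing_prompt = "You are a clinical assistant.\nBe brief.\n"

if sha256_text(prior_prompt) != sha256_text(failing_prompt):
    for line in prompt_diff(prior_prompt, failing_prompt):
        print(line)
```

A short edit like this one is easy to miss in a dashboard and obvious in a diff, which is the argument for keeping prompts under the same review process as code.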

The fix. Treat prompts as code. The 10 release failures post covers the dashboard-edit failure mode — Tianpan’s Semver Lie postmortem^[2] names this as the dominant 2026 failure pattern. If you can prove the prompt changed, you’ve found your bug.

Step 4 — Does the preprocessing pipeline match production?

The symptom. Model passes eval locally. Same model fails the same eval in production. Same model_ref, same prompt, same dataset.

The diagnostic. Pull the preprocessing_ref SHA from the production manifest and verify the eval ran with the same one. The classic case: training normalizes whitespace and lowercases; production doesn’t. The eval ran through the production preprocessing, so everything checked out, until somebody updated preprocessing on one side only.

curl https://api.divinci.ai/v1/releases/rel_a01c66 | jq '.preprocessing_ref'
# Compare to the preprocessing your training/eval harness actually used.

The fix. Containerize preprocessing as a versioned artifact. Reference it from the manifest. Refuse to deploy if the gate’s preprocessing SHA differs from production’s.
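The "refuse to deploy" rule can be expressed as a small guard that runs before promotion. This is a sketch under the assumption that both the gate run and production expose a manifest dictionary with a `preprocessing_ref` field; `PreprocessingMismatch` and `assert_same_preprocessing` are hypothetical names, not part of any existing API.

```python
class PreprocessingMismatch(Exception):
    """Raised when the gate and production do not share one preprocessing artifact."""


def assert_same_preprocessing(gate_manifest: dict, production_manifest: dict) -> None:
    gate = gate_manifest.get("preprocessing_ref")
    prod = production_manifest.get("preprocessing_ref")
    # An unpinned side is treated as a failure: you cannot compare what was not bound.
    if gate is None or prod is None:
        raise PreprocessingMismatch("preprocessing_ref is not pinned on both sides")
    if gate != prod:
        raise PreprocessingMismatch(f"gate ran {gate!r}, production runs {prod!r}")


# Placeholder refs for illustration.
try:
    assert_same_preprocessing(
        {"preprocessing_ref": "pre-3"},
        {"preprocessing_ref": "pre-4"},
    )
except PreprocessingMismatch as err:
    print(f"deploy blocked: {err}")
```

Failing closed when either side is unpinned is a deliberate choice here: a missing ref is the same blind spot as a mismatched one.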

Step 5 — Does the dataset SHA match production?

The symptom. Gate-eval scores from the failing release are different from gate-eval scores from the same model the day before.

The diagnostic. Diff the dataset_version field between the two releases. The typical case: the eval suite kept the same name, but the dataset content was updated and re-tagged. Same name, different SHA, different scores.

diff <(curl .../rel_a01c66 | jq '.dataset_version') \
     <(curl .../rel_8f72b1 | jq '.dataset_version')

The fix. Content-hash your datasets. The dataset name is not a version; the SHA is.
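One way to content-hash a dataset is to serialize each record canonically and hash the result. The sketch below assumes the dataset fits in memory as a list of JSON-serializable records and that record order carries no meaning, so records are sorted before hashing. Both are assumptions to adjust for your own data, and `dataset_sha256` is an illustrative name.

```python
import hashlib
import json


def dataset_sha256(records: list[dict]) -> str:
    """Hash dataset content, independent of record order and key order."""
    # Canonical form: sorted keys inside each record, then sorted records.
    lines = sorted(
        json.dumps(record, sort_keys=True, ensure_ascii=False) for record in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Same name, different content: one answer was edited between the two versions.
monday = [{"q": "max adult dose?", "a": "4 g/day"}, {"q": "interval?", "a": "6 h"}]
tuesday = [{"q": "max adult dose?", "a": "3 g/day"}, {"q": "interval?", "a": "6 h"}]

print(dataset_sha256(monday) == dataset_sha256(tuesday))
```

If order does matter for your eval (for example, few-shot examples drawn by position), drop the outer `sorted` so a reorder also changes the hash.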

Step 6 — Does the retrieval index SHA match production?

The symptom. For RAG workloads only. Quality drops on questions that depend on retrieved context. Direct-answer questions are unchanged.

The diagnostic. Pull the retrieval_index_ref SHA from the manifest and compare it against the retrieval index the gate evaluation used. The typical case: the RAG index updated overnight (a fresh ingestion run), the gate evaluation cached an older retrieval, and the production canary used the new one.

curl https://api.divinci.ai/v1/releases/rel_a01c66 | jq '.retrieval_index_ref'

The fix. Bind the retrieval index SHA into the manifest, exactly the way we bind preprocessing. AutoRAG’s automated index rotation cadence makes this especially worth checking — the index will update on you whether you authorized it or not, if you’re not pinning it.
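The symptom for this step (retrieval-dependent questions drop, direct-answer questions hold) can be checked by splitting per-question score changes into those two groups. The sketch assumes each eval row is tagged with a boolean `needs_retrieval` and carries a score from both releases. That tag, the row layout, and the scores are hypothetical.

```python
from statistics import mean


def delta_by_group(rows):
    """Mean (failing - baseline) score change, split by retrieval dependence."""
    groups = {True: [], False: []}
    for row in rows:
        groups[bool(row["needs_retrieval"])].append(row["failing"] - row["baseline"])
    return {
        "retrieval": mean(groups[True]) if groups[True] else None,
        "direct": mean(groups[False]) if groups[False] else None,
    }


# Made-up judge scores on a 1-5 scale.
rows = [
    {"needs_retrieval": True, "baseline": 4, "failing": 2},
    {"needs_retrieval": True, "baseline": 5, "failing": 3},
    {"needs_retrieval": False, "baseline": 4, "failing": 4},
    {"needs_retrieval": False, "baseline": 3, "failing": 3},
]

print(delta_by_group(rows))
```

A large negative change in the retrieval group next to a flat direct group is a reason to compare the index refs before touching the model.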

Step 7 — The model itself, finally

Six steps in. The eval covers the slice. The judge is calibrated. The prompt SHA matches. The preprocessing matches. The dataset matches. The retrieval index matches.

Now — and only now — you have earned the right to look at the model.

The diagnostic for this step is a per-slice Spearman comparison against the prior release, with both releases evaluated against the same manifest-pinned dataset, preprocessing, and retrieval. The number you produce is a clean signal: a real per-slice regression, with no upstream confounders.

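# Replace <failing_id> and <prior_id> with real release IDs. The URL is quoted
# so the shell does not treat < and > as redirection.
curl -X POST 'https://api.divinci.ai/v1/releases/<failing_id>/diff-eval' \
  -d '{ "baseline_release_id": "<prior_id>", "slices": ["legal-IP-licensing"] }'
# → { "spearman_rho_failing": 0.41, "spearman_rho_baseline": 0.68, "delta": -0.27 }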
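The delta in that response is plain subtraction per slice. The sketch below applies it across slices and keeps only the ones that dropped by more than a chosen margin. It assumes the per-slice scores for both releases are already available as dictionaries. The `min_drop` default of 0.1 is an arbitrary illustration, not a recommended threshold, and the "general" slice is invented.

```python
def slice_regressions(baseline: dict, failing: dict, min_drop: float = 0.1) -> dict:
    """Return {slice: delta} for slices whose score fell by at least min_drop."""
    regressions = {}
    # Only slices present in both releases can be compared.
    for name in baseline.keys() & failing.keys():
        delta = round(failing[name] - baseline[name], 4)
        if delta <= -min_drop:
            regressions[name] = delta
    return regressions


# The first pair mirrors the example response above.
baseline = {"legal-IP-licensing": 0.68, "general": 0.70}
failing = {"legal-IP-licensing": 0.41, "general": 0.69}

print(slice_regressions(baseline, failing))
```

The comparison is only meaningful when both releases were scored against the same pinned dataset, preprocessing, and retrieval index, which is what the six earlier steps establish.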

If the delta confirms a real regression: auto-rollback fires (if you didn’t already manually invoke it), and the failing model gets re-trained against an expanded slice-coverage corpus. If the gate that promoted this release missed the slice in the first place, the gate is also the bug — capability 4 missing from your release pipeline.

What the distribution actually looks like

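The “1 in 7” framing earlier wasn’t a rhetorical device. It’s roughly the distribution we see across customer rollouts: out of every seven QA alerts, about one traces to the model and six to something upstream.

[Pie chart: root-cause distribution of QA alerts across customer rollouts]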

The two biggest slices are eval coverage gap and judge miscalibration. Together they account for half of QA alerts. Neither is a model problem. Both are problems with how you measure the model.

What this doesn’t solve

Three honest limitations:

The distribution is ours, not yours. The percentages above are from our customer cohort and our pipeline’s tooling. If you run a different kind of workload — heavy multi-modal, heavy agent-orchestrated, heavy single-shot generative — your distribution will look different. The diagnostic order should still hold; the numbers behind each step will not.

Step 1’s “eval coverage gap” is the hardest to fix. Adding the missing slice to your training corpus, building a calibrated judge for it, and re-running the canary is itself a multi-week project. The diagnostic is fast; the remediation is not.

A real regression can ride a non-model bug. The cases where you have both a prompt drift AND a real model regression are the worst ones, because step 3 finds the prompt drift, you fix it, and the alert re-fires. The diagnostic order in this post handles them but adds elapsed time. There’s no shortcut for “the bug was in two places at once.”

FAQ

Why does my LLM produce different outputs for similar prompts?

Prompt sensitivity is real, but it’s not always the cause of a QA alert — sometimes it’s a symptom of an upstream bug. Walk the diagnostic. If the prompt template SHA matches and the preprocessing matches and the dataset matches, then yes — the model has wide variance on this slice and you need a more deterministic decoding path or a different judge. If anything upstream changed, fix that first.

How often should you re-evaluate your LLM benchmarks?

Re-evaluate the benchmark content every time a production slice’s traffic shape changes meaningfully. Re-evaluate the benchmark’s judge calibration every quarter, at minimum — judge models drift faster than you’d think. The biggest source of false QA alerts is a benchmark that was last validated 18 months ago and is now measuring a thing your production no longer does.

What causes hallucinations in custom language models?

Hallucinations have multiple upstream causes — retrieval failures (step 6 in the tree above), training-coverage gaps (step 1, indirectly), and decoding-path issues (a real model concern in step 7). AutoRAG addresses the retrieval-side causes by binding the retrieval index into the release manifest and re-pinning on every release. The other two require pipeline-level fixes upstream of the model.

How do you know if your training data is the problem?

If the dataset SHA in the failing release matches the dataset SHA in the prior good release (step 5 of the tree), the data isn’t the immediate cause. If they differ, you’ve found it. The harder question — “is the dataset complete for our production slice coverage?” — is what step 1 tests. A dataset that’s complete for the eval but incomplete for production traffic is a slice-coverage problem.

Can you fix QA failures without retraining the entire model?

Yes — six out of seven times, the fix is not a retrain. Steps 1–6 in the tree have fixes that don’t touch the model: update the eval, recalibrate the judge, re-register the prompt SHA, fix preprocessing, re-pin the dataset, or re-pin the retrieval index. Retraining is step 7, the most expensive fix, reserved for actual model regressions. The release pipeline’s audit trail lets you do these upstream fixes with the same provenance discipline you’d use for a model change.

References

[1] LLM-as-judge per-category variance. Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023). >80% overall GPT-4-vs-human agreement with per-category variance from coding (86%) down to writing (36–44%). Anchor for step 2: why judge calibration has to be measured per slice rather than assumed from a published headline number.
[2] The Semver Lie. Tianpan, The Semver Lie: how an LLM minor update breaks production (April 2026). The dominant 2026 failure-mode writeup. Names dashboard-edit prompt drift as the most-cited cause of production LLM incidents. Anchor for step 3.
[3] NIST AI RMF, Measure function. NIST AI Risk Management Framework. The "Measure" function explicitly covers benchmark-validity and evaluation-coverage as part of governance, not as a separate engineering concern. Cited as the institutional anchor for treating eval design as the first diagnostic step.
[4] RAGAS: retrieval-augmented generation evaluation. Es et al., RAGAS: Automated Evaluation of Retrieval Augmented Generation (arXiv:2309.15217). The reference framework for RAG-side evaluation. Anchor for step 6: separating retrieval failures from generation failures requires a RAG-aware eval discipline.
[5] Internal: root-cause distribution across customer rollouts. The percentages in the pie chart are our internal observation across Divinci customer rollouts, not from a controlled benchmark. Your distribution will vary by workload type, fine-tune cadence, and team discipline. The relative ordering (steps 1–2 dominating) is stable across the cohort we've measured; the exact percentages are not portable to your environment without your own data.
[6] The four-stage release pipeline. How to Build an LLM CI/CD Pipeline With Divinci AI. Each diagnostic step in this post corresponds to a manifest field bound at Stage 1 (Register). Without the manifest discipline upstream, the diagnostic loses its grip; you can't diff what you didn't bind.

Next in this series: Automated Regression Testing for Custom LLMs in 2026. This post is about diagnosis after a QA alert fires. The next is about the regression-testing discipline that drove the alert in the first place — what to put in the eval, how to keep it honest, and what to do when the regression test starts disagreeing with your judge.
