Most AI evaluation tools let LLMs judge LLMs without any human anchor. ScoredQA Calibration flips that: a domain expert rates 50 answers by hand, and we compute Spearman ρ between those ratings and each LLM judge's scores to find which judge actually agrees with the human before we trust its scores.
ScoredQA · Calibration · Evaluation · Spearman · RAG Routing · LLM-as-Judge · Human-in-the-Loop
Read More

We ran a 200-item RAG arena on the AskTheDoctor corpus across three models and two retrieval configurations. The headline (v2-atd ≈ Llama 4 Scout, both at ~0.58) is interesting. The methodology footnote is more interesting: we then re-judged 415 of those answers with two different LLM judges and got Spearman ρ = 0.55 between them. That number is the case for human calibration.
RAG-Arena · ScoredQA · RAG Routing · EXIT · LLM-as-Judge · Spearman · Evaluation · QLoRA
Read More
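The calibration step described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the rating values below are invented, and the judge names are placeholders. Spearman ρ is just Pearson correlation computed on ranks (with ties averaged), so it needs nothing beyond the standard library:

```python
from statistics import mean

def ranks(xs):
    """Assign average ranks (1-based), splitting ties evenly."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(a, b):
    """Spearman ρ = Pearson correlation of the two rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

# Invented example: expert ratings for 6 answers, plus two LLM judges.
human   = [5, 2, 4, 1, 3, 5]
judges = {
    "judge_a": [4, 2, 5, 1, 3, 4],  # hypothetical scores
    "judge_b": [1, 5, 2, 4, 3, 2],  # hypothetical scores
}
# Trust the judge whose ranking best agrees with the human's.
best = max(judges, key=lambda name: spearman(human, judges[name]))
print(best, {n: round(spearman(human, s), 2) for n, s in judges.items()})
```

In practice the same comparison runs over the expert's 50 hand-rated answers; ties matter because judges often score on a coarse 1–5 scale, which is why the rank function averages tied positions.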