Calibrating the Judge: The Grader Gets Graded
ScoredQA Calibration: a domain expert rates 50 answers, we compute Spearman ρ vs each LLM judge, and pick the judge that actually agrees.
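The calibration step described above can be sketched in a few lines. This is a minimal illustration and not the post's actual code: the function names (`average_ranks`, `spearman_rho`, `pick_judge`), the judge labels, and the toy scores are all assumptions, and it uses six made-up answers where the post uses 50 expert-rated ones. Spearman ρ is computed here as the Pearson correlation of average ranks, which handles tied ratings.

```python
from statistics import mean

def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman rho as the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when a rater gives every answer the same score")
    return cov / (var_x * var_y) ** 0.5

def pick_judge(expert_scores, judge_scores):
    """Return the judge whose scores rank-correlate best with the expert."""
    rhos = {name: spearman_rho(expert_scores, s) for name, s in judge_scores.items()}
    return max(rhos, key=rhos.get), rhos

# Hypothetical toy data: one expert and two made-up judges.
expert = [1, 2, 3, 4, 5, 3]
judges = {
    "judge_a": [1, 2, 3, 4, 5, 3],  # same ordering as the expert
    "judge_b": [5, 4, 3, 2, 1, 3],  # reversed ordering
}
best, rhos = pick_judge(expert, judges)
```

With real data, `expert` would hold the domain expert's ratings and each entry in `judges` would hold one LLM judge's scores for the same answers, in the same order.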
Posts in tags: "Spearman" (2 posts)
Inside the RAG Arena: When the Judges Don't Agree
A 200-item RAG arena tied at the mean, but two LLM judges agreed only at Spearman ρ=0.55. They aren't measuring the same thing.