
RAG-Arena

Posts tagged "RAG-Arena" (1 post)

Inside the RAG Arena: When the Judges Don't Agree

We ran a 200-item RAG arena on the AskTheDoctor corpus across three models and two retrieval configurations. The headline (v2-atd ≈ Llama 4 Scout, both at ~0.58) is interesting. The methodology footnote is more interesting: we then re-judged 415 of those answers with two different LLM judges and got Spearman ρ = 0.55 between them. That number is the case for human calibration.
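For readers who want to see what a judge-agreement number like ρ = 0.55 measures, here is a minimal sketch of Spearman rank correlation between two judges' scores. It is an illustration, not the pipeline used in the post: the function names `average_ranks` and `spearman_rho` and the scores in `judge_a` and `judge_b` are hypothetical, and ties are handled with average ranks, which is an assumption about how the judges' scores would be treated.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(scores_a, scores_b):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    ranks_a, ranks_b = average_ranks(scores_a), average_ranks(scores_b)
    mean_a, mean_b = mean(ranks_a), mean(ranks_b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean_a) ** 2 for x in ranks_a)
    var_b = sum((y - mean_b) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("rho is undefined when one judge gives constant scores")
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores from two LLM judges on the same six answers.
    judge_a = [5, 4, 4, 2, 1, 3]
    judge_b = [4, 5, 3, 2, 2, 1]
    print(f"Spearman rho = {spearman_rho(judge_a, judge_b):.2f}")
```

A value of 1.0 means the two judges rank the answers identically, and a value near 0 means their rankings are unrelated. Because the statistic only looks at ranks, it is insensitive to one judge scoring systematically higher than the other.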
