Title: Judging LLM-as-a-judge with MT-Bench and Chatbot Arena
Authors: Lianmin Zheng et al.
Word Count: Approximately 6,400 words
Estimated Read Time: 22-24 minutes
Summary:
The paper proposes using strong LLMs as judges to evaluate LLM-based chat assistants in a scalable way. The authors examine the usage and limitations of LLM-as-a-judge, including position bias, verbosity bias, and limited reasoning ability. They verify agreement between LLM judges and human preferences on two benchmarks: MT-Bench, a multi-turn question set, and Chatbot Arena, a crowdsourced battle platform.
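Position bias, in particular, can be probed by judging the same answer pair in both presentation orders and checking whether the verdict survives the swap. Below is a minimal sketch of such a consistency check; the `judge(question, first_answer, second_answer)` wrapper around an LLM call is a hypothetical helper, not the paper's code.

```python
# Sketch: probing position bias by judging the same answer pair in both orders.
# `judge` is a hypothetical callable wrapping an LLM judge; it returns
# "A" (first answer wins), "B" (second answer wins), or "tie".

def is_position_consistent(judge, question: str, answer_1: str, answer_2: str) -> bool:
    """Return True if the judge's verdict is unchanged when answer order is swapped."""
    verdict_forward = judge(question, answer_1, answer_2)   # answer_1 shown first
    verdict_swapped = judge(question, answer_2, answer_1)   # answer_2 shown first

    # Map the swapped verdict back to the original labeling before comparing.
    remap = {"A": "B", "B": "A", "tie": "tie"}
    return verdict_forward == remap[verdict_swapped]
```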
The results show that GPT-4 achieves over 80% agreement with human preferences, matching the level of human-human agreement. This suggests that LLM-as-a-judge is a promising alternative to costly human evaluations.
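The reported agreement figure can be read as a simple match rate over shared verdicts. A minimal sketch, assuming verdicts ("A", "B", or "tie") have already been collected as parallel lists for the LLM judge and the human annotators (the paper also reports variants of this metric that handle ties differently):

```python
# Sketch: agreement rate between an LLM judge and human voters over the same
# answer pairs. `judge_votes` and `human_votes` are assumed parallel lists of
# verdicts ("A", "B", or "tie").

def agreement_rate(judge_votes: list[str], human_votes: list[str]) -> float:
    """Fraction of answer pairs on which the judge and the humans pick the same verdict."""
    assert len(judge_votes) == len(human_votes) and judge_votes
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)

# Example: 4 of 5 verdicts match -> 0.8 agreement.
print(agreement_rate(["A", "B", "tie", "A", "B"], ["A", "B", "tie", "A", "A"]))
```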
The authors explore variations of LLM-as-a-judge: pairwise comparison, single answer grading, and chain-of-thought/reference-guided judging. They also examine finetuning a Vicuna base model as a judge.
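The pairwise-comparison variant can be sketched as a prompt template plus a verdict parser. The prompt wording, the `call_llm` helper, and the `[[A]]`/`[[B]]`/`[[C]]` verdict tags below are illustrative assumptions rather than the paper's exact templates.

```python
# Sketch of pairwise-comparison judging: prompt a strong LLM to compare two
# answers and parse its verdict. The prompt text and `call_llm` helper are
# illustrative assumptions, not the paper's released templates.

PAIRWISE_PROMPT = """You are an impartial judge. Compare the two AI assistant
answers to the user question below. Consider helpfulness, relevance, accuracy,
and level of detail. Do not let answer length or presentation order influence
your decision. Output your verdict as "[[A]]", "[[B]]", or "[[C]]" for a tie.

[Question]
{question}

[Answer A]
{answer_a}

[Answer B]
{answer_b}
"""

def pairwise_judge(call_llm, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" based on the judge model's response text."""
    response = call_llm(PAIRWISE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    if "[[A]]" in response:
        return "A"
    if "[[B]]" in response:
        return "B"
    return "tie"
```

Single-answer grading replaces the two-answer comparison with a prompt asking for a numeric score for one answer, and reference-guided judging additionally supplies a reference solution in the prompt.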
The authors release MT-Bench, with its 80 multi-turn questions, along with Chatbot Arena data comprising 30K conversations with human preferences. They argue for a hybrid evaluation framework combining standardized capability benchmarks with LLM-as-a-judge preference evaluations.
Evaluating several model variants (LLaMA- and Vicuna-based models) on MMLU, TruthfulQA, and MT-Bench shows that these benchmarks capture complementary aspects of model quality, indicating the need for a comprehensive evaluation approach.
In summary, the paper provides empirical evidence that LLMs can serve as scalable proxies for human preferences in chatbot evaluation. However, further work is needed to mitigate biases and improve LLM judging models.
Potential Use: LLM-as-a-judge can enable fast, automated assessments of LLMs' helpfulness, relevance and instruction-following ability in human-aligned dialogue systems. The proposed benchmarks and finetuning methods can be used to improve existing dialogue models.