TY - UNPB
T1 - LLMs instead of Human Judges?
T2 - A Large Scale Empirical Study across 20 NLP Evaluation Tasks
AU - Bavaresco, Anna
AU - Bernardi, Raffaella
AU - Bertolazzi, Leonardo
AU - Elliott, Desmond
AU - Fernández, Raquel
AU - Gatt, Albert
AU - Ghaleb, Esam
AU - Giulianelli, Mario
AU - Hanna, Michael
AU - Koller, Alexander
AU - Martins, André F. T.
AU - Mondorf, Philipp
AU - Neplenbroek, Vera
AU - Pezzelle, Sandro
AU - Plank, Barbara
AU - Schlangen, David
AU - Suglia, Alessandro
AU - Surikuchi, Aditya K.
AU - Takmaz, Ece
AU - Testoni, Alberto
PY - 2024/6/26
Y1 - 2024/6/26
N2 - There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
KW - cs.CL
DO - 10.48550/arXiv.2406.18403
M3 - Preprint
BT - LLMs instead of Human Judges?
PB - arXiv
ER -