Abstract
Previous work has shown that human evaluations in NLP are notoriously underpowered. Here, we argue that two common factors make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate in high-variance settings while the differences they hope to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in the high-variance settings common in evaluations of generated text. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
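The paper's released tools are not reproduced here; as a minimal sketch of the kind of simulation-based power analysis the abstract describes, the snippet below generates 5-point Likert ratings from a latent-variable model and estimates how often a test detects a small quality difference at α = 0.05. The cutpoints, effect size, and sample sizes are illustrative assumptions, and the rank-based Mann-Whitney U test merely stands in for the ordinal mixed effects models the paper advocates, contrasted with a t-test that treats the ordinal ratings as interval data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Latent-variable model for 5-point Likert ratings: a rating is produced by
# thresholding a latent normal annotator judgement at fixed cutpoints.
# These cutpoints are an illustrative assumption, not the paper's values.
CUTPOINTS = (-1.5, -0.5, 0.5, 1.5)

def simulate_ratings(n, shift):
    """Draw n ratings in 1..5 for a system whose latent quality is `shift`."""
    latent = rng.normal(loc=shift, size=n)
    return np.digitize(latent, CUTPOINTS) + 1

def estimated_power(n_items, effect, n_sims=2000, alpha=0.05):
    """Fraction of simulated experiments reaching p < alpha, for a t-test
    (treats ratings as interval) and a Mann-Whitney U test (rank-based)."""
    t_hits = u_hits = 0
    for _ in range(n_sims):
        a = simulate_ratings(n_items, 0.0)     # baseline system
        b = simulate_ratings(n_items, effect)  # slightly better system
        t_hits += stats.ttest_ind(a, b).pvalue < alpha
        u_hits += stats.mannwhitneyu(a, b).pvalue < alpha
    return t_hits / n_sims, u_hits / n_sims

for n in (30, 100, 300):
    t_pow, u_pow = estimated_power(n, effect=0.3)
    print(f"n={n:4d}  t-test power={t_pow:.2f}  Mann-Whitney power={u_pow:.2f}")
```

Running the sweep over sample sizes makes the abstract's point concrete: with subtle effects and few items, both tests are badly underpowered, and methods that respect the ordinal scale recover power that interval-scale analyses leave on the table.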
Original language | English |
---|---|
Title of host publication | 2021 Conference on Empirical Methods in Natural Language Processing |
Publisher | Association for Computational Linguistics |
Pages | 8932-8939 |
Number of pages | 8 |
ISBN (Electronic) | 9781955917094 |
Publication status | Published - 2021 |
Event | 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, Punta Cana, Dominican Republic, 7 Nov 2021 → 11 Nov 2021 |
Conference
Conference | 2021 Conference on Empirical Methods in Natural Language Processing |
---|---|
Abbreviated title | EMNLP 2021 |
Country/Territory | Dominican Republic |
City | Virtual, Punta Cana |
Period | 7/11/21 → 11/11/21 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems