What happens if you treat ordinal ratings as interval data? Human evaluations in NLP are even more under-powered than you think

David M. Howcroft, Verena Rieser

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Previous work has shown that human evaluations in NLP are notoriously under-powered. Here, we argue that there are two common factors which make this problem even worse: NLP studies usually (a) treat ordinal data as interval data and (b) operate under high variance settings while the differences they are hoping to detect are often subtle. We demonstrate through simulation that ordinal mixed effects models are better able to detect small differences between models, especially in high variance settings common in evaluations of generated texts. We release tools for researchers to conduct their own power analysis and test their assumptions. We also make recommendations for improving statistical power.
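The abstract's central contrast (treating 1-5 Likert ratings as interval data versus modeling them as ordinal) can be illustrated with a simulation-based power estimate. The sketch below is illustrative only and is not the paper's released tooling: it simulates ordinal ratings by thresholding a latent normal score, then estimates the power of a plain Welch-style test on the raw ratings, i.e. exactly the interval-data treatment the paper critiques, rather than the ordinal mixed effects models the authors advocate. All function names, thresholds, and parameters here are invented for the example.

```python
import math
import random

def simulate_ratings(n, latent_mean, thresholds=(-1.5, -0.5, 0.5, 1.5)):
    """Draw n ordinal (1-5) ratings by thresholding a latent normal score."""
    ratings = []
    for _ in range(n):
        z = random.gauss(latent_mean, 1.0)
        ratings.append(1 + sum(z > t for t in thresholds))
    return ratings

def welch_p(a, b):
    """Two-sided p-value for a Welch-style test, normal approximation.

    Treats the ratings as interval data, which is the questionable
    assumption the paper examines.
    """
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(va / len(a) + vb / len(b))
    if se == 0:
        return 1.0
    z = abs(ma - mb) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def estimated_power(delta, n=100, sims=500, alpha=0.05):
    """Fraction of simulated experiments in which the test detects
    a true latent difference of size delta between two systems."""
    hits = 0
    for _ in range(sims):
        a = simulate_ratings(n, 0.0)       # baseline system
        b = simulate_ratings(n, delta)     # slightly better system
        if welch_p(a, b) < alpha:
            hits += 1
    return hits / sims

if __name__ == "__main__":
    random.seed(0)
    # Small true differences are detected only a fraction of the time.
    print(round(estimated_power(0.2), 2))
```

Running the same simulation with an ordinal model in place of `welch_p` (e.g. a cumulative link mixed model, as the paper recommends) is what allows the kind of power comparison the abstract describes; that step is omitted here because it requires a dedicated statistics library.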

Original language: English
Title of host publication: 2021 Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 8932-8939
Number of pages: 8
ISBN (Electronic): 9781955917094
Publication status: Published - 2021
Event: 2021 Conference on Empirical Methods in Natural Language Processing - Virtual, Punta Cana, Dominican Republic
Duration: 7 Nov 2021 - 11 Nov 2021

Conference

Conference: 2021 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2021
Country/Territory: Dominican Republic
City: Virtual, Punta Cana
Period: 7/11/21 - 11/11/21

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
