Twenty Years of Confusion in Human Evaluation: NLG Needs Evaluation Sheets and Standardised Definitions

David M. Howcroft*, Anya Belz, Miruna Clinciu, Dimitra Gkatzia, Sadid A. Hasan, Saad Mahamood, Simon Mille, Emiel van Miltenburg, Sashank Santhanam, Verena Rieser

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

29 Citations (Scopus)
8 Downloads (Pure)

Abstract

Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.

Original language: English
Title of host publication: Proceedings of the 13th International Conference on Natural Language Generation
Editors: Brian Davis, Yvette Graham, John Kelleher, Yaji Sripada
Publisher: Association for Computational Linguistics
Pages: 169-182
Number of pages: 14
ISBN (Electronic): 9781952148545
Publication status: Published - Dec 2020
Event: 13th International Conference on Natural Language Generation 2020 - Virtual, Dublin, Ireland
Duration: 15 Dec 2020 – 18 Dec 2020

Conference

Conference: 13th International Conference on Natural Language Generation 2020
Abbreviated title: INLG 2020
Country/Territory: Ireland
City: Virtual, Dublin
Period: 15/12/20 – 18/12/20

ASJC Scopus subject areas

  • Language and Linguistics
  • Software
  • Linguistics and Language
