RankME: Reliable Human Ratings for Natural Language Generation

Jekaterina Novikova, Ondrej Dusek, Verena Rieser

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
Original language: English
Title of host publication: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)
Publisher: Association for Computational Linguistics
Pages: 72-78
Number of pages: 7
ISBN (Electronic): 9781948087292
DOIs: 10.18653/v1/N18-2012
Publication status: Published - 1 Jun 2018
Event: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - New Orleans, United States
Duration: 1 Jun 2018 - 6 Jun 2018

Conference

Conference: 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics
Abbreviated title: NAACL HLT 2018
Country: United States
City: New Orleans
Period: 1/06/18 - 6/06/18


Cite this

Novikova, J., Dusek, O., & Rieser, V. (2018). RankME: Reliable Human Ratings for Natural Language Generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers) (pp. 72-78). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-2012