Why We Need New Evaluation Metrics for NLG

Jekaterina Novikova, Ondrej Dusek, Amanda Cercas Curry, Verena Rieser

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

49 Citations (Scopus)

Abstract

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of outputs generated by data-driven, end-to-end NLG systems. We also show that metric performance is data- and system-specific. Nevertheless, our results suggest that automatic metrics perform reliably at system level and can support system development by finding cases where a system performs poorly.
Original language: English
Title of host publication: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Publisher: Association for Computational Linguistics
Pages: 2231-2242
Number of pages: 12
ISBN (Electronic): 978-1-945626-83-8
Publication status: Published - 10 Sep 2017
Event: 2017 Conference on Empirical Methods in Natural Language Processing, Øksnehallen, Copenhagen, Denmark
Duration: 9 Sep 2017 - 11 Sep 2017

Conference

Conference: 2017 Conference on Empirical Methods in Natural Language Processing
Abbreviated title: EMNLP 2017
Country: Denmark
City: Copenhagen
Period: 9/09/17 - 11/09/17

Keywords

  • natural language generation
  • natural language processing
  • evaluation
  • evaluation metrics
