Abstract
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
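The core comparison described in the abstract, namely how closely an automatic word-based metric such as BLEU tracks human ratings of generated outputs, can be illustrated with a small sketch. The data below (`system_outputs`, `references`, `human_scores`) is entirely made up for illustration and is not the paper's dataset; the sketch simply computes sentence-level BLEU with NLTK and its Spearman correlation with the hypothetical human scores.

```python
# A minimal, illustrative sketch (not the paper's exact pipeline): score a few
# hypothetical NLG outputs with sentence-level BLEU and check how well the
# metric agrees with (made-up) human ratings via Spearman correlation.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

# Hypothetical system outputs, references, and mean human ratings (illustrative only)
system_outputs = [
    "there is a cheap restaurant in the city centre",
    "the wrestlers is a pub near the river",
    "x serves food",
]
references = [
    ["there is an inexpensive restaurant in the centre of town"],
    ["the wrestlers is a pub located near the river"],
    ["x is a restaurant serving fast food"],
]
human_scores = [4.5, 5.0, 2.0]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu_scores = [
    sentence_bleu([r.split() for r in refs], out.split(), smoothing_function=smooth)
    for out, refs in zip(system_outputs, references)
]

# Sentence-level agreement between the automatic metric and human judgements
rho, p = spearmanr(bleu_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```

A correlation computed this way at the level of individual outputs is what the paper finds to be weak, whereas averaging metric scores and human ratings per system before comparing them corresponds to the system-level analysis mentioned above.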
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing |
| Publisher | Association for Computational Linguistics |
| Pages | 2231-2242 |
| Number of pages | 12 |
| ISBN (Electronic) | 978-1-945626-83-8 |
| DOIs | |
| Publication status | Published - 10 Sept 2017 |
| Event | 2017 Conference on Empirical Methods in Natural Language Processing, Øksnehallen, Copenhagen, Denmark (9 Sept 2017 – 11 Sept 2017) |
Conference
| Conference | 2017 Conference on Empirical Methods in Natural Language Processing |
|---|---|
| Abbreviated title | EMNLP 2017 |
| Country/Territory | Denmark |
| City | Copenhagen |
| Period | 9/09/17 → 11/09/17 |
Keywords
- natural language generation
- natural language processing
- evaluation
- evaluation metrics
Datasets
- Human Ratings of Natural Language Generation Outputs. Novikova, J. (Creator), Dusek, O. (Creator), Cercas Curry, A. (Creator) & Rieser, V. (Creator), Heriot-Watt University, 2017. https://github.com/jeknov/EMNLP_17_submission