Human Ratings of Natural Language Generation Outputs

  • Jekaterina Novikova (Creator)
  • Ondrej Dusek (Creator)
  • Amanda Cercas Curry (Creator)
  • Verena Rieser (Creator)



This dataset comprises Likert-scale human ratings for texts produced by three recent data-driven natural language generation (NLG) systems over three different datasets, as provided to us by the systems’ authors. We collected three or more ratings for the informativeness, naturalness, and overall quality of each NLG-produced text, given the source meaning representation. The ratings were used to evaluate current automatic metrics for NLG and to motivate the development of new, improved ones.
Date made available: 2017
Publisher: Heriot-Watt University
Date of data production: 2017

Research Output

  • 1 Conference contribution

Why We Need New Evaluation Metrics for NLG

Novikova, J., Dusek, O., Cercas Curry, A. & Rieser, V., 10 Sep 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, p. 2231-2242 12 p.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access

Cite this

Novikova, J. (Creator), Dusek, O. (Creator), Cercas Curry, A. (Creator), Rieser, V. (Creator) (2017). Human Ratings of Natural Language Generation Outputs. Heriot-Watt University. NLG_human_ratings(.zip).