Human Ratings of Natural Language Generation Outputs

  • Jekaterina Novikova (Creator)
  • Ondrej Dusek (Creator)
  • Amanda Cercas Curry (Creator)
  • Verena Rieser (Creator)



This dataset comprises Likert-scale human ratings for texts produced by three recent data-driven natural language generation (NLG) systems over three different datasets, as provided to us by the systems’ authors. We collected three or more ratings per output for the informativeness, naturalness, and overall quality of the NLG-produced text, given the source meaning representation. The ratings were used to evaluate current automatic metrics for NLG and to motivate the development of new, improved ones.
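Evaluating an automatic metric against such ratings typically means correlating per-output metric scores with the aggregated human judgements. A minimal sketch of that workflow (all scores below are invented placeholders, not values from this dataset):

```python
from statistics import median

def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-ties formula:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical data, NOT taken from the dataset: three Likert ratings
# per generated text, plus a score from some automatic metric.
human_ratings = [[5, 6, 5], [2, 3, 2], [4, 4, 5], [1, 2, 1]]
metric_scores = [0.71, 0.30, 0.18, 0.55]

# Aggregate the multiple ratings per output (median is one common choice).
aggregated = [median(r) for r in human_ratings]

print(round(spearman_rho(metric_scores, aggregated), 2))  # 0.2
```

A low correlation like this is the kind of result that motivates searching for better metrics; real studies also report significance and use tie-corrected correlation implementations.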
Date made available: 2017
Publisher: Heriot-Watt University
Date of data production: 2017

Research Output

  • 1 Conference contribution

Why We Need New Evaluation Metrics for NLG

Novikova, J., Dusek, O., Cercas Curry, A. & Rieser, V., 10 Sep 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, p. 2231-2242, 12 p.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Open Access
49 Citations (Scopus)

Cite this

Novikova, J. (Creator), Dusek, O. (Creator), Cercas Curry, A. (Creator), Rieser, V. (Creator) (2017). Human Ratings of Natural Language Generation Outputs. Heriot-Watt University. NLG_human_ratings(.zip).