ReproHum #0927-03: DExpert Evaluation? Reproducing Human Judgements of the Fluency of Generated Text

Tanvi Dinkar, Gavin Abercrombie, Verena Rieser*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



ReproHum is a large multi-institution project designed to examine the reproducibility of human evaluations of natural language processing. As part of the second phase of the project, we attempt to reproduce an evaluation of the fluency of continuations generated by a pre-trained language model compared to a range of baselines. Working within the constraints of the project, with limited information about the original study, and without access to its participant pool or the responses of individual participants, we find that we are not able to reproduce the original results. Our participants display a greater tendency to prefer one of the system responses, avoiding a judgement of ‘equal fluency’ more often than in the original study. We also conduct further evaluations, eliciting ratings (1) from a broader range of participants; (2) from the same participants at different times; and (3) with an altered definition of fluency. Results of these experiments suggest that the original evaluation collected too few ratings, and that the task formulation may be quite ambiguous. Overall, although we were able to conduct a re-evaluation study, we conclude that the original evaluation was not comprehensive enough to make truly meaningful comparisons.
Original language: English
Title of host publication: Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Editors: Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
Publisher: ELRA Language Resources Association
Number of pages: 8
ISBN (Print): 9782493814418
Publication status: Published - 21 May 2024
Event: 4th Workshop on Human Evaluation of NLP Systems 2024 - Torino, Italy
Duration: 21 May 2024 → …


Conference: 4th Workshop on Human Evaluation of NLP Systems 2024
Abbreviated title: HumEval 2024
Period: 21/05/24 → …


Keywords

  • Evaluation
  • Fluency
  • NLG
  • Reproducibility

ASJC Scopus subject areas

  • Education
  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language


