Automated metrics that agree with human judgements on generated output for an embodied conversational agent

Mary Ellen Foster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

When evaluating a generation system, if a corpus of target outputs is available, a common and simple strategy is to compare the system output against the corpus contents. However, cross-validation metrics that test whether the system makes exactly the same choices as the corpus on each item have recently been shown not to correlate well with human judgements of quality. An alternative evaluation strategy is to compute intrinsic, task-specific properties of the generated output; this requires more domain-specific metrics, but can often produce a better assessment of the output. In this paper, a range of metrics using both of these techniques are used to evaluate three methods for selecting the facial displays of an embodied conversational agent, and the predictions of the metrics are compared with human judgements of the same generated output. The corpus-reproduction metrics show no relationship with the human judgements, while the intrinsic metrics that capture the number and variety of facial displays show a significant correlation with the preferences of the human users.
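As a rough illustration of the kind of comparison the abstract describes, the sketch below computes a rank correlation between an automated metric's scores and mean human ratings over the same generated outputs. The data values and the choice of Spearman correlation are assumptions for illustration only; they are not taken from the paper.

```python
# Illustrative sketch (not from the paper): correlating an automated metric's
# per-output scores with mean human ratings of the same outputs.
from scipy.stats import spearmanr

# Hypothetical scores from an intrinsic metric (e.g. the number of distinct
# facial displays in each generated output).
metric_scores = [3, 7, 5, 9, 2, 6]

# Hypothetical mean human preference ratings for the same outputs.
human_ratings = [2.1, 4.5, 3.8, 4.9, 1.7, 4.0]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```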
Original language: English
Title of host publication: Proceedings of The 5th International Natural Language Generation Conference (INLG 2008)
Pages: 95-103
Publication status: Published - 2008
