Automated metrics that agree with human judgements on generated output for an embodied conversational agent

Mary Ellen Foster

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

When evaluating a generation system, if a corpus of target outputs is available, a common and simple strategy is to compare the system output against the corpus contents. However, cross-validation metrics that test whether the system makes exactly the same choices as the corpus on each item have recently been shown not to correlate well with human judgements of quality. An alternative evaluation strategy is to compute intrinsic, task-specific properties of the generated output; this requires more domain-specific metrics, but can often produce a better assessment of the output. In this paper, a range of metrics using both of these techniques are used to evaluate three methods for selecting the facial displays of an embodied conversational agent, and the predictions of the metrics are compared with human judgements of the same generated output. The corpus-reproduction metrics show no relationship with the human judgements, while the intrinsic metrics that capture the number and variety of facial displays show a significant correlation with the preferences of the human users.
Original language: English
Title of host publication: Proceedings of The 5th International Natural Language Generation Conference (INLG 2008)
Pages: 95-103
Publication status: Published - 2008

Cite this

Foster, M. E. (2008). Automated metrics that agree with human judgements on generated output for an embodied conversational agent. In Proceedings of The 5th International Natural Language Generation Conference (INLG 2008) (pp. 95-103).