The BURCHAK corpus: a challenge data set for interactive learning of visually grounded word meanings

Yanchao Yu, Arash Eshghi, Gregory Mills, Oliver Lemon

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

We motivate and describe a new, freely available human-human dialogue data set for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task, where a learner needs to learn invented visual attribute words (such as “burchak” for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self- and other-correction, mid-sentence continuations, interruptions, overlaps, fillers, and hedges. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental data, which is freely available to researchers. We show that the simulations produce outputs that are similar to the original data (e.g. 78% turn match similarity). Finally, we train and evaluate a Reinforcement Learning dialogue control agent for learning visually grounded word meanings, using the BURCHAK corpus as training data. The learned policy shows comparable performance to a rule-based system built previously.
Original language: English
Title of host publication: Proceedings of the Sixth Workshop on Vision and Language
Editors: Anya Belz, Erkut Erdem, Katerina Pastra, Krystian Mikolajczyk
Publisher: Association for Computational Linguistics
Pages: 1-10
ISBN (Print): 978-1-945626-51-7
Publication status: Published - Apr 2017
Event: 6th Workshop on Vision and Language 2017 - Valencia, Spain
Duration: 4 Apr 2017 - 4 Apr 2017
Conference number: 6

Conference

Conference: 6th Workshop on Vision and Language 2017
Abbreviated title: VL'17
Country: Spain
City: Valencia
Period: 4/04/17 - 4/04/17
