Cross modal evaluation of high quality emotional speech synthesis with the virtual human toolkit

Blaise Potard, Matthew P. Aylett, David A. Baude

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Emotional expression is a key requirement for intelligent virtual agents. In order for an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range achievable by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion, in terms of both valence and activation. The reported naturalness is good: 3.54 on a 5-point MOS scale.
Original language: English
Title of host publication: Intelligent Virtual Agents. IVA 2016
Publisher: Springer
Pages: 190-197
Number of pages: 8
ISBN (Electronic): 9783319476650
ISBN (Print): 9783319476643
DOIs
Publication status: Published - 19 Oct 2016
Event: 16th International Conference on Intelligent Virtual Agents 2016 - Los Angeles, United States
Duration: 20 Sept 2016 - 23 Sept 2016

Conference

Conference: 16th International Conference on Intelligent Virtual Agents 2016
Abbreviated title: IVA 2016
Country/Territory: United States
City: Los Angeles
Period: 20/09/16 - 23/09/16
