Abstract
Emotional expression is a key requirement for intelligent virtual agents. For an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion, both in terms of valence and activation. The naturalness reported is good: 3.54 on a 5-point MOS scale.
Original language | English |
---|---|
Title of host publication | Intelligent Virtual Agents. IVA 2016 |
Publisher | Springer |
Pages | 190-197 |
Number of pages | 8 |
ISBN (Electronic) | 9783319476650 |
ISBN (Print) | 9783319476643 |
Publication status | Published - 19 Oct 2016 |
Event | 16th International Conference on Intelligent Virtual Agents 2016, Los Angeles, United States. Duration: 20 Sept 2016 → 23 Sept 2016 |
Conference
Conference | 16th International Conference on Intelligent Virtual Agents 2016 |
---|---|
Abbreviated title | IVA 2016 |
Country/Territory | United States |
City | Los Angeles |
Period | 20/09/16 → 23/09/16 |