Abstract
Designing a spoken dialogue system can be a time-consuming and challenging process. A developer may spend substantial time and effort anticipating the potential needs of a specific application environment and then deciding on the most appropriate system response. These decisions are encoded in a dialogue strategy, which determines the system's behaviour in a particular dialogue context.
To facilitate strategy development, recent research investigates the use of Reinforcement Learning (RL) methods applied to automatic dialogue strategy optimisation. One of the key advantages of statistical optimisation methods, such as RL, for dialogue strategy design is that the problem can be formulated as a principled mathematical model which can be automatically optimised by training on real data.
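In the standard formulation underlying such approaches (shown here in its general form, not as the specific model used in this thesis), the dialogue is cast as a Markov Decision Process and the strategy as a policy that maximises the expected cumulative reward over a dialogue, where the reward typically combines task success with dialogue costs:

```latex
% Dialogue strategy optimisation as a Markov Decision Process (standard RL view):
% s_t encodes the dialogue context, a_t is the system action, r is the reward.
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\, \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
```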
For new application domains where a system is designed from scratch, however, there is often no suitable in-domain data available. Collecting dialogue data without a working prototype is problematic, leaving the developer with a classic chicken-and-egg problem.
This thesis proposes to learn dialogue strategies by simulation-based RL, i.e. trial-and-error learning against a simulated environment which is itself learned from small amounts of Wizard-of-Oz (WOZ) data. Using WOZ data rather than data from real Human-Computer Interaction (HCI) allows us to learn optimal strategies for new application areas beyond the scope of existing dialogue systems. Optimised learned strategies are then available from the first moment of online operation, and tedious handcrafting of dialogue strategies is avoided entirely. We call this method 'bootstrapping'.
We apply this framework to optimise multimodal information-seeking dialogue strategies for an in-car MP3 music player. Dialogue Management and multimodal output generation are two closely interrelated problems for information-seeking dialogues: the decision of when to present information depends on how many pieces of information there are to present and on the available options for how to present them, and vice versa. We therefore formulate the problem as a hierarchy of joint learning decisions which are optimised together. We see this as a first step towards an integrated statistical model of Dialogue Management and output planning.
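Purely as an illustration of what such a joint decision hierarchy can look like (all state features and action names below are hypothetical, not taken from the thesis), the presentation decision can be nested under the dialogue-management decision so that 'when', 'how many' and 'how' are optimised as one joint choice:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DialogueState:
    """Hypothetical state features for an information-seeking dialogue."""
    filled_slots: int   # how many search constraints the user has given
    db_matches: int     # how many database items currently match the query
    driving: bool       # e.g. whether visual output is currently appropriate

# Nested presentation choices: each pairs an output modality with a number
# of items to present, and is only available once the system decides to present.
PRESENTATION_ACTIONS = [
    ("present_verbal_summary", 3),    # speak a short summary of up to 3 items
    ("present_multimodal_list", 10),  # additionally show a longer list on screen
]

def legal_actions(state: DialogueState):
    """Enumerate the joint actions available in a given dialogue state."""
    actions = [("ask_more_constraints", None)]
    if state.db_matches > 0:
        actions += PRESENTATION_ACTIONS
    return actions
```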
The bootstrapping framework comprises the following steps.
We first conduct a data collection in a WOZ experiment which already closely anticipates the final application.
From this data we construct a simulated learning environment, introducing appropriate methods for learning from limited data sets. We then train and test an RL-based policy by simulated interaction (sketched below).
We compare the optimised policy against a supervised baseline. This comparison allows us to measure the relative improvements over the initial strategies in the training data. We also evaluate the policies by testing with real users.
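A minimal sketch of such simulation-based training, assuming a tabular Q-learning agent and a hypothetical environment/user-simulator interface (the actual learning algorithm, state space, and reward function in this work are richer than this), could look as follows:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # hypothetical learning parameters
Q = defaultdict(float)                   # Q[(state, action)] -> estimated return

def run_episode(env, user_simulator):
    """One simulated dialogue: the policy acts, the learned user simulator reacts."""
    state = env.reset(user_simulator)
    done = False
    while not done:
        actions = env.legal_actions(state)
        if random.random() < EPSILON:                           # explore
            action = random.choice(actions)
        else:                                                   # exploit
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state, reward, done = env.step(action, user_simulator)
        best_next = max((Q[(next_state, a)] for a in env.legal_actions(next_state)),
                        default=0.0)
        # Standard Q-learning update towards the bootstrapped one-step target.
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# Training consists of many simulated dialogues, e.g.:
# for _ in range(50_000):
#     run_episode(dialogue_env, user_simulator)   # both learned from the WOZ data
```

After training, the policy that acts greedily with respect to Q can be deployed and compared against a baseline trained by Supervised Learning on the same WOZ data.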
Our results show that the RL-based policy significantly outperforms Supervised Learning. The RL-based policy gains on average 50% more reward in simulation, and 18% more reward when interacting with real users. Users also rate the RL policy on average 10% higher. Finally, we post-evaluate the framework by comparing different aspects of the three corpora gathered so far: the WOZ data, the dialogues generated from simulated interaction, and the user tests. We conclude that bootstrapping from WOZ data allows us to learn policies which are optimised for real user preferences.
| Original language | English |
| --- | --- |
| Place of Publication | Saarbrücken |
| Publisher | German Research Center for Artificial Intelligence |
| Number of pages | 304 |
| Volume | XXVIII |
| Publication status | Published - 2008 |