Speech recognition and synthesis
Speech recognition and synthesis are core technologies in the construction of a fully functional Companion. Speech is the most natural way for human beings to communicate and receive information, and is far easier to use than keyboards or other computer input devices.
Speech recognition has been very successful in restricted domains, especially where there is a limited vocabulary (e.g. train booking services on the telephone) or enough in-domain text data to train a language model (e.g. dictation of medical reports). Within such restricted domains its accuracy can be very high, but outside these limits accuracy can fall substantially. The performance of speech technology is also easily degraded by surrounding noise, the quality of the microphone and the accent of the speaker.
To summarize, the main issue for current speech recognition systems is robustness: robustness to the variety of topics discussed on the language modeling side, and robustness to different acoustic conditions on the acoustic modeling side. Our research within the project is therefore aimed in this direction. Since our expertise is mostly in the processing of the Czech language, our experiments are conducted mostly on Czech data.
"Within Companions, our objective is to integrate the components of the dialogue system, user models and knowledge representation so as to improve speech recognition accuracy"
Speech in the Companions Project
The robustness of the language models could be improved most efficiently by collecting a vast amount of in-domain text for training. However, given the inherently unconstrained nature of the conversations that users will have with a Companion (especially the Senior Companion), it is hard to tell what the general domain actually is (if there is one at all). We have nevertheless recorded over 50 hours of Wizard of Oz interviews and plan to investigate the properties of this corpus and its benefits for recognition accuracy.
Given the limited amount of 'in-domain' data, we further aim to improve robustness of the language models by incorporating additional information that could be derived from the available corpora, such as detailed morphological tags and semantic classes.
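One common way of using semantic classes when in-domain text is scarce is a class-based bigram model, in which the probability of a word is factored as P(class | previous class) x P(word | class). The following is a minimal sketch of that general idea, not the language model used in the project; the function names, the toy class map and the "UNK" fallback class are all assumptions made for illustration.

```python
from collections import Counter


def train_class_bigram(sentences, word_to_class):
    """Collect relative-frequency counts for a class-based bigram model."""
    class_bigrams = Counter()  # (previous class, class) counts
    class_context = Counter()  # previous-class counts
    word_in_class = Counter()  # (class, word) counts
    class_total = Counter()    # class counts
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        for word in sentence:
            cls = word_to_class.get(word, "UNK")  # hypothetical fallback class
            class_bigrams[(prev, cls)] += 1
            class_context[prev] += 1
            word_in_class[(cls, word)] += 1
            class_total[cls] += 1
            prev = cls
    return class_bigrams, class_context, word_in_class, class_total


def class_bigram_prob(model, prev_word, word, word_to_class):
    """P(word | prev_word) ~= P(class | previous class) * P(word | class)."""
    class_bigrams, class_context, word_in_class, class_total = model
    prev_cls = "<s>" if prev_word is None else word_to_class.get(prev_word, "UNK")
    cls = word_to_class.get(word, "UNK")
    if class_context[prev_cls] == 0 or class_total[cls] == 0:
        return 0.0  # no smoothing in this sketch
    p_class = class_bigrams[(prev_cls, cls)] / class_context[prev_cls]
    p_word = word_in_class[(cls, word)] / class_total[cls]
    return p_class * p_word


# Toy data: the word pair "to brno" never occurs, but PREP -> CITY does.
classes = {"to": "PREP", "from": "PREP", "prague": "CITY", "brno": "CITY"}
model = train_class_bigram([["to", "prague"], ["from", "brno"]], classes)
print(class_bigram_prob(model, "to", "brno", classes))  # 0.5
```

A word-level bigram trained on the same two sentences would assign zero probability to "to brno"; sharing statistics across a class is what lets a small corpus cover more word combinations. A real system would add smoothing and derive the classes from the morphological or semantic annotation.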
On the acoustic modeling side, the main effort is directed towards developing a novel speaker adaptation technique that will be completely 'hidden' from the users, i.e. it will not require the usual 'training session' with each new user. Such training is usually annoying and requires the speaker's close cooperation. The proposed method will build the signal transformation matrices gradually as the amount of input speech (recognized with a sufficient confidence score) increases and, as such, should improve system performance continuously over a longer period.
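The incremental, confidence-gated idea can be illustrated with a deliberately simplified sketch. Instead of full transformation matrices, it estimates a per-dimension scale and bias (a diagonal affine transform) by least squares, mapping observed feature vectors towards the model means they were aligned to. The class name, the 0.8 confidence threshold and the diagonal simplification are assumptions of this sketch and do not describe the project's actual method.

```python
class IncrementalDiagonalAdapter:
    """Confidence-gated, incrementally re-estimated transform: mu ~= a * x + b."""

    def __init__(self, dim, min_confidence=0.8):
        self.dim = dim
        self.min_confidence = min_confidence  # hypothetical acceptance threshold
        self.n = 0                            # number of accumulated frames
        self.sum_x = [0.0] * dim
        self.sum_mu = [0.0] * dim
        self.sum_xx = [0.0] * dim
        self.sum_xmu = [0.0] * dim

    def accumulate(self, frames, confidence):
        """Add one utterance; frames is a list of (observed, model_mean) vectors."""
        if confidence < self.min_confidence:
            return False  # unreliable recognition: do not adapt on it
        for x, mu in frames:
            self.n += 1
            for d in range(self.dim):
                self.sum_x[d] += x[d]
                self.sum_mu[d] += mu[d]
                self.sum_xx[d] += x[d] * x[d]
                self.sum_xmu[d] += x[d] * mu[d]
        return True

    def transform(self):
        """Return (scale, bias); stays at identity until the data supports a fit."""
        scale = [1.0] * self.dim
        bias = [0.0] * self.dim
        for d in range(self.dim):
            denom = self.n * self.sum_xx[d] - self.sum_x[d] ** 2
            if self.n < 2 or abs(denom) < 1e-12:
                continue
            scale[d] = (self.n * self.sum_xmu[d]
                        - self.sum_x[d] * self.sum_mu[d]) / denom
            bias[d] = (self.sum_mu[d] - scale[d] * self.sum_x[d]) / self.n
        return scale, bias

    def apply(self, x):
        """Transform one observed feature vector with the current estimate."""
        scale, bias = self.transform()
        return [scale[d] * x[d] + bias[d] for d in range(self.dim)]
```

Because only running sums are stored, every confidently recognised utterance refines the estimate without any explicit enrolment step, and low-confidence utterances are ignored rather than allowed to corrupt the transform.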
- Legát, M., Grüber, M. and Ircing, P. (2008) Wizard of Oz Data Collection for the Czech Senior Companion Dialogue System. Fourth International Workshop on Human-Computer Conversation, Bellagio, Italy.
The issue of speech errors
Some implementations of the Companions agents over the Internet will be based on written dialogue and will therefore not be dependent on the performance of speech recognition systems. Others, in particular for handheld or mobile systems, are likely to be implemented mostly as spoken dialogue systems, and their overall performance is likely to be affected by speech recognition accuracy.
It has been established that speech recognition (SR) accuracy tends to degrade from controlled pronunciation (e.g. during dictation) to natural conversation (Oviatt 2000), and this could affect Companions despite the progress in state-of-the-art automatic speech recognition (ASR) performance. However, as demonstrated since the late 1990s by France Telecom R&D laboratories with the Artimis system (Sadek 1999), the overall performance of a dialogue system can be far superior to that of its speech recognition layer.
It is now accepted that, for systems that aim at utterance understanding, Word Error Rate (WER) is not the most appropriate metric. Boros et al. (1996) introduced the notion of 'concept accuracy' to describe the semantic impact of SR errors. However, this measures the impact on the processing of an isolated utterance and as such does not constitute a proper dialogue metric.
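The difference between the two measures can be made concrete with a small sketch. It assumes that concept accuracy is computed like word accuracy, but over semantic units (attribute-value pairs here) rather than words; the concept annotations below are written by hand for illustration and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """(substitutions + insertions + deletions) / reference length."""
    return edit_distance(ref, hyp) / len(ref)


ref_words = "i want a ticket to prague on monday".split()
ref_concepts = [("destination", "prague"), ("day", "monday")]

# Hypothesis A: a function word is misrecognised, the meaning is intact.
hyp_a_words = "i want the ticket to prague on monday".split()
hyp_a_concepts = [("destination", "prague"), ("day", "monday")]

# Hypothesis B: a content word is misrecognised, the meaning changes.
hyp_b_words = "i want a ticket to prague on sunday".split()
hyp_b_concepts = [("destination", "prague"), ("day", "sunday")]

for words, concepts in [(hyp_a_words, hyp_a_concepts),
                        (hyp_b_words, hyp_b_concepts)]:
    wer = error_rate(ref_words, words)
    concept_accuracy = 1.0 - error_rate(ref_concepts, concepts)
    print(wer, concept_accuracy)
# Both hypotheses have WER 0.125; concept accuracy is 1.0 for A and 0.5 for B.
```

Two hypotheses with identical WER can therefore differ sharply in what the system actually understood, which is why WER alone says little about utterance understanding.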
Glass et al. (2000) have introduced metrics aiming at characterising dialogue performance. Query Density (QD) measures how effectively the user can transmit information to the system by quantifying the number of new concepts introduced per user query. In the present context it could be relevant to extend this metric from its original information-seeking dialogue formulation to multiple dialogue genres. Concept Efficiency (CE) is a measure of understanding through dialogue, in the form of the average number of dialogue turns required for each concept to be understood by the system.
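A minimal sketch of how both metrics could be computed from annotated dialogue logs. It assumes that QD is the mean, over dialogues, of unique concepts understood per user query, and that CE is the mean, over dialogues, of unique concepts understood divided by all concept mentions, so that every repetition forced by a misunderstanding lowers the score. These formulas and the field names are assumptions of this sketch and should be checked against Glass et al. (2000) before use.

```python
def query_density(dialogues):
    """Mean number of newly understood concepts per user query."""
    ratios = [d["unique_understood"] / d["user_queries"] for d in dialogues]
    return sum(ratios) / len(ratios)


def concept_efficiency(dialogues):
    """Mean share of concept mentions that were not repetitions."""
    ratios = [d["unique_understood"] / d["concept_mentions"] for d in dialogues]
    return sum(ratios) / len(ratios)


# Hypothetical annotated logs, one record per dialogue.
logs = [
    {"user_queries": 4, "unique_understood": 6, "concept_mentions": 8},
    {"user_queries": 5, "unique_understood": 5, "concept_mentions": 10},
]
print(query_density(logs))       # 1.25
print(concept_efficiency(logs))  # 0.625
```

Under this formulation, a CE of 0.625 means that users had to mention each concept 1.6 times on average before the system understood it.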
Ultimately, however, in Companions the issue of speech recognition accuracy has to be considered in the integrated context of an embodied conversational agent (ECA) rather than through ASR benchmarks alone.
- Oviatt, S.L. (2000) Taming Speech Recognition Errors Within a Multimodal Interface. Communications of the ACM 43:45-51 (special issue on Conversational Interfaces).
- Sadek, D. (1999) Design considerations on dialogue systems: from theory to technology - the case of Artimis. Proceedings of the ESCA Workshop on Interactive Dialogue in Multimodal Systems (Kloster Irsee: Germany), pp. 173-187.
- Boros, M., Wieland, E., Gallwitz, F., Gorz, G., Hanrieder, G. and Niemann, H. (1996) Toward understanding spontaneous speech: Word accuracy vs. semantic accuracy. Proceedings of the International Conference on Spoken Language Processing (ICSLP) 1996, pp. 1005-1008.
- Glass, J., Polifroni, J., Seneff, S. and Zue, V. (2000) Data collection and performance evaluation of spoken dialogue systems: The MIT experience. Proceedings of the International Conference on Spoken Language Processing (ICSLP) 2000, Beijing, China.
ECA influence on speech patterns
There are some specificities of ECA-based dialogues with respect to other dialogue interfaces which should be considered here (Oviatt and Adams 2000).
The first, originally described by Julia and Cheyer (1999), is the potential influence of the ECA's appearance and behaviour on users' speech patterns. Another is the potential impact of recognition and understanding errors on the user-ECA relationship. Fischer and Batliner (2000) have classified recognition errors in dialogue systems according to their emotional impact on the user.
Several strategies will be explored as part of Companions, which will relate affective dialogue to SR accuracy:
- Internal measures of SR confidence would be used to adapt the ECA dialogue strategy, specifically by detecting user irritation and avoiding behaviours that could upset the user.
- SR errors would trigger an adapted response in terms of ECA animation, politeness strategies, and appropriate, careful use of humour.
- Definition of a 'level of understanding' which constitutes a continuum from the accurate understanding of the utterance meaning, to the simple affective categorisation of that utterance.
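The strategies above can be sketched as a simple decision rule that maps SR confidence and recent error history onto a point on the 'level of understanding' continuum. The three levels, the thresholds and the function name are hypothetical choices made for illustration; a real system would tune or learn them.

```python
from enum import Enum


class Understanding(Enum):
    FULL = "act on the recognised meaning"
    CONFIRM = "ask a polite confirmation question"
    AFFECT_ONLY = "respond to the affective category of the utterance only"


def choose_level(sr_confidence, recent_errors,
                 accept=0.8, confirm=0.5, max_errors=2):
    """Pick a level of understanding from confidence and recent error count."""
    if sr_confidence >= accept:
        return Understanding.FULL
    # Repeated confirmation requests are a likely source of irritation,
    # so stop asking once the user has already hit several errors.
    if sr_confidence >= confirm and recent_errors < max_errors:
        return Understanding.CONFIRM
    return Understanding.AFFECT_ONLY


print(choose_level(0.9, recent_errors=0).name)  # FULL
print(choose_level(0.6, recent_errors=0).name)  # CONFIRM
print(choose_level(0.6, recent_errors=3).name)  # AFFECT_ONLY
```

The fallback level keeps the conversation going on affective grounds when the content cannot be trusted, rather than exposing the recognition failure to the user yet again.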
- Oviatt, S.L. and Adams, B. (2000) Designing and Evaluating Conversational Interfaces with Animated Characters. In: Cassell, Justine et al. (eds) Embodied Conversational Agents (MIT Press: Cambridge), pp. 319-343.
- Julia, L. and Cheyer, A. (1999) Is Talking to Virtual more Realistic? Proceedings of EuroSpeech'99, Budapest, Hungary.
- Fischer, K. and Batliner, A. (2000) What Makes Speakers Angry in Human-Computer Conversation. Proceedings of the Third Workshop on Human-Computer Conversation, Bellagio, Italy, 3-5 July 2000.
Speech synthesis: emotion and multimodality
Among the crucial research issues relevant for Companions, the 'emotional' aspects of speech production have received growing interest over the past few years. Within this area, prosody control is an active topic, because the prosody of most state-of-the-art synthesizers currently falls short of reproducing the quality of variation required for emotional speech.
The input to the synthesizer is of vital importance for better prosody control: raw text cannot specify the appropriate paralinguistic interpretation of semantic content. Annotated input to a synthesizer would allow a finer specification of speaking style and of the intended interpretation of a message.
Multimodality of speech synthesis is another rapidly growing area. The integration of voice output, gesture, and facial expression reflects the fact that speech accompanied by visual information provides a more robust and rich way of communicating, particularly in noisy environments, and when young people or the elderly are involved.
The issue of basic units for speech synthesis is an old one, but many researchers postulate that it will once again come to the fore, because the granularity of the unit used for selection is a crucial factor. An advance in this research area could provide a dramatic improvement in synthesized voice quality (perhaps the improvement needed to approach demanding, emerging application fields such as entertainment, customer care, robots, education, and home and car automation).
If the next generation of speech synthesizers is to be used in unobtrusive conversational interaction with human interlocutors, there will be a need to express moods and attitudes, and more use will be made of 'fillers' such as laughs, coughs and filled pauses (e.g. see Hamza et al. 2004, and Zovato et al. 2004 for an alternative approach).
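As an illustration of what annotated input can look like, the sketch below wraps raw text in W3C Speech Synthesis Markup Language (SSML) using its prosody and emphasis elements. SSML is used here only as a widely known example of such annotation; the text above does not name a markup scheme, and the helper function and its defaults are hypothetical.

```python
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def annotate(text, pitch="default", rate="default", emphasis=None):
    """Wrap raw text in SSML that states how it should be spoken."""
    body = escape(text)  # protect &, < and > in the raw text
    if emphasis:
        body = '<emphasis level="{}">{}</emphasis>'.format(emphasis, body)
    return ('<speak version="1.1" xmlns="{}" xml:lang="en-GB">'
            '<prosody pitch="{}" rate="{}">{}</prosody>'
            '</speak>').format(SSML_NS, pitch, rate, body)


# The same sentence, annotated for two different speaking styles.
print(annotate("I am so glad to see you!", pitch="high", rate="fast",
               emphasis="strong"))
print(annotate("I am so glad to see you!", pitch="low", rate="slow"))
```

The words are identical in both calls; only the annotation tells the synthesizer whether the sentence is meant enthusiastically or flatly, which is exactly the information that raw text cannot carry.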
- Hamza, W. et al. (2004) The IBM expressive speech synthesis system. Proceedings of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, 14-16 June 2004.
- Zovato, E., Pacchiotti, A., Quazza, S. and Sandri, S. (2004) Towards emotional speech synthesis: a rule based approach. Proceedings of the 5th ISCA Speech Synthesis Workshop, Pittsburgh, 14-16 June 2004.
Updated: 12 December 2008 15:56 PM


