Speech recognition and synthesis

Speech recognition and synthesis are core technologies in the construction of a fully functional Companion. Speech is the most natural way for human beings to communicate and receive information, and is far easier to use than keyboards or other computer input devices.

Speech recognition has been very successful in restricted domains, especially where the vocabulary is limited (e.g. train booking services on the telephone) or where there is enough in-domain text data to train a language model (e.g. dictation of medical reports). Within such restricted domains accuracy can be very high, but outside these limits it can fall substantially. The performance of speech technology is also easily degraded by surrounding noise, the quality of the microphone and the accent of the speaker.

To summarize, the main issue for current speech recognition systems is robustness: robustness to the variety of topics discussed on the language modeling side, and robustness to different acoustic conditions on the acoustic modeling side. Our research within the project is therefore aimed in this direction. Since our expertise is mostly in the processing of the Czech language, our experiments are conducted mostly on Czech data.

"Within Companions, our objective is to integrate the components of the dialogue system, user models and knowledge representation so as to improve speech recognition accuracy"

Speech in the Companions Project

The robustness of the language models could be improved most efficiently by collecting a vast amount of in-domain text for training. However, given the inherently unconstrained nature of the conversations that users will have with a Companion (especially the Senior Companion), it is hard to tell what the general domain actually is, or whether there is one at all. We have nevertheless recorded over 50 hours of Wizard of Oz interviews and plan to investigate the properties of this corpus and its benefits for recognition accuracy.

Given the limited amount of 'in-domain' data, we further aim to improve the robustness of the language models by incorporating additional information that can be derived from the available corpora, such as detailed morphological tags and semantic classes.

On the acoustic modeling side, the main effort is directed towards developing a novel speaker adaptation technique that will be completely 'hidden' from the users, i.e. it will not require the usual 'training session' with each new user. Such training is usually annoying and requires the speaker's close cooperation. The proposed method will build the signal transformation matrices gradually as the amount of input speech (recognized with a sufficient confidence score) increases and, as such, should improve system performance continuously over a longer period.
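The idea of adapting only on confidently recognized speech can be illustrated with a minimal sketch. It is not the project's method: instead of full transformation matrices it estimates a much simpler per-dimension scale and bias by least squares. The class name, the thresholds and the notion of 'targets' (the model means that the recognizer aligned to each frame) are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


class IncrementalAdapter:
    """Accumulates statistics from confidently recognized speech only.

    Hypothetical simplification: a diagonal affine transform (scale, bias)
    per feature dimension rather than a full transformation matrix.
    """

    def __init__(self, dim: int, min_confidence: float = 0.8,
                 min_frames: int = 50) -> None:
        self.dim = dim
        self.min_confidence = min_confidence  # assumed confidence threshold
        self.min_frames = min_frames          # assumed amount of data needed
        self.n = 0                            # number of frames accumulated
        self.sx = [0.0] * dim                 # sum of observed features
        self.sy = [0.0] * dim                 # sum of target (model) values
        self.sxx = [0.0] * dim                # sum of squared observations
        self.sxy = [0.0] * dim                # sum of observation * target

    def observe(self, frames: Sequence[Frame], targets: Sequence[Frame],
                confidence: float) -> bool:
        """Add one utterance; low-confidence utterances are ignored."""
        if confidence < self.min_confidence:
            return False
        for x, y in zip(frames, targets):
            for d in range(self.dim):
                self.sx[d] += x[d]
                self.sy[d] += y[d]
                self.sxx[d] += x[d] * x[d]
                self.sxy[d] += x[d] * y[d]
            self.n += 1
        return True

    def transform(self) -> List[Tuple[float, float]]:
        """Return (scale, bias) per dimension; identity until enough data."""
        if self.n < self.min_frames:
            return [(1.0, 0.0)] * self.dim
        params = []
        for d in range(self.dim):
            mean_x = self.sx[d] / self.n
            mean_y = self.sy[d] / self.n
            var_x = self.sxx[d] / self.n - mean_x * mean_x
            cov_xy = self.sxy[d] / self.n - mean_x * mean_y
            scale = cov_xy / var_x if var_x > 1e-12 else 1.0
            params.append((scale, mean_y - scale * mean_x))
        return params

    def apply(self, frame: Frame) -> List[float]:
        """Map one observed frame towards the model's feature space."""
        return [a * x + b for (a, b), x in zip(self.transform(), frame)]
```

Because only running sums are stored, the transform can be re-estimated after every accepted utterance without keeping the audio, which matches the 'hidden', continuous character of the adaptation described above.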

The issue of speech errors

Some implementations of the Companions agents over the Internet will be based on written dialogue and will therefore not be dependent on the performance of speech recognition systems. Others, in particular for handheld or mobile systems, are likely to be implemented mostly as spoken dialogue systems, and their overall performance is likely to be affected by speech recognition accuracy.

It has been established that speech recognition (SR) accuracy tends to degrade from controlled pronunciation (e.g. during dictation) to natural conversations (Oviatt 2000), and this could affect Companions, despite the progress of state-of-the-art automatic speech recognition (ASR) performance. However, as demonstrated since the late 1990s by France Telecom R&D laboratories with the Artimis system (Sadek 1999), the overall performance of a dialogue system can be far superior to that of its speech recognition layer.

It is now accepted that, for systems that aim at utterance understanding, Word Error Rate (WER) is not the most appropriate metric. Boros et al. (1996) introduced the notion of 'concept accuracy' to describe the semantic impact of SR errors. However, this measures the impact on the processing of an isolated utterance and, as such, does not constitute a proper dialogue metric.
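The difference between the two measures can be made concrete with a short sketch. WER is computed here in the usual way, as the word-level edit distance divided by the reference length. For concept accuracy the sketch assumes an analogous computation over semantic units, represented as hypothetical attribute=value strings; the function names and the example utterance are illustrative only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the length of the reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def word_error_rate(ref_text: str, hyp_text: str) -> float:
    return error_rate(ref_text.split(), hyp_text.split())


def concept_accuracy(ref_concepts: Sequence[str],
                     hyp_concepts: Sequence[str]) -> float:
    """Assumed definition: 1 minus the error rate over semantic units."""
    return 1.0 - error_rate(ref_concepts, hyp_concepts)


# Two words are lost by the recognizer, yet both concepts survive.
wer = word_error_rate("i would like a ticket to prague please",
                      "i like ticket to prague please")
ca = concept_accuracy(["request=ticket", "destination=prague"],
                      ["request=ticket", "destination=prague"])
```

In this example a quarter of the reference words are misrecognized while every semantic unit is still recovered, which is why WER alone understates how well the utterance was understood.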

Glass et al. (2000) have introduced metrics aiming at characterising dialogue performance. Query Density (QD) measures how effectively the user can transmit information to the system by quantifying the number of new concepts introduced per user query. In the present context it could be relevant to extend this metric from its original information-seeking dialogue formulation to multiple dialogue genres. Concept Efficiency (CE) is a measure of understanding through dialogue, in the form of the average number of dialogue turns required for each concept to be understood by the system.
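As a rough illustration, both metrics can be computed from per-dialogue counts. The formulation below is an assumption rather than a quotation of the original definitions: QD is taken as the number of unique concepts understood divided by the number of user queries, and CE as the number of unique concepts understood divided by the total number of concept mentions, each averaged over dialogues. The `DialogueCounts` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DialogueCounts:
    """Hypothetical per-dialogue counts taken from an annotated log."""
    user_queries: int          # number of user queries in the dialogue
    concept_mentions: int      # concepts expressed, repetitions included
    concepts_understood: int   # unique concepts the system understood


def query_density(dialogues: Sequence[DialogueCounts]) -> float:
    """Mean number of newly understood concepts per user query."""
    ratios = [d.concepts_understood / d.user_queries for d in dialogues]
    return sum(ratios) / len(ratios)


def concept_efficiency(dialogues: Sequence[DialogueCounts]) -> float:
    """Mean share of concept mentions that led to understanding.

    Under this assumed formulation, the reciprocal gives the average
    number of times a concept had to be expressed before it was understood.
    """
    ratios = [d.concepts_understood / d.concept_mentions for d in dialogues]
    return sum(ratios) / len(ratios)


logs = [
    DialogueCounts(user_queries=4, concept_mentions=8, concepts_understood=6),
    DialogueCounts(user_queries=2, concept_mentions=4, concepts_understood=2),
]
qd = query_density(logs)
ce = concept_efficiency(logs)
```

Extending the metrics beyond information-seeking dialogue would mainly require deciding what counts as a 'concept' in each genre; the arithmetic itself would stay the same.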

Ultimately, however, in Companions the issue of speech recognition accuracy has to be considered in the integrated context of an Embodied Conversational Agent (ECA) rather than through ASR benchmarks alone.

ECA influence on speech patterns

Some features specific to ECA-based dialogues, as compared with other dialogue interfaces, should be considered here (Oviatt and Adams 2000).

The first, originally described by Julia and Cheyer (1999), is the potential influence of the ECA's appearance and behaviour on the users' speech patterns. Another is the potential impact of recognition and understanding errors on the user-ECA relationship. Fischer and Batliner (2000) have classified recognition errors in dialogue systems according to their emotional impact on the user.

Several strategies that relate affective dialogue to SR accuracy will be explored as part of Companions:

  • Internal measures of SR confidence would be used to adapt the ECA dialogue strategy, specifically by detecting user irritation and avoiding behaviours that could upset the user.
  • SR errors would trigger an adapted response in terms of ECA animation, politeness strategies, and appropriate, careful use of humour.
  • A 'level of understanding' would be defined, forming a continuum from accurate understanding of the utterance meaning to simple affective categorisation of that utterance.
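A minimal sketch of how these strategies might fit together is given below. It is purely illustrative: the three understanding levels, the response names, the thresholds and the error counter are all assumptions, and a real system would treat the level of understanding as a continuum rather than three bands.

```python
from enum import Enum


class Understanding(Enum):
    FULL = "accurate utterance meaning"
    PARTIAL = "some concepts only"
    AFFECTIVE = "affective category only"


class Response(Enum):
    ANSWER = "respond to the content"
    CONFIRM = "politely check what was understood"
    EMPATHISE = "respond to the emotional tone only"
    APOLOGISE = "apologise and change approach"


def level_of_understanding(confidence: float, high: float = 0.8,
                           low: float = 0.4) -> Understanding:
    """Map an SR confidence score in [0, 1] to a coarse level (assumed bands)."""
    if confidence >= high:
        return Understanding.FULL
    if confidence >= low:
        return Understanding.PARTIAL
    return Understanding.AFFECTIVE


def choose_response(confidence: float, consecutive_errors: int,
                    max_errors: int = 2) -> Response:
    """Pick an ECA response; repeated errors are taken as a sign of irritation."""
    if consecutive_errors >= max_errors:
        # Avoid asking the user to repeat yet again.
        return Response.APOLOGISE
    level = level_of_understanding(confidence)
    if level is Understanding.FULL:
        return Response.ANSWER
    if level is Understanding.PARTIAL:
        return Response.CONFIRM
    return Response.EMPATHISE
```

The point of the design is that a low confidence score does not have to end in a request to repeat: the agent can fall back to a response that only depends on the affective category of the utterance.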

Speech synthesis: emotion and multimodality

Among the research issues crucial to Companions, the 'emotional' aspects of speech production have received growing interest over the past few years. Within this area prosody control is a particularly active topic, because the prosody of most state-of-the-art synthesizers currently falls short of reproducing the quality of variation required for emotional speech.

The input to a synthesizer is vitally important for achieving better prosody control: raw text cannot specify the appropriate paralinguistic interpretation of its semantic content. Annotated input would allow a finer specification of speaking style and of the intended interpretation of a message.
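One way to picture annotated input is a markup language such as SSML, a W3C standard whose `prosody` element carries pitch and rate hints. The sketch below is only an illustration: the text above does not say which annotation scheme the project uses, the mapping from speaking styles to prosody values is invented, and the output is a bare fragment without the namespace declarations a complete SSML document would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a speaking style to SSML prosody attributes.
STYLES = {
    "cheerful": {"pitch": "+15%", "rate": "fast"},
    "sad": {"pitch": "-10%", "rate": "slow"},
}


def annotate(text: str, style: str) -> str:
    """Wrap raw text in an SSML fragment that carries the intended style."""
    if style not in STYLES:
        raise ValueError(f"unknown style: {style}")
    speak = ET.Element("speak", {"version": "1.0"})
    prosody = ET.SubElement(speak, "prosody", STYLES[style])
    prosody.text = text
    return ET.tostring(speak, encoding="unicode")


fragment = annotate("That is wonderful news!", "cheerful")
```

The same sentence passed with the style "sad" would yield different prosody attributes, which is exactly the information that raw text cannot convey.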

Multimodal speech synthesis is another growing area. The integration of voice output, gesture and facial expression reflects the fact that speech accompanied by visual information provides a richer and more robust way of communicating, particularly in noisy environments and when young people or the elderly are involved.

The issue of basic units for speech synthesis is an old one, but many researchers expect it to come to the fore once again. The granularity of the unit used for selection is a crucial aspect, and an advance in this area could bring a dramatic improvement in synthesized voice quality, perhaps the improvement needed to address demanding, emerging application fields such as entertainment, customer care, robots, education, and home and car automation.

If the next generation of speech synthesizers is to be used in unobtrusive conversational interaction with human interlocutors, they will need to express moods and attitudes, and will make more use of 'fillers' such as laughs, coughs and filled pauses (see e.g. Hamza et al. 2004, and Zovato et al. 2004 for an alternative approach).

Updated: 12 December 2008 15:56 PM