• Anita Gelsinger
  • 21.04.2013
  • 05.08.2013

For a voice-controlled interactive TV system, the decisive components for a natural voice dialogue are the acoustic front-end and the command recognition. The acoustic front-end removes disturbing signal components like loudspeaker echoes from the microphone signals and uses a power threshold to detect speech segments. The segments are then fed into a speech recognizer, which tries to match the segment to a command according to a previously defined dictionary and grammar. For example, the command could be "[please] set [the] volume to 10", where "please" and "the" are words that are recognized by the speech recognition engine, but defined as optional in the grammar.
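
The two steps described above can be illustrated with a minimal sketch. It is not the actual front-end or recognizer: the function names `detect_segments` and `match_volume_command`, the per-frame power values, and the threshold are hypothetical. The grammar is also approximated by a regular expression over already transcribed text, whereas a real speech recognizer works on the audio itself.

```python
import re


def detect_segments(frame_powers, threshold):
    """Return (start, end) frame index pairs whose power is at or above
    the threshold. The end index is exclusive."""
    segments = []
    start = None
    for i, power in enumerate(frame_powers):
        if power >= threshold and start is None:
            start = i  # a speech segment begins
        elif power < threshold and start is not None:
            segments.append((start, i))  # the segment ends
            start = None
    if start is not None:
        # the signal ended while a segment was still open
        segments.append((start, len(frame_powers)))
    return segments


# Grammar "[please] set [the] volume to <number>":
# "please" and "the" are optional, the number is captured.
VOLUME_COMMAND = re.compile(r"^(?:please )?set (?:the )?volume to (\d+)$")


def match_volume_command(text):
    """Return the requested volume as an int, or None if the text does
    not fit the grammar. Assumes single spaces between words."""
    match = VOLUME_COMMAND.match(text.strip().lower())
    return int(match.group(1)) if match else None
```

For instance, `match_volume_command("Please set the volume to 10")` and `match_volume_command("set volume to 10")` both fit the grammar, while `match_volume_command("turn it up")` returns `None`.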

While this approach achieves a high recognition rate for correctly spoken commands and tolerates minor deviations from the defined grammar, it cannot detect when the user says something that is not intended as a command, e.g., while talking to another person, or speaks a command that the system does not anticipate. In both cases, the recognizer still tries to match the recorded sentence to a command and performs an action that the user probably did not intend.

The aim of this thesis is the investigation of methods to prevent the system from reacting to these "out-of-grammar" utterances. One possibility is keyword recognition, i.e., requiring the user to speak a prefix before the command ("TV, increase volume"), so that commands can be differentiated from other speech. Another possibility is the detection of out-of-grammar utterances, either with a "garbage model" or by evaluating the confidence score of the speech recognition result. Ideally, both methods could be combined.
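
How the keyword prefix and the confidence score could be combined is sketched below. This is only an illustration under stated assumptions: the `RecognitionResult` type, the `accept_command` function, the keyword "tv", and the threshold of 0.6 are all hypothetical, and the sketch assumes that the recognizer returns a transcript together with a confidence value between 0 and 1. A garbage model is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecognitionResult:
    text: str          # transcript returned by the recognizer (assumed)
    confidence: float  # assumed to lie between 0 and 1


KEYWORD = "tv"         # hypothetical keyword prefix
MIN_CONFIDENCE = 0.6   # hypothetical threshold, would need tuning


def accept_command(result: RecognitionResult,
                   keyword: str = KEYWORD,
                   min_confidence: float = MIN_CONFIDENCE) -> Optional[str]:
    """Return the command without its keyword prefix, or None if the
    utterance should be ignored."""
    words = result.text.lower().replace(",", " ").split()
    # Keyword check: the utterance must start with the prefix.
    if not words or words[0] != keyword:
        return None
    # Confidence check: reject results the recognizer is unsure about.
    if result.confidence < min_confidence:
        return None
    command = " ".join(words[1:])
    return command or None  # the keyword alone is not a command
```

With these assumptions, `accept_command(RecognitionResult("TV, increase volume", 0.9))` yields `"increase volume"`, whereas the same utterance with a confidence of 0.3, or an utterance without the prefix, is ignored.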

The work consists of researching various alternatives, investigating the feasibility of implementation with our speech recognition engine, and possibly implementing promising concepts in our demonstration system.

The thesis can be written in German or English.