This document is intended to help developers of speech recognition (SR) applications use the Microsoft Speech API's (SAPI) speech recognition and audio APIs to connect a wav file to an SR engine. It covers the common audio input configurations, the characteristics of finite-length audio streams, and sample code for C++/COM and Visual Basic.
SR applications use several different audio input configurations, including:
The shared desktop microphone scenario uses the default SR engine and the default audio input. The user selects each in Speech properties in Control Panel, and each is hosted in the shared speech server.
The telephony scenario can use either the SAPI 5 standard multimedia audio input object or a custom audio object combined with an InProc SR engine.
The wav file input scenario is special because it requires controlled, reproducible audio input and a dedicated SR engine, free from interference by other applications (as can happen with a shared desktop microphone). The file input scenario should use a generic SAPI audio stream connected to the input wav file, together with an InProc SR engine.
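Because the file input scenario depends on the stream format matching the contents of the file, it can help to inspect the wav header before binding the stream. The following is a minimal sketch that does not use SAPI at all: `PcmFormat`, `ParseWavFormat`, and the little-endian readers are hypothetical names introduced here for illustration, and the code assumes a standard RIFF/WAVE layout whose `fmt ` chunk is at least 16 bytes long.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical container for the fields of interest in a wav "fmt " chunk.
struct PcmFormat {
    std::uint16_t formatTag;      // 1 means uncompressed PCM
    std::uint16_t channels;       // 1 = mono, 2 = stereo
    std::uint32_t samplesPerSec;  // e.g. 22050
    std::uint16_t bitsPerSample;  // e.g. 16
};

// wav files store integers in little-endian byte order.
static std::uint16_t ReadU16(const std::uint8_t* p)
{
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

static std::uint32_t ReadU32(const std::uint8_t* p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Walks the RIFF chunk list and fills 'out' from the "fmt " chunk.
// Returns false if the buffer is not a well-formed wav header.
bool ParseWavFormat(const std::vector<std::uint8_t>& bytes, PcmFormat& out)
{
    if (bytes.size() < 12 ||
        std::memcmp(bytes.data(), "RIFF", 4) != 0 ||
        std::memcmp(bytes.data() + 8, "WAVE", 4) != 0) {
        return false;
    }
    std::size_t pos = 12;  // first chunk follows the 12-byte RIFF header
    while (pos + 8 <= bytes.size()) {
        const std::uint8_t* chunk = bytes.data() + pos;
        const std::uint32_t size = ReadU32(chunk + 4);
        if (size > bytes.size() - pos - 8) {
            return false;  // chunk claims more data than the buffer holds
        }
        if (std::memcmp(chunk, "fmt ", 4) == 0) {
            if (size < 16) {
                return false;
            }
            const std::uint8_t* f = chunk + 8;
            out.formatTag = ReadU16(f);
            out.channels = ReadU16(f + 2);
            out.samplesPerSec = ReadU32(f + 4);
            out.bitsPerSample = ReadU16(f + 14);
            return true;
        }
        pos += 8 + size + (size & 1);  // chunks are padded to an even size
    }
    return false;
}
```

An application could compare the parsed values against the format it intends to pass to the stream (22,050 Hz, 16-bit, stereo in the sample below) and report a mismatch early.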
Unlike microphone input, which has no predetermined stream length, a wav file is a finite-length audio input stream: its length is known before recognition begins. Applications that use microphone input also toggle between listening and not-listening states until the application is closed, whereas transcription applications are typically designed to listen to one continuous audio stream and then close when the stream ends. Consequently, the application must explicitly handle the end-of-stream event (SPEI_SR_END_STREAM for C/C++, the ISpeechRecoContext::EndStream event for Automation). A transcription application can receive multiple recognitions from a single audio stream if the speaker pauses between sections of audio. If the application exits after the first recognition event is received, it will miss any recognizable audio that remains.
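Since the stream length is known before recognition begins, an application can estimate how long the audio lasts, for example to choose a wait timeout that exceeds the stream length. The sketch below shows the arithmetic for uncompressed PCM audio. `PcmDurationMs` and `SuggestedTimeoutMs` are hypothetical helpers, not SAPI functions, and the 5-second default margin is an arbitrary assumption.

```cpp
#include <cstdint>

// Hypothetical helper: playback length in milliseconds of uncompressed PCM
// audio, given the size of its sample data in bytes.
inline std::uint64_t PcmDurationMs(std::uint64_t dataBytes,
                                   std::uint32_t samplesPerSec,
                                   std::uint16_t channels,
                                   std::uint16_t bitsPerSample)
{
    // Bytes consumed per second: sample rate * channels * bytes per sample.
    const std::uint64_t bytesPerSec =
        static_cast<std::uint64_t>(samplesPerSec) * channels * (bitsPerSample / 8u);
    if (bytesPerSec == 0) {
        return 0;  // invalid format; avoid dividing by zero
    }
    return (dataBytes * 1000u) / bytesPerSec;
}

// Hypothetical policy: wait for the whole stream plus a fixed safety margin.
inline std::uint64_t SuggestedTimeoutMs(std::uint64_t durationMs,
                                        std::uint64_t marginMs = 5000)
{
    return durationMs + marginMs;
}
```

For 22,050 Hz, 16-bit stereo audio, each second occupies 22050 * 2 * 2 = 88,200 bytes, so 882,000 bytes of sample data corresponds to 10,000 ms. The engine may process a file faster or slower than real time, so a value derived this way is only a starting point.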
Microphone input and networked audio streams are typically real-time audio objects. This means that the audio object is designed to support audio buffering and dynamic state manipulation (e.g. stop->play->pause->play->stop) to handle delays and latency in the audio source and/or the SR engine's processing.
The following C++/COM sample connects a wav file to an InProc SR engine and processes events until the end of the stream is reached:
{
    HRESULT hr = S_OK;
    CComPtr<ISpStream> cpInputStream;
    CComPtr<ISpRecognizer> cpRecognizer;
    CComPtr<ISpRecoContext> cpRecoContext;
    CComPtr<ISpRecoGrammar> cpRecoGrammar;

    // Create basic SAPI stream object
    // NOTE: The helper SpBindToFile can be used to perform the following operations
    hr = cpInputStream.CoCreateInstance(CLSID_SpStream);
    // Check hr

    CSpStreamFormat sInputFormat;
    // generate WaveFormatEx structure, assuming the wav format is 22kHz, 16-bit, Stereo
    hr = sInputFormat.AssignFormat(SPSF_22kHz16BitStereo);
    // Check hr

    // set up stream object with wav file MY_WAVE_AUDIO_FILENAME
    // for read-only access, since it will only be accessed by the SR engine
    // FormatId() returns a reference; BindToFile expects a pointer to the GUID
    hr = cpInputStream->BindToFile(MY_WAVE_AUDIO_FILENAME,
                                   SPFM_OPEN_READONLY,
                                   &sInputFormat.FormatId(),
                                   sInputFormat.WaveFormatExPtr(),
                                   SPFEI_ALL_EVENTS);
    // Check hr

    // Create in-process speech recognition engine
    hr = cpRecognizer.CoCreateInstance(CLSID_SpInprocRecognizer);
    // Check hr

    // connect wav input to recognizer
    // SAPI will negotiate mismatched engine/input audio formats using system audio codecs,
    // so the second parameter is not important - use default of TRUE
    hr = cpRecognizer->SetInput(cpInputStream, TRUE);
    // Check hr

    // Create recognition context to receive events
    hr = cpRecognizer->CreateRecoContext(&cpRecoContext);
    // Check hr

    // Create grammar, and load dictation
    // grammars are created from the recognition context; use grammar ID 0 for simplicity's sake
    // NOTE: Voice command apps would load CFG here
    hr = cpRecoContext->CreateGrammar(0, &cpRecoGrammar);
    // Check hr
    hr = cpRecoGrammar->LoadDictation(NULL, SPLO_STATIC);
    // Check hr

    // check for recognitions and end of stream event
    const ULONGLONG ullInterest = SPFEI(SPEI_RECOGNITION) | SPFEI(SPEI_SR_END_STREAM);
    hr = cpRecoContext->SetInterest(ullInterest, ullInterest);
    // Check hr

    // use Win32 events for command-line style application
    hr = cpRecoContext->SetNotifyWin32Event();
    // Check hr

    // activate dictation, and begin recognition
    hr = cpRecoGrammar->SetDictationState(SPRS_ACTIVE);
    // Check hr

    // while events occur, continue processing
    // timeout should be greater than the audio stream length, or a reasonable amount of time
    // expected to pass before no more recognitions are expected in an audio stream
    BOOL fEndStreamReached = FALSE;
    while (!fEndStreamReached && S_OK == cpRecoContext->WaitForNotifyEvent(MY_REASONABLE_TIMEOUT))
    {
        CSpEvent spEvent;
        // pull all queued events from the reco context's event queue
        while (!fEndStreamReached && S_OK == spEvent.GetFrom(cpRecoContext))
        {
            // Check event type
            switch (spEvent.eEventId)
            {
                // speech recognition engine recognized some audio
                case SPEI_RECOGNITION:
                    // TODO: log/report recognized text
                    break;

                // end of the wav file was reached by the speech recognition engine
                case SPEI_SR_END_STREAM:
                    fEndStreamReached = TRUE;
                    break;
            }
            // clear any event data/object references
            spEvent.Clear();
        } // END event pulling loop - break on empty event queue OR end stream
    } // END event polling loop - break on event timeout OR end stream

    // deactivate dictation
    hr = cpRecoGrammar->SetDictationState(SPRS_INACTIVE);
    // Check hr

    // unload dictation topic
    hr = cpRecoGrammar->UnloadDictation();
    // Check hr

    // close the input stream, since we're done with it
    // NOTE: releasing the smart pointer destroys the SpStream object, which closes the stream,
    // but code may want to check for errors on the ::Close operation
    hr = cpInputStream->Close();
    // Check hr
}
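The control flow of the event loop in the sample above can be exercised without SAPI by substituting a mock event queue. The sketch below is illustrative only: `MockEventId`, `MockEvent`, and `DrainUntilEndStream` are hypothetical stand-ins and do not correspond to real SAPI types. It shows why the loop keeps pulling events after the first recognition and stops only at the end-of-stream event.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-ins for SPEI_RECOGNITION and SPEI_SR_END_STREAM.
enum class MockEventId { Recognition, EndStream };

struct MockEvent {
    MockEventId id;
    std::string text;  // recognized text; empty for EndStream
};

// Pulls events from the queue until EndStream is seen or the queue is empty.
// Every recognition is appended to 'recognized'. Returns true if EndStream
// was reached, mirroring fEndStreamReached in the sample above.
bool DrainUntilEndStream(std::deque<MockEvent>& queue,
                         std::vector<std::string>& recognized)
{
    bool endStreamReached = false;
    while (!endStreamReached && !queue.empty()) {
        const MockEvent ev = queue.front();
        queue.pop_front();
        switch (ev.id) {
        case MockEventId::Recognition:
            recognized.push_back(ev.text);  // keep going; more may follow
            break;
        case MockEventId::EndStream:
            endStreamReached = true;        // only now is it safe to stop
            break;
        }
    }
    return endStreamReached;
}
```

With a queue that holds two recognitions followed by an end-of-stream event, both phrases are collected before the function returns true. A loop that returned after the first recognition would lose the second phrase.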
The following Visual Basic (Automation) sample performs the same steps:
Option Explicit

Dim WithEvents RecoContext As ISpeechRecoContext   ' context for receiving SR events
Dim Grammar As ISpeechRecoGrammar                  ' grammar
Dim InputFile As SpeechLib.SpFileStream            ' wav audio input file stream

' Set up InProc reco context and wav audio input file
Private Sub MyForm_Load()
    ' Create new recognizer
    Dim Recognizer As New SpInprocRecognizer

    ' create input file stream
    Set InputFile = New SpFileStream
    ' Defaults to open for read-only, and DoEvents false
    InputFile.Open MY_WAVE_AUDIO_FILENAME

    ' connect wav audio input to speech recognition engine
    Set Recognizer.AudioInputStream = InputFile

    ' create recognition context
    Set RecoContext = Recognizer.CreateRecoContext

    ' create grammar
    Set Grammar = RecoContext.CreateGrammar
    ' ... and load dictation
    Grammar.DictationLoad

    ' start dictating
    Grammar.DictationSetState SGDSActive
End Sub

' Event fired on app shutdown
Private Sub MyForm_Unload(Cancel As Integer)
    InputFile.Close   ' close audio input file
End Sub

' Event fired when speech recognition engine recognizes audio
Private Sub RecoContext_Recognition(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal RecognitionType As SpeechRecognitionType, ByVal Result As ISpeechRecoResult)
    ' Log/Report recognized phrase/information
End Sub

' End of wav input stream reached by speech recognition engine
Private Sub RecoContext_EndStream(ByVal StreamNumber As Long, ByVal StreamPosition As Variant, ByVal StreamReleased As Boolean)
    ' Disable dictation and unload grammars on app close
    Grammar.DictationSetState SGDSInactive
    Grammar.DictationUnload
    Unload Me   ' shut down app on end of stream
End Sub