Microsoft Speech SDK

SAPI 5.1

Chapter 2

CoffeeS0

Introduction

This first example presents a basic speech recognition (SR) application. It lays the foundation, and the examples that follow add progressively more complex features.

The example will focus on setting the foundation of speech-enabled applications in general. The following topics will be discussed:

Initialization: Setting up engines and grammars

Events: Definition of events, notifications, interests, expanding events

Phrases: Definition of phrases, grammar rules, accepting phrases

Running the Example

Run the application by executing CoffeeS0.exe. A window appears and along the top you are greeted: Welcome to the SAPI coffee shop. Speak for service! Since this is a simple example, your only option is to say, “Please go to the counter.” Variations are also accepted such as, “Please go to counter,” or “enter counter.” For this example, even “go to shop” and “enter the store” are recognized for reasons examined later. Speak clearly to activate the voice recognizer and the screen will change to “Please order when ready!”

Although the application is limited in functionality (you cannot actually order coffee in this coffee shop), it does demonstrate the fundamentals of speech recognition. That is, commands are spoken, recognized, and executed. Look at the Coffee code in Coffee.cpp. It looks very much like a simple Windows application because, in fact, that is what it is. The application brings up one window and draws two strings to it. To keep the code to a minimum, it does not support keyboard operations, a mouse, or a menu. To exit the application, click the Close button in the upper right of the window.

The speech API commands are interspersed with the Windows API calls. Chances are, if you do not recognize a command as belonging to the Windows API, it does not. It belongs either to SAPI 5 or to the structure needed to process speech. For instance, there is a new message, WM_RECOEVENT, which the application defines. It is not a part of SAPI, but the application structure needs this message to process speech. In the same manner, the application defines several routines of its own (InitSAPI(), CleanupSAPI(), ProcessRecoEvent(), and ExecuteCommand()). Inside these Coffee-defined functions are the SAPI 5 method calls. There are also a few #include directives at the top of the file that bring in the SAPI 5-specific headers.

Header Files

Before starting out with any coding, the proper headers need to be introduced.

// Project-wide header, described below
#include "stdafx.h"

// Contains definitions of SAPI functions
#include <sphelper.h>

// Contains common defines
#include "common.h"

// Forward declarations and constants
#include "coffee.h"

// This header is created by our grammar compiler and has the rule ids
#include "cofgram.h"

Of the five headers, two are application-specific to the Coffee series. Coffee.h contains the function prototypes and global variables for the application and common.h lists #defines for the application’s windows and other features. As the samples change and expand, these two files are updated. However, this has no effect on the speech aspects of the program. A third file, stdafx.h, is related to COM programming and it is maintained by the compiler. This file can also change and have no impact on speech-related issues.

The other two files are speech related. Sphelper.h is provided by SAPI 5 and is the header file for the helper functions. For the programmer’s convenience, each helper function consolidates a series of SAPI API methods into a single call. You are not required to use helper functions for SAPI programming, although doing so may simplify programming. The Coffee samples use helper functions whenever possible. Finally, cofgram.h is generated by the grammar compiler and contains the rule IDs. A grammar is a list of words available to the application. Grammars and the grammar compiler will be discussed in later chapters.

For SAPI 5 to work, it needs to be initialized. This is done in four basic steps with each step depending on the success of the previous one. You should always check the return values just as in Windows programming.

Initialization

Step one: COM

First, initialize COM making sure it is present and active. Use the COM command CoInitialize() and later CoUninitialize(). To ensure COM is active throughout the application’s session, these commands are usually nested around the main message loop.

// Only continue if COM is successfully initialized
if ( SUCCEEDED( CoInitialize( NULL ) ) )
{
	// Main message loop:
	while (GetMessage(&msg, NULL, 0, 0))
	{

		//Code here
	}

	CoUninitialize();
}

Step two: Recognizer Object

Second, create the recognizer object. This object provides access to the recognition engine.

// Global definition
CComPtr<ISpRecognizer> g_cpEngine;

// Create a recognition engine
hr = g_cpEngine.CoCreateInstance(CLSID_SpSharedRecognizer);
if ( FAILED( hr ) )
{
	// Leave application
}

A single instance of a recognizer object is created. There are two options for setting up this object: shared and in-process (InProc). Use the following class identifiers (CLSIDs) to set up the instance:

CLSID_SpSharedRecognizer	Creates a shared resource instance
CLSID_SpInprocRecognizer	Creates an InProc or non-shared resource instance.

Shared Instance

Shared instances allow resources such as recognition engines, audio input (microphones), and output devices to be used by several applications at the same time. This is the preferred option for most desktop applications. It is common for a desktop system to have only one microphone and by using shared instances, different applications such as a browser, word processor, and a game can use the microphone. Any application using a shared instance will start the SAPI server process. This is an executable program running in the background. It delivers events to the owning application.

InProc Instance

InProc instances, however, allow one and only one application to control the resources. This includes the microphone and speech recognition engine. Using an InProc instance is very restrictive and you should use it only in special circumstances. For example, you would use InProc if you wanted the entire microphone input to be channeled through one application. Telephony applications are a good example of the need to dedicate a microphone or audio input source to a single application.

Step three: Recognition Context

Third, create a recognition context for the engine.

//Global Definition
CComPtr<ISpRecoContext> g_cpRecoCtxt;

// create the command recognition context
hr = g_cpEngine->CreateRecoContext(&g_cpRecoCtxt );
if ( FAILED( hr ) )
{
	// Leave application
}

A context is any single area of the application that needs to process speech. A simple case (like CoffeeS0) assigns the entire application to one recognition context: no matter where you are in the application, the same procedure handles all speech events and messages. Alternatively, each part of the application may have its own context. Individual windows, dialog boxes, menu bars, or even menu items (such as Open or Print) may each have one, and events or messages generated from these areas are processed by their own procedures.

This is similar to the way individual windows process events and messages in standard Win32 applications, where each window is assigned a window procedure that handles all of its events and messages. Each recognition context is likewise assigned a procedure, which gives you greater control over the program and over the handling of speech events. Contexts can be created dynamically when they are needed and destroyed afterward, or created once and retained throughout the application’s life. Coffee is a simple example and uses only one context.

ISpRecoContext is an important interface and will be the primary means for recognition. From the interface, the application can load and unload grammars as well as get and respond to events.

Step four: Loading Grammars and Rules

The last major part of the startup sequence is loading the grammar. A grammar specifies what the speech recognizer will recognize.

// Global definition
CComPtr<ISpRecoGrammar> g_cpCmdGrammar;

// Load our grammar, which is bound into this executable as a
// user-defined ("SRGRAMMAR") resource type.
hr = g_cpRecoCtxt->CreateGrammar(GRAMMARID1, &g_cpCmdGrammar);
if ( FAILED( hr ) )
{
	// Leave application
}

hr = g_cpCmdGrammar->LoadCmdFromResource(
	NULL,
	MAKEINTRESOURCEW(IDR_CMD_CFG),
	L"SRGRAMMAR",
	MAKELANGID( LANG_NEUTRAL, SUBLANG_NEUTRAL),
	TRUE);
if ( FAILED( hr ) )
{
	// Leave application
}

Essentially there are two types of grammars. One is for dictation and the other is for command and control. The dictation grammar is a more free-form approach to speech. You are able to draw on a very large portion of the body of words for the language. Command and control uses a much more limited list of words. For the Coffee examples, you need only the words used to move around the store and to order drinks. It makes no sense for Coffee to know about the word “opisthognathous” so why even attempt to find it? Besides, you are going to sip your coffee anyway.

The Coffee list is a pregenerated set of commands stored as a resource internal to the application. Coffee.xml holds this list in a human-readable format: the grammar is written in Extensible Markup Language (XML), using a format defined by SAPI. You need to compile this file into a binary version so SAPI 5 can use it. You can do this ahead of time or on the fly. Here it has been precompiled so that the grammar is delivered inside the application. The SAPI 5.1 SDK includes a grammar compiler called GramComp in its tool suite. Grammars are covered in more detail in the CoffeeS1 example.

ISpRecoContext, as mentioned in Step three, creates the grammar. Once it is made, you populate the grammar with words from the command list. Use ::LoadCmdFromResource, since the list is stored as an application resource. You could also load it from other sources, including an external file, memory, or an existing object. After you have retrieved the grammar, you need to activate its rules. As a convenience, the XML itself activates the initial set of rules; specifically, the TOPLEVEL="ACTIVE" attribute in coffee.xml does this. The following is an example of setting the rule state explicitly from the application:

// Set rules to active, we are now listening for commands
hr = g_cpCmdGrammar->SetRuleState( NULL, NULL, SPRS_ACTIVE );

The method explicitly sets the rules it encounters to active. Because the two NULL parameter values do not single out any particular rule, all top-level rules are activated. You can also deactivate rules using this method. If the call fails, the application posts a message giving the most likely reason.
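The four initialization steps share one pattern: each call returns a result code, and the next step runs only if the previous one succeeded. The following is a minimal, portable sketch of that pattern and is not code from the Coffee sample. The names Hresult, Step, failed, and runStepsInOrder are hypothetical, and the constants only mimic the Windows convention that negative HRESULT values mean failure, so that the example builds without the Windows headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for the Windows HRESULT type (assumption: 32-bit signed,
// negative values mean failure, as tested by the FAILED() macro).
using Hresult = std::int32_t;
constexpr Hresult kOk = 0;      // mimics S_OK
constexpr Hresult kFail = -1;   // any negative value counts as failure

inline bool failed(Hresult hr) { return hr < 0; }

// One initialization step, e.g. "create recognizer" or "load grammar".
using Step = std::function<Hresult()>;

// Runs the steps in order and stops at the first failure.
// Returns the index of the failing step, or -1 if every step succeeded.
int runStepsInOrder(const std::vector<Step>& steps)
{
    for (std::size_t i = 0; i < steps.size(); ++i)
    {
        if (failed(steps[i]()))
        {
            return static_cast<int>(i);   // later steps are never attempted
        }
    }
    return -1;
}
```

In a real application each step would wrap one of the calls shown in steps one to four, and a non-negative return value would tell you which stage to report before leaving the application.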

Events

CoffeeS0 is now an application that can take speech input. It processes speech in the background. When SAPI has information, it passes it back to the application by notifying you that an event has happened. In short, an event is a condition of special interest to SAPI. Examples include when a sound is first detected on the microphone (SPEI_SOUND_START), when it ends (SPEI_SOUND_END), or when a word recognition completes successfully (SPEI_RECOGNITION). SAPI maintains several types of events; the enumerated type SPEVENTENUM contains the complete list. Two important concepts tie events into the application: notifications and interests.

Notifications

A notification indicates that a SAPI event has occurred and the application might want to react. It does not relay exactly what happened – for that you will have to dig a little deeper.

To react to notifications, the application has to associate them with specific procedures. There are several ways to do this. The ISpNotifySource interface offers four such methods: SetNotifyCallbackFunction, SetNotifyCallbackInterface, SetNotifyWin32Event, and SetNotifyWindowMessage. An additional one, ISpNotifySink::Notify, provides a generic method allowing for special or unusual conditions. You can use any or all of these methods depending on your needs. For instance, it might be easier for an application to handle a notification by directly calling a function (::SetNotifyCallbackFunction) that brings up a new dialog box or logs the activity in a file. Three of the methods, ::SetNotifyCallbackFunction, ::SetNotifyCallbackInterface, and ::SetNotifyWindowMessage, require a message loop and therefore may only be used by Windows applications. You can use the other two, ::SetNotifyWin32Event and ISpNotifySink::Notify, without a message loop to provide additional flexibility.

CoffeeS0 receives the notification through a window procedure. Since SAPI messages are not system-level messages, you have to tell the application explicitly about them.

hr = g_cpRecoCtxt->SetNotifyWindowMessage( hWnd, WM_RECOEVENT, 0, 0 );
if ( FAILED( hr ) )
{
	// Leave application
}

This method associates a message to a specific window. Afterward, any events SAPI passes back will be received by the application in the singular form of a WM_RECOEVENT message, and then sent to the window pointed to by hWnd. You have the option of directing the wParam and lParam window parameters as well. Because CoffeeS0 is not concerned about them here, the application has set them to zero or NULL.

Interests

An interest is a flag allowing or restricting the kind of events SAPI passes back. By default, SAPI sends all events back to the application. So far that is more than 30 different kinds of events. You cannot be concerned about all of them. In reality, CoffeeS0 truly cares about one type: the successful word recognition or SPEI_RECOGNITION event. You can tell SAPI to pass back only this one event. To filter these events, use ::SetInterest.

hr = g_cpRecoCtxt->SetInterest( SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION) );
if ( FAILED( hr ) )
{
	// Leave application
}

This sets the interest to just one message, SPEI_RECOGNITION. That is, only a successful recognition event generates a notification. SAPI will not notify the application on any other event.

You can define multiple interests by combining them with the bitwise OR operator (|). SetInterest takes two values. The first parameter lists the events that generate notifications; that is, it defines all the events you are, or could be, concerned with for the time being. The second parameter lists the events to be queued so that the application can handle them in due time. In this tutorial, you are interested in every occurrence of SPEI_RECOGNITION, even if they come so quickly that the application cannot handle them at one time. Often, these two parameters will be identical but they do not have to be. SPFEI() is a helper macro that converts an enumerated event into its ULONGLONG flag value.
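To make the flag arithmetic concrete, here is a small portable sketch of how an interest mask is assembled. It is not the SAPI header itself: EventOrdinal, eventFlag, and interestMask are hypothetical names, and the numeric values are recalled from the SAPI 5.1 sapi.h header, so treat them as assumptions and check them against your own SDK before relying on them. The sketch assumes that SPFEI() sets one bit per event plus two reserved "flag check" bits.

```cpp
#include <cstdint>
#include <initializer_list>

// Event ordinals as recalled from sapi.h (assumption: verify in your SDK).
enum EventOrdinal : unsigned
{
    kReserved1   = 30,  // SPEI_RESERVED1
    kReserved2   = 33,  // SPEI_RESERVED2
    kSoundStart  = 35,  // SPEI_SOUND_START
    kSoundEnd    = 36,  // SPEI_SOUND_END
    kRecognition = 38   // SPEI_RECOGNITION
};

// The two reserved bits that the SPFEI macro adds to every flag.
constexpr std::uint64_t kFlagCheck =
    (std::uint64_t{1} << kReserved1) | (std::uint64_t{1} << kReserved2);

// Portable equivalent of SPFEI(): one bit for the event plus the check bits.
constexpr std::uint64_t eventFlag(EventOrdinal id)
{
    return (std::uint64_t{1} << id) | kFlagCheck;
}

// Combines several events into one mask with bitwise OR.
inline std::uint64_t interestMask(std::initializer_list<EventOrdinal> ids)
{
    std::uint64_t mask = 0;
    for (EventOrdinal id : ids)
    {
        mask |= eventFlag(id);
    }
    return mask;
}
```

Under these assumptions, the CoffeeS0 call corresponds to passing interestMask({kRecognition}) for both parameters. Combining flags with exclusive OR instead would cancel the shared reserved bits whenever an even number of flags is combined, which is one reason to use bitwise OR.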

Is the application now fully functional? Almost. The application can initialize SAPI, accept speech from a microphone, attempt to recognize it, and send an event back to the application if a phrase is matched. You still need to handle the message in the window procedure so that the application can respond to the event. Because WM_RECOEVENT is already defined and known by the application, it can be handled like any other message. The following code fragment is from the main window procedure, WndProc:

// This is our application defined window message to let us know that a
// speech recognition event has occurred.

case WM_RECOEVENT:
	ProcessRecoEvent( hWnd );
	break;

Each time SAPI is ready to return a word, it sends out an event. The application receives the notification as a WM_RECOEVENT message. The main message loop picks it up and the window procedure passes it to ProcessRecoEvent() for processing.

For speech recognition, this event is handled slightly differently from the way Windows handles its own events. Normally, Windows sends a specific message for each event, so you do not have to determine whether it was a mouse or keyboard event; the message provides this information. SAPI, on the other hand, does not. After receiving the notification, you still need to find out the exact nature of the event as well as the number of events waiting in the queue.

For this reason, SAPI introduces its own event description system. See the following code for an example:

void ProcessRecoEvent( HWND hWnd )
{
	// Event helper class
	CSpEvent event;

	// Loop processing events while there are any in the queue
	while (event.GetFrom(g_cpRecoCtxt) == S_OK)
	{
		// Look at recognition event only
		switch (event.eEventId)
		{
			case SPEI_RECOGNITION:
				ExecuteCommand(event.RecoResult(), hWnd);
				break;
		}
	}
}

Three items are needed to fully describe the event. The first item is CSpEvent. This helper class derives from the SPEVENT structure and adds several useful methods. One method, ::GetFrom, does two things at once: it retrieves the next event from the queue and loads the corresponding information into the SPEVENT structure, making it ready for your inspection.

The second item is to determine which SAPI event actually took place. At this point you only have to look at the member eEventId. If it is an event you are not interested in, skip it and keep looking or waiting for other events.

The last item is to match the event with your needs. In this example you are interested only in SPEI_RECOGNITION. If there is a match, you are closer to your goal of finding out which word was spoken. The switch statement handles that for you.
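The drain-the-queue pattern described above can be modeled without SAPI. The sketch below is a simplified stand-in, not the SAPI implementation: EventKind, QueuedEvent, getNextEvent, and processPendingEvents are hypothetical names, with getNextEvent playing the role that CSpEvent::GetFrom plays in the sample. It shows why a single notification is answered with a loop, since several events may be waiting.

```cpp
#include <queue>
#include <vector>

// Hypothetical, simplified stand-ins for SAPI's event ids.
enum class EventKind { SoundStart, SoundEnd, Recognition };

struct QueuedEvent
{
    EventKind kind;
    int       ruleId;   // only meaningful for Recognition events
};

// Plays the role of CSpEvent::GetFrom: pops the next event, if any.
bool getNextEvent(std::queue<QueuedEvent>& pending, QueuedEvent& out)
{
    if (pending.empty())
    {
        return false;
    }
    out = pending.front();
    pending.pop();
    return true;
}

// Drains the whole queue in response to a single notification and
// returns the rule ids of the recognition events that were found.
std::vector<int> processPendingEvents(std::queue<QueuedEvent>& pending)
{
    std::vector<int> recognizedRules;
    QueuedEvent event{};
    while (getNextEvent(pending, event))
    {
        switch (event.kind)
        {
            case EventKind::Recognition:
                recognizedRules.push_back(event.ruleId);
                break;
            default:
                break;   // not an event this application cares about
        }
    }
    return recognizedRules;
}
```

With the interest set to recognition only, as in CoffeeS0, the default branch would rarely be reached; it is kept here to show how unwanted events are skipped.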

Phrases

Determining the actual phrase is the last part of the process. Having initialized SAPI, received a notice that SAPI has identified a word, and had the application process the message, you now need to isolate the word.

SAPI returns the word information in a list or a series of lists. These lists contain not only the word, but also additional information about the word, words or the entire phrase. Examine the following code:

void ExecuteCommand(ISpPhrase *pPhrase, HWND hWnd)
{
	SPPHRASE *pElements;

	// Get the phrase elements, one of which is the rule id we specified in
	// the grammar. Switch on it to figure out which command was recognized.
	if (SUCCEEDED(pPhrase->GetPhrase(&pElements)))
	{ 
		switch ( pElements->Rule.ulId )
		{
			case VID_Navigation:
			{
				switch( pElements->pProperties->vValue.ulVal )
				{
					case VID_Counter:
						PostMessage( hWnd, WM_GOTOCOUNTER, NULL, NULL );
						break;

				}
			}
			break;
		}
		// Free the pElements memory which was allocated for us
		::CoTaskMemFree(pElements);
	}
}

Without knowing too much about phrase structures, it is clear that the code drills down into one. This function takes an ISpPhrase interface, extracts the phrase structure (in the form of pElements), and determines which rule has been invoked. Other examples will go one step further and get the exact words spoken. Phrases and rules will be discussed in greater detail in the CoffeeS2 example.

To understand rules better, look at coffee.xml. Notice that there are two rules defined. RULE ID tags delineate both. The main rule is VID_Navigation. The second one is VID_Place, but this is of lesser importance because it is subservient to VID_Navigation. In essence, coffee.xml defines two sets of phrases. The first set uses the commands “Enter” and “Go To,” and the second set uses the words “counter,” “shop,” and “store.” The recognizer mixes and matches words, selecting one from the first set and one from the second set. Therefore, sentences such as “enter store” and “go to counter” invoke a SAPI rule. There are also optional words that may be used or not used. “Please enter the store” is not only more polite but it is also a valid match. However, the recognizer ignores “please” and “the.” This way, you can speak more naturally and at the same time not encumber SAPI.

However, this example does not go quite that far. The code stops at the rule level in the switch statement with “case VID_Counter.” Once you have said, “Please enter the store,” SAPI invokes the appropriate rule (VID_Navigation). That is why you could have said “counter” or “shop” and still have gone to the same place. In contrast, had you said, “Please enter the restaurant,” no rule would have been invoked because no definition includes “restaurant.” In later examples, the exact word will be of more interest.
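The mix-and-match behavior of the two rules can be imitated with a few lines of ordinary code. The sketch below is only a model of what the grammar accepts, written from the description above rather than from coffee.xml; it is not how the SAPI recognizer works internally. The names toWords and matchNavigation and the value kPlaceCounter are hypothetical, and the sketch assumes that all three place words lead to the counter, as the text describes for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical values; the real ids come from the generated cofgram.h.
const int kPlaceCounter = 1;
const int kNoMatch = 0;

// Splits an utterance into lowercase words, dropping punctuation.
std::vector<std::string> toWords(const std::string& utterance)
{
    std::string cleaned;
    for (unsigned char ch : utterance)
    {
        cleaned += std::isalpha(ch) ? static_cast<char>(std::tolower(ch)) : ' ';
    }
    std::istringstream stream(cleaned);
    std::vector<std::string> words;
    std::string word;
    while (stream >> word)
    {
        words.push_back(word);
    }
    return words;
}

// Matches: [please] (enter | go to) [the] (counter | shop | store)
// Returns kPlaceCounter for any of the three places, else kNoMatch.
int matchNavigation(const std::string& utterance)
{
    const std::vector<std::string> w = toWords(utterance);
    std::size_t i = 0;
    auto at = [&](const char* expected) { return i < w.size() && w[i] == expected; };

    if (at("please")) ++i;                     // optional word
    if (at("enter")) { ++i; }
    else if (at("go")) { ++i; if (!at("to")) return kNoMatch; ++i; }
    else return kNoMatch;
    if (at("the")) ++i;                        // optional word
    if (!(at("counter") || at("shop") || at("store"))) return kNoMatch;
    ++i;
    return i == w.size() ? kPlaceCounter : kNoMatch;
}
```

In this model, "Please go to the counter" and "enter store" both yield kPlaceCounter, while "Please enter the restaurant" yields kNoMatch, mirroring the behavior described for CoffeeS0.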

At this point, a rule has been invoked and caught by the application. Now you can process it as you see fit. CoffeeS0 is interested only in providing you with textual feedback, so it posts a message (using PostMessage()) back to the owning window with instructions to change the text. “Please order when ready!” appears on the screen.

Conclusions

Hopefully you are not overwhelmed at this point. SAPI 5, like all other systems, requires a certain amount of overhead programming. The intent was to minimize this overhead so that you can concentrate on the function of the application rather than working with the SAPI code. Most of the code introduced here is initialization and should only happen once during the application launch. With that done, you are free to add additional features.
