 Microsoft Speech SDK 
SAPI 5.1
Chapter 4
This chapter introduces two new concepts: navigation and politeness. Not only can you go to the counter (as in CoffeeS1), but the manager will also open the office to you. However, you cannot order drinks from the office. The second feature, politeness, asks the patron to repeat the order if it was not clearly understood.
The following topics will be discussed:
· Grammars: Rule activation and deactivation.
· Events: Expanding events, SPEI_FALSE_RECOGNITION.
CoffeeS0 and CoffeeS1 introduced the concept of rule activation and deactivation only in the broadest sense. In fact, those samples made two assumptions: first, that the grammar itself declared the rules active by default, and second, that the application wanted them active by default.
The command and control grammar sets the initial (or default) state of a particular rule to either active or inactive. In the XML file, the top-level rule is defined as:
<RULE ID="VID_Navigation" TOPLEVEL="ACTIVE">
If nothing changes this status, the rule remains in that state for the duration of the application’s lifespan. Setting TOPLEVEL="INACTIVE" instead makes the rule inactive (“off”) by default.
However, you will need to react to changes the user makes. In some situations, a grammar rule may no longer be appropriate. In CoffeeS1, you could order drinks from any location: the counter, the store, or the office. CoffeeS2 is more discriminating and limits drink orders to the counter. As another (and perhaps more practical) example, if you were dictating to a speech-enabled word processor, command and control words should no longer trigger special actions. If you dictated the word “menu,” it would be confusing to suddenly see a menu drop down.
In either case, you need a way to disable part of the grammar. A drastic method would be to unload the entire grammar when it is no longer required. However, in CoffeeS2 both the navigation and drink order rules are in the same file, so unloading the file would remove both rules. Instead, you can selectively activate or deactivate individual top-level rules.
In the past, Coffee simply accepted the rules as they were:
hr = g_cpCmdGrammar->SetRuleState( NULL, NULL, SPRS_ACTIVE );
Passing NULL for the rule name (the second parameter is reserved and must also be NULL), you accept all the rules in the file as they stand. CoffeeS2 initially places caffeine-seeking patrons outside the shop, so their first command has to be a navigational one. Because drinks cannot be ordered anywhere other than the counter, you must deactivate the drink-ordering rule. From InitSAPI(), after LoadCmdFromResource():
hr = g_cpCmdGrammar->SetRuleIdState( VID_EspressoDrinks, SPRS_INACTIVE );
hr = g_cpCmdGrammar->SetRuleIdState( VID_Navigation, SPRS_ACTIVE );
The second line is included for completeness. Without it, the navigation rule would still be active, because that top-level rule is declared active by default. Along the same lines, you could use the CoffeeS1 method, activating all the rules at once and then changing individual rule states. CoffeeS2 has only two rules, so either way takes the same number of lines of code.
If you activate a rule, you can also deactivate it. Deactivation is handled in the CounterPaneProc() procedure, in the WM_GOTOOFFICE message, when the patron leaves the counter. The placement of the call will be discussed in a moment. When the time comes, the rule is deactivated with:
g_cpCmdGrammar->SetRuleIdState( VID_EspressoDrinks, SPRS_INACTIVE );
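The activation pattern can be modeled outside SAPI to check how the two rules should behave as the patron moves around. The following is a minimal, self-contained illustration, not SAPI code. RuleStateModel, setRuleIdState, enterLocation, the Location values, and the numeric rule IDs are all hypothetical stand-ins. The real IDs are defined for coffee.xml, and the real call is ISpRecoGrammar::SetRuleIdState with SPRS_ACTIVE or SPRS_INACTIVE.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for the rule IDs defined for coffee.xml.
enum RuleId { VID_Navigation = 1, VID_EspressoDrinks = 2 };
enum RuleState { RULE_INACTIVE, RULE_ACTIVE };

// Minimal model of per-rule state: each top-level rule starts at the default
// declared in the grammar (TOPLEVEL="ACTIVE" or "INACTIVE") and can then be
// switched individually, the way SetRuleIdState does.
class RuleStateModel {
public:
    void declare(RuleId id, RuleState defaultState) { states_[id] = defaultState; }
    void setRuleIdState(RuleId id, RuleState s) { states_[id] = s; }
    bool isActive(RuleId id) const {
        auto it = states_.find(id);
        return it != states_.end() && it->second == RULE_ACTIVE;
    }
private:
    std::map<RuleId, RuleState> states_;
};

// Hypothetical locations mirroring the CoffeeS2 walkthrough.
enum Location { OUTSIDE, COUNTER, OFFICE };

// CoffeeS2 policy: navigation is always available, drink orders only at the counter.
void enterLocation(RuleStateModel& rules, Location where) {
    rules.setRuleIdState(VID_Navigation, RULE_ACTIVE);
    rules.setRuleIdState(VID_EspressoDrinks, where == COUNTER ? RULE_ACTIVE : RULE_INACTIVE);
}
```

Calling enterLocation(rules, COUNTER) leaves both rules active. Moving to OFFICE turns VID_EspressoDrinks off again, while navigation stays on.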
To determine which of the two locations (the counter or the office) the patron chose, CoffeeS2 has to check the recognized rule in ExecuteCommand(). The matched grammar rule holds this information. By contrast, CoffeeS1 only wants to know whether any navigation rule was invoked, not which one. To get this new information, you determine the rule almost exactly the way the VID_EspressoDrinks case determines drink orders. Given the ISpPhrase interface passed to ExecuteCommand(), call GetPhrase() to retrieve the phrase returned by SAPI.
The next step in CoffeeS2 appears different from CoffeeS1, but fundamentally it is the same. Look at the same value structure that CoffeeS1 used to determine the rule:
switch( pElements->pProperties->vValue.ulVal )
Remember that in the XML grammar file (coffee.xml), you numerically defined not only each drink characteristic but also each rule and sub-expression. For example, “Decaffeinated” is defined as a drink characteristic (VID_Decaf), but there is also the OrderList rule (VID_OrderList), which allows multiple characteristics, and the top-level navigation rule (VID_Navigation). This convenience lets you look in one location rather than drilling down into nested structures. The difference between the two switch statements in ExecuteCommand() is that case VID_EspressoDrinks goes one step further and walks a possible linked list, whereas case VID_Navigation needs to look at this information only once.
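The following simplified, SAPI-free model shows why the navigation case needs only one look while the drink case walks a list. MockProperty is a hypothetical stand-in for SPPHRASEPROPERTY and keeps only a numeric value plus the pFirstChild and pNextSibling links. The real structure stores the value in a VARIANT (vValue.ulVal). The assumed layout, with the rule ID at the top and the details as children, is illustrative, and so are the ID values and the executeCommand function. None of them is taken from the CoffeeS2 source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical IDs; the real ones are defined in coffee.xml.
enum : unsigned long {
    VID_Navigation = 1, VID_EspressoDrinks = 2,
    VID_Counter = 10, VID_Office = 11,
    VID_Decaf = 20, VID_Tall = 21
};

// Simplified stand-in for SPPHRASEPROPERTY: a value and two links.
struct MockProperty {
    unsigned long ulVal;               // numeric ID assigned in the grammar
    const MockProperty* pFirstChild;   // first nested property, or nullptr
    const MockProperty* pNextSibling;  // next property at this level, or nullptr
};

enum Destination { DEST_NONE, DEST_COUNTER, DEST_OFFICE };

struct Command {
    Destination dest = DEST_NONE;        // set by navigation commands
    std::vector<unsigned long> options;  // drink characteristics, in order
};

Command executeCommand(const MockProperty* top) {
    Command cmd;
    if (top == nullptr) return cmd;
    switch (top->ulVal) {
        case VID_Navigation: {
            // One look is enough: the first child names the destination.
            const MockProperty* where = top->pFirstChild;
            if (where != nullptr && where->ulVal == VID_Counter) cmd.dest = DEST_COUNTER;
            else if (where != nullptr && where->ulVal == VID_Office) cmd.dest = DEST_OFFICE;
            break;
        }
        case VID_EspressoDrinks:
            // Walk the sibling list, collecting every characteristic spoken.
            for (const MockProperty* p = top->pFirstChild; p != nullptr; p = p->pNextSibling)
                cmd.options.push_back(p->ulVal);
            break;
        default:
            break;
    }
    return cmd;
}
```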
Up until now you have been interested in only one event: SPEI_RECOGNITION. SAPI currently handles more than 30 different kinds of events. To better understand the process, suppose you wanted to add two new events to your repertoire: sound start (SPEI_SOUND_START) and sound end (SPEI_SOUND_END). The speech recognition engine triggers these events when the microphone starts or stops detecting sound. In the best-case scenario, it is your voice triggering these events. However, it may be another sound, such as a phone, a cough, or any one of a myriad of cacophonous noises.
It is the SPEI_SOUND_START event that initiates SAPI’s recognition attempts. Once sound is detected, SAPI starts processing the audio stream, listening and recognizing concurrently. That is, you don’t have to stop speaking before it attempts to recognize your voice. After detecting a sound start, SAPI begins processing and sends an SPEI_PHRASE_START event. During recognition, interim SPEI_HYPOTHESIS events indicate that attempts are being made. In brief, a hypothesis is the current best guess about what you have spoken up to that moment. If you say “Please go to the counter,” SAPI might send five SPEI_HYPOTHESIS events (one for each new word spoken), each accompanied by a phrase structure containing the words heard so far.
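The hypothesis sequence for “Please go to the counter” can be illustrated with a toy function that emits one growing guess per word. This is a simplification built on assumptions. hypothesesFor is hypothetical, and a real engine may revise earlier words or emit a different number of hypotheses.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Illustrative only: one hypothesis per new word, each being the current
// best guess for the whole utterance so far.
std::vector<std::string> hypothesesFor(const std::string& utterance) {
    std::vector<std::string> hyps;
    std::istringstream words(utterance);
    std::string word;
    std::string soFar;
    while (words >> word) {
        if (!soFar.empty()) soFar += ' ';
        soFar += word;
        hyps.push_back(soFar);  // one SPEI_HYPOTHESIS-like guess per word
    }
    return hyps;
}
```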
An SPEI_SOUND_END event occurs after a predetermined amount of time passes during which SAPI detects no useful sound. After enough silence, SAPI assumes you have stopped, or have paused between phrases or sentences. SAPI then finishes the recognition process and returns its final decision about what was spoken. Rather than sending one last SPEI_HYPOTHESIS, it sends one (and only one) of three events.
An SPEI_RECOGNITION event indicates that what you said matched a rule in one of the available grammars with a confidence high enough to consider the recognition successful. For example, in the CoffeeS0 application you might have said, “Please go to the counter.” This matches a grammar rule, and SAPI sends SPEI_RECOGNITION.
An SPEI_FALSE_RECOGNITION event indicates that you probably spoke words (as opposed to, say, coughing) but SAPI could not find a close enough match to any existing word or grammar rule. In CoffeeS2, for example, you could have said “Please go to the veranda,” or mumbled inaudibly, as morning coffee drinkers are prone to do. Because the former does not match an existing rule and the latter yields no recognizable word, SAPI sends SPEI_FALSE_RECOGNITION.
An SPEI_RECO_OTHER_CONTEXT event indicates that a successful recognition was made but that another application currently running claimed it. This event is useful when several applications share the recognizer at the same time. For example, suppose you had said “Please go to the veranda.” CoffeeS2 has no rule covering this, but suppose another application did. That application, even if it is not currently the active one, receives an SPEI_RECOGNITION event, and CoffeeS2 gets SPEI_RECO_OTHER_CONTEXT. In a way, this offers CoffeeS2 closure: it learns that the utterance was handled elsewhere.
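The choice among the three final events can be summarized as a small decision function. The engine makes this decision internally, so everything here is hypothetical. The UtteranceResult fields, the FinalEvent names, and the 0.5 confidence threshold are assumptions for illustration, not SAPI values.

```cpp
#include <cassert>

// Hypothetical names for the three possible final outcomes of an utterance.
enum FinalEvent { FINAL_RECOGNITION, FINAL_FALSE_RECOGNITION, FINAL_RECO_OTHER_CONTEXT };

// Assumed inputs; a real engine decides this internally.
struct UtteranceResult {
    bool matchedGrammar;      // did any active grammar rule match?
    float confidence;         // engine confidence in that match, 0..1
    bool ownedByThisContext;  // false if another application's grammar matched
};

// Assumed threshold, chosen only for illustration.
const float kMinConfidence = 0.5f;

FinalEvent classify(const UtteranceResult& r) {
    // Words heard but no good enough match anywhere: false recognition.
    if (!r.matchedGrammar || r.confidence < kMinConfidence)
        return FINAL_FALSE_RECOGNITION;
    // A good match: either ours, or claimed by another application.
    return r.ownedByThisContext ? FINAL_RECOGNITION : FINAL_RECO_OTHER_CONTEXT;
}
```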
Because a wide range of events is possible and not all of them are relevant to every application at all times, SAPI can filter them. Using ISpEventSource::SetInterest, you specify which events you want to see. Events not included in the call are not generated. By default, that is, if you never call ::SetInterest, only SPEI_RECOGNITION events are generated. To receive other events, you must call ::SetInterest and include them explicitly. For instance, in InitSAPI():
hr = g_cpRecoCtxt->SetInterest(
	SPFEI(SPEI_RECOGNITION)|SPFEI(SPEI_SOUND_START)|SPFEI(SPEI_SOUND_END),
	SPFEI(SPEI_RECOGNITION)|SPFEI(SPEI_SOUND_START)|SPFEI(SPEI_SOUND_END)
);
//Check return value
As expected, SPEI_SOUND_START and SPEI_SOUND_END are added. Notice that once ::SetInterest is called, you have to include SPEI_RECOGNITION explicitly along with any other events. The first parameter lists the events you want generated, and the second lists the events you want queued. Events can arrive faster than the application can process them, so rather than being lost, they are placed in a queue where they wait until they can be processed. Most of the time you would want the two parameters to be identical: if you were interested in an event in the first place, you would also want to do something with it. However, this is not always the case, and SAPI lets you decide.
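The relationship between the two SetInterest parameters can be sketched with a tiny event-source model. The event ordinals, EventSourceModel, and its methods are hypothetical. In real code the masks come from SPFEI(SPEI_...), and queued events are retrieved with ISpEventSource::GetEvents. The model also simplifies what happens to an event that is of interest but not queued: here it counts as generated and is not stored.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Placeholder ordinals for illustration only; real code uses SPEI_* with SPFEI().
enum EventId { EV_SOUND_START = 1, EV_SOUND_END = 2, EV_RECOGNITION = 3, EV_HYPOTHESIS = 4 };

// One bit per event, similar in spirit to what SPFEI() builds.
constexpr std::uint64_t flag(EventId e) { return std::uint64_t{1} << e; }

class EventSourceModel {
public:
    void setInterest(std::uint64_t interest, std::uint64_t queued) {
        // Every queued event must also be in the interest set.
        assert((queued & ~interest) == 0);
        interest_ = interest;
        queued_ = queued;
    }
    // Returns true if the event passed the interest filter (was generated).
    bool fire(EventId e) {
        if (!(interest_ & flag(e))) return false;        // filtered out entirely
        if (queued_ & flag(e)) queue_.push_back(e);      // waits until fetched
        return true;
    }
    // Fetch the oldest queued event, if any.
    bool next(EventId& out) {
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }
private:
    // Default mirrors the text: recognition events only.
    std::uint64_t interest_ = flag(EV_RECOGNITION);
    std::uint64_t queued_ = flag(EV_RECOGNITION);
    std::deque<EventId> queue_;
};
```

Before setInterest is called, a sound-start event is filtered out. After the sound events are added to both masks, it is generated and queued in arrival order.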
Once these events are generated, you will also want to handle them. You can add the two events to the recognition event loop in ProcessRecoEvent():
case SPEI_SOUND_START:    
	PostMessage( hWnd, WM_STARTEDTALKING, 0, 0 );   
	break;
case SPEI_SOUND_END:    
	PostMessage( hWnd, WM_STOPPEDTALKING, 0, 0 );  
	break;
These cases post WM_STARTEDTALKING and WM_STOPPEDTALKING back to the hWnd window. CoffeeS1 does not handle these messages past this point, but it does demonstrate how it could be done.
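The ProcessRecoEvent() cases can be exercised without a window or a recognizer if a queue stands in for the message loop. In this sketch, FakeWindow, postMessage, processRecoEvent, and the event and message constants are hypothetical. Real Win32 code would define its own messages as WM_APP-based values and call PostMessage.

```cpp
#include <cassert>
#include <queue>

// Hypothetical placeholder event codes and application-defined messages.
enum RecoEvent { EV_SOUND_START, EV_SOUND_END, EV_RECOGNITION };
enum AppMessage { MSG_STARTED_TALKING = 1, MSG_STOPPED_TALKING = 2 };

struct FakeWindow {
    std::queue<AppMessage> posted;  // messages waiting to be processed
};

// Stand-in for PostMessage: enqueue and return immediately, so the speech
// event handler never waits on the window's own processing.
void postMessage(FakeWindow& w, AppMessage m) { w.posted.push(m); }

// Mirrors the switch in ProcessRecoEvent(): translate speech events into
// window messages; events not listed here are ignored in this sketch.
void processRecoEvent(FakeWindow& w, RecoEvent e) {
    switch (e) {
        case EV_SOUND_START: postMessage(w, MSG_STARTED_TALKING); break;
        case EV_SOUND_END:   postMessage(w, MSG_STOPPED_TALKING); break;
        default:             break;
    }
}
```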
Moving from the hypothetical to the practical, CoffeeS2 introduces a new event, SPEI_FALSE_RECOGNITION. As explained earlier, this event indicates that you spoke words but they were not found in the command and control grammar. If you simply made a noise, or if the microphone picked up spurious sounds, SPEI_FALSE_RECOGNITION is not returned. In fact, in those two cases no recognition takes place, because SAPI is smart enough to tell the difference between words and noises. Only sounds close enough to real or acceptable words should trigger an SPEI_RECOGNITION or SPEI_FALSE_RECOGNITION event. At least, that is the intent. It is the speech recognition engine’s responsibility to detect the difference. However, given the wide range of possible sounds and the differences among vendors, a noise may occasionally produce an SPEI_FALSE_RECOGNITION anyway.
In addition to SPEI_FALSE_RECOGNITION, CoffeeS2 introduces a time element. You can determine not only whether the patron spoke a legitimate command but also whether the utterance lasted longer than a predetermined amount of time, which suggests whether it was intentional. The temperamental baristas that CoffeeS2 employs ignore a false recognition if the utterance was too brief. Otherwise, they politely ask for the order to be repeated.
First you need to set the interest. In CoffeeS2, this happens in the WM_INITPANE message handler of CounterPaneProc():
hr = g_cpRecoCtxt->SetInterest(
	SPFEI(SPEI_RECOGNITION)|SPFEI(SPEI_FALSE_RECOGNITION),
	SPFEI(SPEI_RECOGNITION)|SPFEI(SPEI_FALSE_RECOGNITION)
);
Notice that the interests are combined with the bitwise OR operator (|) rather than the logical OR (||). This way you can specify multiple events at the same time. However, the call is in an unexpected place: CounterPaneProc() rather than the initialization routine, InitSAPI(). Remember, you can order drinks only at the counter, so you want this event available only after the patron has entered the counter area. Although there are several ways to approach this problem, CoffeeS2 sets the events as the patron enters each area. Conversely, the WM_GOTOOFFICE handler in the same CounterPaneProc() routine sets the interests again when the patron leaves for the office:
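Per-area interests are easy to reason about as plain bit masks, and the masks also show why the bitwise OR matters. The event ordinals and the interestFor and wants helpers are hypothetical illustrations. Real code builds the masks with SPFEI(SPEI_RECOGNITION) and SPFEI(SPEI_FALSE_RECOGNITION) and passes them to SetInterest.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder ordinals for illustration; real code uses SPFEI(SPEI_...).
enum EventId { EV_RECOGNITION = 3, EV_FALSE_RECOGNITION = 8 };
constexpr std::uint64_t flag(EventId e) { return std::uint64_t{1} << e; }

enum Area { AREA_OUTSIDE, AREA_COUNTER, AREA_OFFICE };

// Interest mask chosen as the patron enters an area, following the CoffeeS2
// policy: false recognitions matter only at the counter.
std::uint64_t interestFor(Area a) {
    std::uint64_t mask = flag(EV_RECOGNITION);
    if (a == AREA_COUNTER)
        mask |= flag(EV_FALSE_RECOGNITION);  // bitwise OR keeps both bits set
    return mask;
}

// True if the mask includes the given event.
bool wants(std::uint64_t mask, EventId e) { return (mask & flag(e)) != 0; }
```

Had the two flags been joined with ||, the expression would evaluate to the bool value true, which converts to 1. The result would no longer identify either event.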
hr = g_cpRecoCtxt->SetInterest( SPFEI(SPEI_RECOGNITION),SPFEI(SPEI_RECOGNITION) );
The same logic applies to the rule activation and deactivation described above.
Because the interests are set when the patron enters a room, one message-handling routine can serve all speech events. This is the same process as in the previous Coffee examples: ProcessRecoEvent() handles the messages, and the subsequent switch statement assigns each event to an action.
In the case of SPEI_FALSE_RECOGNITION, call HandleFalseReco(). To determine the time element, use ISpRecoResult::GetResultTimes, which passes back an SPRECORESULTTIMES structure containing timing information for the result. Specifically, dwTickCount holds the system tick count at the start of the utterance. Subtracting it from the current system tick count yields the time elapsed since the utterance began. If the false recognition took longer than the arbitrarily chosen value MIN_ORDER_INTERVAL, the patron is asked to repeat the request.
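The timing test reduces to unsigned arithmetic on tick counts. In this minimal sketch, startTick stands for SPRECORESULTTIMES::dwTickCount and nowTick for the current GetTickCount() value. elapsedMs, shouldAskToRepeat, and the 500 ms MIN_ORDER_INTERVAL_MS are hypothetical. The sketch also assumes that CoffeeS2's own MIN_ORDER_INTERVAL is measured in milliseconds.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical threshold in milliseconds; CoffeeS2 defines its own
// MIN_ORDER_INTERVAL, and this value is only a placeholder.
const std::uint32_t MIN_ORDER_INTERVAL_MS = 500;

// Elapsed milliseconds between two 32-bit tick counts. Unsigned subtraction
// stays correct even if the counter wrapped between the two readings.
std::uint32_t elapsedMs(std::uint32_t startTick, std::uint32_t nowTick) {
    return nowTick - startTick;
}

// Decide whether a false recognition deserves a polite "please repeat".
bool shouldAskToRepeat(std::uint32_t startTick, std::uint32_t nowTick) {
    return elapsedMs(startTick, nowTick) > MIN_ORDER_INTERVAL_MS;
}
```

The subtraction is done on unsigned 32-bit values, so it still gives the right answer when the tick counter wraps around, which GetTickCount does after about 49.7 days.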