Chapter 3
With the basic structure for speech recognition firmly in place, CoffeeS1 expands word recognition capabilities. You navigate around the coffee shop, and you can place orders for different kinds of coffee drinks.
The example focuses on grammars generally and on command and control grammar specifically. The following topics are discussed:
· Grammars: Command and Control, dictation.
· Phrase: Phrase structure, word recognition.
· Grammar Files: XML tagging.
Command and Control
The last example (CoffeeS0) was not robust. You were limited to about five words. However, those five words are of special interest, and with them you could move around the application. Using grammar in the command and control function limits the words that can be used and bestows specialized meanings upon them. This is convenient for some uses in applications. In the last chapter, you learned that grammars could have limited recognition contexts such as for menus. As an example, you would want the application to respond, or even to attempt to respond, only to certain words relating to menu items or to the menu bar such as “file,” “open,” or “print.”
These words are provided in an exclusive list. If the word is not found, it is not recognized. Also, word order matters in some cases. In CoffeeS0, “Go to counter” was understood and “Counter go to” was not. Using a list approach is also called rule-based or context-free grammar. Words are evaluated according to a fixed set of rules. In short, the word is either in the list or it is not. There is no attempt to figure out the intent of the word based on the words that came before or after it. That is, there is no context for the words.
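The list approach can be pictured with a small sketch. The following C++ is only an illustration and is not SAPI code: the names commandList and isRecognized, and every phrase other than "go to counter," are assumptions made for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical exclusive list of phrases. A real application defines its
// vocabulary in a grammar file rather than in code.
const std::vector<std::string>& commandList() {
    static const std::vector<std::string> commands = {
        "go to counter", "enter shop", "enter store"};
    return commands;
}

// Returns true only when the utterance is exactly one of the listed phrases.
// No context is considered: the phrase is either in the list or it is not.
bool isRecognized(const std::string& utterance) {
    for (const std::string& command : commandList()) {
        if (utterance == command) return true;
    }
    return false;
}
```

Because the comparison is against whole phrases, isRecognized("go to counter") succeeds while isRecognized("counter go to") does not, which mirrors the word-order behavior described above.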
SAPI 5 uses extensible markup language (XML) to create this list. The file may be generated ahead of time or compiled during program execution. Because command and control deals mostly with lists, words can be added dynamically and can accommodate new situations easily.
CoffeeS1 addresses word order and the ability to sequence words.
Dictation
Command and control has obvious shortcomings. As mentioned, it is limited in the words used. Someone has to spend the time to manually define the command set. Often, you will want to speak any word and have it recognized. This is what a traditional speech recognition (SR) program does. That is, you can dictate any word, no matter how esoteric, into a word processor and have that word translated into text. For this use of a speech recognition engine, you must move from command and control to a dictation grammar. Instead of an XML-based vocabulary, dictation grammar uses a much more extensive range of words and determines each word based on context. The words immediately before and after it are studied and dictation grammar chooses the most likely outcome. For this reason, this is also called a statistical language model (SLM).
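To see what "determines each word based on context" can mean, consider the toy sketch below. It is not how the SAPI engine is implemented. It assumes a hypothetical table of bigram counts (how often one word followed another in some training text) and picks the candidate that most often follows the previous word. The names BigramCounts and pickByContext are invented for this illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy bigram table: counts[{first, second}] is how often `second`
// followed `first` in some hypothetical training text.
typedef std::map<std::pair<std::string, std::string>, int> BigramCounts;

// Picks the candidate that most often follows `previous`.
// Ties keep the earlier candidate; unseen pairs count as zero.
std::string pickByContext(const std::string& previous,
                          const std::vector<std::string>& candidates,
                          const BigramCounts& counts) {
    assert(!candidates.empty());
    std::string best = candidates.front();
    int bestCount = -1;
    for (const std::string& candidate : candidates) {
        BigramCounts::const_iterator it =
            counts.find(std::make_pair(previous, candidate));
        const int n = (it == counts.end()) ? 0 : it->second;
        if (n > bestCount) {
            best = candidate;
            bestCount = n;
        }
    }
    return best;
}
```

A real statistical language model looks at more context than one preceding word and works with probabilities rather than raw counts, but the idea of choosing the most likely outcome is the same.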
SR engines vary widely in vocabulary. The Microsoft SR engine that SAPI 5 includes has 60,000 English words and provides an adequate engine for most people. Other engines are specialized, for example, for the legal and medical professions. These can be massive databases generated by commercial firms. In addition, engines for other languages, including Japanese, Chinese, German, and Russian, are also available.
As widely disparate as these languages and usages seem, SAPI 5 handles them in the same way. The programming approach is very similar. Two other samples provide a dictation approach to speech recognition: Simple Dictation and Dictation Pad. These may be found in the SAPI 5.1 SDK and are documented separately. Coffee, on the other hand, limits itself solely to command and control usage.
SAPI returns the actual recognized words through a series of structures collectively called phrases. You have seen evidence of this with the SPEI_PHRASE_START event, which indicates the start of the recognition process. For command and control uses, it is a two-step process: determine the activated rule, and then inspect the elements (or words) within that phrase.
CoffeeS0 briefly introduced the first step. While processing a recognition event, you discovered which rule was activated but stopped there. CoffeeS1 takes the next logical step to recognize the exact words used so patrons can get their drinks.
This examination takes place in the ExecuteCommand() routine. As in CoffeeS0, one of the parameters is the phrase. Remember, this phrase is the final result, rather than a hypothesis, so assume SAPI is savvy enough to translate exactly what you said. You will be depending on this phrase for navigation. At this point, you are only interested in which grammar rule was activated. That means the patrons still cannot go to different places in the shop even though they might request to do so. All navigation statements always lead to the counter.
However, CoffeeS1 introduces a new grammar rule: VID_EspressoDrinks. Defined in coffee.xml, this rule lists all drinks available to the customers. Actually, it is several rules bound together that will be discussed later. Again, you are only concerned with the top-level rule, VID_EspressoDrinks. If you place an order that matches this rule, the rule activates and the result is passed back from SAPI. In typical demanding coffee shop fashion, this could be “Get me an iced decaf single tall peppermint whole espresso.” Orders could even include “single triple short tall grande,” and still be valid SAPI grammar although it might raise an eyebrow (if that were possible).
With the order placed and recognition successful, CoffeeS1 now gets to the task of dissecting the phrase. From the original phrase, the ISpPhrase interface provides the method GetPhrase() to construct the elements (or word) list.
SPPHRASE *pElements;
if (SUCCEEDED(pPhrase->GetPhrase(&pElements)))
If successful, pElements contains all the information required to construct the sentence. To determine which rule activated and then to learn more about it, look at the member Rule. This is a structure (SPPHRASERULE) that fully describes the rule. The rule ID is found in its member ulId. CoffeeS1 numerically defines the rule VID_EspressoDrinks in the XML file, so that matching becomes easy. Use a simple switch statement in the code to select the more specific handling routine.
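The dispatch on the rule ID can be sketched without any SAPI headers. In the sketch below, PhraseRule and Phrase are simplified stand-ins that model only the Rule and ulId members mentioned above, the numeric values of the two IDs are made up (the real ones come from coffee.xml), and describeRule is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Simplified stand-ins for SPPHRASERULE and SPPHRASE. Only the members
// discussed in the text are modeled.
struct PhraseRule { unsigned long ulId; };
struct Phrase { PhraseRule Rule; };

// Hypothetical values for illustration; the real IDs are defined in coffee.xml.
const unsigned long VID_EspressoDrinks = 1;
const unsigned long VID_Navigation = 2;

// Chooses a handler based on the ID of the top-level rule that fired.
std::string describeRule(const Phrase& phrase) {
    switch (phrase.Rule.ulId) {
        case VID_EspressoDrinks: return "order";
        case VID_Navigation:     return "navigate";
        default:                 return "unknown";
    }
}
```

In the real application each case would call its own handling code rather than return a string.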
Two things need to be pointed out about the upcoming word list. First, the words are represented numerically rather than by a string. Associating the value of the word to the string itself uses a look-up table. In this case, CoffeeS1 stores the words as a resource in the application.
Second, the actual words are formed by a linked list with each word represented by a member in the sequence. The first element is a structure (of type SPPHRASEPROPERTY) pointed to by pElements->pProperties, and each subsequent structure is reached through the previous SPPHRASEPROPERTY’s pNextSibling member. Traveling this chain is a standard linked list operation.
case VID_EspressoDrinks:
{
    // This memory will be freed when the WM_ESPRESSOORDER
    // message is processed.
    ULONG *pulIds = new ULONG[MAX_ID_ARRAY];
    const SPPHRASEPROPERTY *pProp = NULL;
    int iCnt = 0;
    if ( pulIds )
    {
        // Zero marks an unused slot in the array
        ZeroMemory( pulIds, MAX_ID_ARRAY * sizeof(ULONG) );
        pProp = pElements->pProperties;
        // Fill in an array with the drink properties received
        while ( pProp && iCnt < MAX_ID_ARRAY )
        {
            pulIds[iCnt] = static_cast< ULONG >(pProp->vValue.ulVal);
            pProp = pProp->pNextSibling;
            iCnt++;
        }
        // Hand the array to the owning window for display
        PostMessage( hWnd, WM_ESPRESSOORDER, 0, (LPARAM) pulIds );
    }
}
break;
To inspect the elements, the code steps through the links one node at a time until the next node is NULL (meaning there are no more nodes to traverse) or it has already visited MAX_ID_ARRAY nodes. CoffeeS1 imposes this MAX_ID_ARRAY limitation.
Besides stepping through the linked list, this code also stores the words in an internal array for later processing. This not only keeps a record of the words but also helps with sorting. Remember, you do not need to worry about word order. Customers can say “get me a mocha two percent tall,” and still end up with a tall two percent mocha. However, if you do accept words in any order, then you need the ability to sort internally. To indicate empty array elements, flag them with a zero, hence the Win32 ZeroMemory() call. You can use other methods; this one was just convenient for this example.
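The traversal, the look-up table, and the sorting can be tried outside of SAPI with a portable sketch. Everything here is an assumption made for illustration: PropertyNode stands in for SPPHRASEPROPERTY and models only a value and the pNextSibling link, kMaxIds stands in for MAX_ID_ARRAY with an arbitrary value, and collectIds and describeOrder are hypothetical helpers. CoffeeS1 itself keeps its strings in application resources rather than in a std::map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for SPPHRASEPROPERTY: a numeric word value plus the sibling link.
struct PropertyNode {
    unsigned long value;
    const PropertyNode* pNextSibling;
};

const std::size_t kMaxIds = 25;  // arbitrary stand-in for MAX_ID_ARRAY

// Walks the sibling chain and collects at most kMaxIds values, in spoken order.
std::vector<unsigned long> collectIds(const PropertyNode* head) {
    std::vector<unsigned long> ids;
    for (const PropertyNode* p = head;
         p != NULL && ids.size() < kMaxIds;
         p = p->pNextSibling) {
        ids.push_back(p->value);
    }
    assert(ids.size() <= kMaxIds);
    return ids;
}

// Sorts the IDs numerically, then joins their strings from the look-up table.
std::string describeOrder(std::vector<unsigned long> ids,
                          const std::map<unsigned long, std::string>& names) {
    std::sort(ids.begin(), ids.end());
    std::string text;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        std::map<unsigned long, std::string>::const_iterator it = names.find(ids[i]);
        if (it == names.end()) continue;  // skip IDs without a table entry
        if (!text.empty()) text += ' ';
        text += it->second;
    }
    return text;
}
```

If the table maps 1 to "tall," 2 to "two percent," and 3 to "mocha," a chain spoken as mocha, two percent, tall is collected as 3, 2, 1 and described as "tall two percent mocha." The numeric values decide the display order, which is why the words are kept as numbers until the last moment.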
After going through the list, CoffeeS1 is ready to display the newly derived information. A message (WM_ESPRESSOORDER) is posted to the owning window, indicating that the application has additional processing to do. At this point, SAPI is no longer involved. SAPI will even free the objects it created, although CoffeeS1 must manually free pElements because it requested that structure itself. Even so, COM is smart enough to delete any nodes in the linked list associated with it. The rest of the processing is on CoffeeS1’s part and mostly updates the screen. When you speak again, the whole process above is repeated.
As mentioned, the Coffee examples use command and control grammar. This is a discrete list of words associated with certain rules. Coffee keeps this list in two forms. An XML-based file allows you to maintain this list. Ultimately, SAPI can only read a binary or compiled version of that file. This is a grammar configuration that is saved with the .cfg file suffix. It was by clever design that CFG not only means “configuration,” but also “context-free grammar.” Approbation aside, grammar files may be generated dynamically during the application’s run time. If it is provided with only an XML file, SAPI will compile the file automatically and use the resulting grammar. On the other hand, the grammar may also be compiled ahead of time by the programming team. This restricts access to the vocabulary so users cannot change grammars unexpectedly. This method is also faster for applications since no compiling time is required during operation. The SAPI SDK provides a grammar compiler called GramComp. Grammar compilation using this tool is documented separately.
SAPI defines the XML tags and their uses and lists them in Reference API. For a more complete discussion, see Text grammar format. As a brief overview of the structure, look at coffee.xml in the CoffeeS1 project. There are several rules defined but only two are considered top level: VID_EspressoDrinks and VID_Navigation. These are the significant rules for SAPI. When a rule match is made, it is one of these IDs that is passed back to the application. Also, look at ExecuteCommand(). The two case statements coincide with the top-level rule names.
The TOPLEVEL tag within the RULE statement gives these rules their special status. Not only does this identify the rule as being top level, but it also sets the activation state. Only top-level rules may be activated or deactivated. SAPI recognizes active rules and conversely does not recognize deactivated ones. The application may change the state of the rules during execution. If a rule is no longer needed, it may be deactivated. This allows you to turn rules on and off based on the current recognition context. For example, if you have a menu or menu item deactivated, SAPI will not need to attempt to recognize the words associated with it. When the menu is active again, the rule will likewise be activated.
The words or phrases are listed inside the rule. The words or phrases may be optional or required. As the name implies, optional words are not required for a successful rule match. SAPI adds them as a convenience to the speaker. “Please enter the shop” is natural and pleasant sounding as opposed to the demanding version of the statement. Required words are, of course, required. However, you can present an alternative word list from which any one word can be used to complete the match. In the case of VID_Navigation, you can say either “enter,” or “go to,” but not both.
In the same manner, you may reference other rules but not other top-level rules. Continuing the VID_Navigation example, the last portion of the requirement is that the rule VID_Place must be successfully matched. The three alternatives, “counter,” “shop,” and “store,” are defined as VID_Place. If you say one of these three words, the rule is successfully matched. Upon successful completion of all the requirements, the top-level rule VID_Navigation matches, and an SPEI_RECOGNITION event passes back to the application.
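The way optional words, alternatives, and a rule reference combine can be modeled in a few lines of C++. This is a hand-written approximation and not the SAPI matching algorithm. It assumes lowercase input, assumes that “please” and “the” are the optional words, and uses the hypothetical helpers splitWords, matchPlace, and matchNavigation.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Splits an utterance into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// Models the referenced rule VID_Place: exactly one of three alternatives.
bool matchPlace(const std::string& word) {
    return word == "counter" || word == "shop" || word == "store";
}

// Models the top-level rule VID_Navigation as
//   [please] (enter | go to) [the] <VID_Place>
// where brackets mark words assumed to be optional.
bool matchNavigation(const std::string& utterance) {
    const std::vector<std::string> w = splitWords(utterance);
    std::size_t i = 0;
    if (i < w.size() && w[i] == "please") ++i;  // optional word
    if (i < w.size() && w[i] == "enter") {      // first alternative
        ++i;
    } else if (i + 1 < w.size() && w[i] == "go" && w[i + 1] == "to") {
        i += 2;                                 // second alternative
    } else {
        return false;                           // a required word is missing
    }
    if (i < w.size() && w[i] == "the") ++i;     // optional word
    if (i >= w.size() || !matchPlace(w[i])) return false;
    ++i;
    return i == w.size();                       // nothing may follow the place
}
```

With this model, “please enter the shop” and “go to counter” match, while “enter go to shop” fails because only one of the alternatives may be used, and “counter go to” fails because the words are out of order.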
Additional study of coffee.xml helps you understand how complex rules are constructed. The other rules are basically the same format and follow the same structure. Look up unfamiliar tags in the “Text grammar format” section of the reference API. Curiously enough, the grammar is case sensitive. “The” and “the” may both appear as separate entries. They may even have the same ID such as <ID NAME="The" VAL="1" /> and <ID NAME="the" VAL="1" />. While these two entries have the same pronunciation, consider other words such as “Polish” and “polish.” This applies equally to rule names. There is no requirement for engines to recognize the words as different; however, engine vendors may want to do so. By making the words case sensitive, newer engines can take advantage of these differences.
The first portion of the file assigns numeric values to the individual elements. SAPI does not require this, although in the CoffeeS1 example it is what lets you sort the words. The sorting is based on the “VAL=” attribute. Remember to keep the words actually found in the array pulIds for this purpose.
Activating the rules is the same as in CoffeeS0.