when to correct a misrecognition
Earlier, Chuck (I believe it was) was discussing how you shouldn't always correct misrecognitions (using "correct that"); ideally, you should listen to the playback to check whether you were speaking clearly or mumbling. Only correct when you were speaking clearly and distinctly. Otherwise Dragon NaturallySpeaking will get confused and eventually completely muddled.
I don't always have playback available to me, plus I don't always want to take the time to follow this procedure. One thing I've been doing for misrecognitions is saying "correct that" and looking at the list: sometimes the list has little relation to what I actually said. For example, I tried to say "misrecognition" and the list had nothing but "his wrecked ignition" and similar wacky phrases. In that case I figure that I must have really mumbled or mispronounced. On the other hand, sometimes the exact phrase I wanted is in the list. Then I go ahead and choose it.
Especially when the exact phrase I wanted is number one in the list, I figure that I pronounced it pretty well.
Can Chuck or someone who understands Dragon NaturallySpeaking confirm whether this sounds like a good idea?
Mike

mossey wrote:
Earlier, Chuck (I believe it was) was discussing how you shouldn't always correct misrecognitions (using "correct that"); ideally, you should listen to the playback to check whether you were speaking clearly or mumbling. Only correct when you were speaking clearly and distinctly. Otherwise Dragon NaturallySpeaking will get confused and eventually completely muddled.
Not exactly correct. If the playback is clearly what you said, or intended to say, and the correct word or phrase is in the selections shown, then select the correct word or phrase, but don't train it. Again, the reason is that what you said originally was said in the context of your total dictation. If you train your correction under this condition, you run the risk of training the word or phrase out of the total context of your original dictation (i.e., we tend to speak differently during normal speaking than when we pronounce a word or phrase during correction). This changes the Acoustic Model representation for the way you normally speak. If the difference is significant, then you alter the spoken form in a way that may increase the propensity for the correction to be misrecognized. It is important to note that a one-time correction/training will not generally result in future misrecognitions. However, doing this continually eventually alters the Acoustic Model in a negative way, thus causing accuracy degradation, which requires correction, which further exacerbates the problem. In short, it is the repetition of correcting and training the same word or phrase over time that is one of the causes of accuracy degradation.
The other part of this is that if a word or phrase is misrecognized, you should always correct it, but only train it under the above conditions. DNS does not learn from your normal dictation. The speech models and the recognizer work on the principle that if you do not make a correction, what was said and what was displayed are correctly associated. Only corrections train the speech models, and DNS only learns from correcting misrecognitions. Acoustic and Language models are distinctly different and how each is modified is distinctly different. The Language Model is only modified by document analysis. The Acoustic Model is only modified by training. When performing corrections, selecting the correct word or phrase from the selection options, or typing the correction, is based on the document analysis algorithm and changes the Language Model (context). Training changes the Acoustic Model. Performing corrections links the Acoustic Model to the Language Model either in terms of what was said originally or in terms of any training done during correction. It is critically important to remember these distinctions. This is the way SR works. Unfortunately, it is not subject to debate. SR algorithms work the way they do whether we disagree with them or not.
I don't always have playback available to me, plus I don't always want to take the time to follow this procedure. One thing I've been doing for misrecognitions is saying "correct that" and looking at the list: sometimes the list has little relation to what I actually said. For example, I tried to say "misrecognition" and the list had nothing but "his wrecked ignition" and similar wacky phrases. In that case I figure that I must have really mumbled or mispronounced. On the other hand, sometimes the exact phrase I wanted is in the list. Then I go ahead and choose it.
First, if what you mean is that Automatic playback on correction is turned on, but you do not always have playback when you correct something, then your initial statement is valid. However, not always wanting to take the time to follow this procedure is ignoring how DNS works. Since you take the time to correct a misrecognition, it does not add significantly to the time required to ensure that you use the correction function correctly. Not doing so is anathema to improving accuracy. There are three principles involved here.
1. Practicing correct correction techniques saves time because it increases overall accuracy and reduces the necessity to continually deal with misrecognitions.
2. Practicing correct correction techniques continuously results in their becoming instinctive and automatic, which in turn saves time.
3. Failure to practice correct correction techniques generally adds 3 to 4 times the amount of time required in overall loss of dictation speed and general loss of productivity. If you save 10 minutes by cutting corners you add up to 40 minutes to your overall time to complete dictation tasks, and loss of productivity increases the more you practice incorrect techniques.
Second, if your user accuracy is optimal, then the correct selection option in the Spell dialog almost always shows up as #2. Occasionally it will show up as #3 or #4. Rarely, if ever, will it show up as #5 or greater (keep in mind that DNS only selects the 9 highest probability matches whether in the correction function or during normal dictation). If you are getting the misrecognitions you cite as "wacky" phrases, then something is very wrong with either your hardware, your user, or some combination of the two.
Third, correct phrases containing misrecognitions, not words. This adjusts (corrects) the context recognition (Language Model) exactly the same way that analysing documents does. In fact, both of these functions use the same algorithm. Correcting individual words simply does not work unless the word being corrected is "wacky" (i.e., totally wrong and there is no selection in the Spell dialog or Correction window). Whether this occurs with regard to words or phrases, you should train both the correction and the misrecognition. This is probably the one circumstance where training should always be done. Otherwise, train words by using either the Vocabulary Editor or the "Train a word" command. Vocabulary probability (words by themselves and not based on context) is fixed and never changes regardless of training or correction. The only changes that are ever made to the Vocabulary (Monogram Language Model - non context based) are relative to the underlying phonetic pronunciations. When you train an individual word, the underlying phonetic representation of the pronunciation is adjusted accordingly, but the probability coefficient of any word, other than custom words that do not exist in the background vocabulary list, remains the same. Probability coefficients are modifiable only for the bigram and trigram language models (context) and custom words. The reason for this is too complex for discussion in this post. Suffice it to say that there is a critically important reason for this.
Fourth, when correcting phrases, DNS applies the appropriate Language Model based on the number of words:
1. One word - Vocabulary (Monogram LM)
2. Four words or less (Bigram LM - context analysis)
3. Five words or more (Trigram LM - context analysis)
It is sufficient to note here that invoking the Trigram LM will result in greater accuracy, as well as greater speed of display, than the Bigram or Monogram LMs. This is why training and correction of phrases of 5 or more words is more effective at optimizing accuracy. Explaining why beyond this is again complex and beyond the scope of this post.
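For readers who find a lookup easier to scan than a list, the word-count rule above can be written as a few lines of Python. This is only a restatement of the thresholds given in this post; the function name and the returned labels are made up for illustration and are not part of DNS.

```python
def language_model_for(phrase: str) -> str:
    """Return the language model that, per the rule in this post,
    applies to a corrected phrase of the given length.

    Hypothetical helper: the name and labels are illustrative only.
    """
    word_count = len(phrase.split())
    if word_count == 0:
        raise ValueError("phrase must contain at least one word")
    if word_count == 1:
        return "monogram"   # one word: Vocabulary (Monogram LM)
    if word_count <= 4:
        return "bigram"     # four words or less: Bigram LM
    return "trigram"        # five words or more: Trigram LM


if __name__ == "__main__":
    for phrase in ("pledge", "pledge allegiance to", "I pledge allegiance to the"):
        print(f"{phrase!r}: {language_model_for(phrase)}")
```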
Fifth, if you have the Automatic playback on correction feature turned on, and you do not hear automatic playback of your original dictation of the word or phrase being corrected, this means that there is no underlying related speech in the speech buffer. This can be because you made a correction manually to the text containing this word or phrase, or, for some other reason, the text buffer was flushed. As I stated in a previous post, this is a tricky condition. Since you have no playback of your original spoken dictation of this word or phrase, you have to decide whether or not you said it correctly in order to decide whether or not you should train it. Under this condition, if you elect to train your correction, or the correct selection option, only that option will show up in the subsequent training dialog. If there was playback, one way of determining this is that if you select a correction option, or type or dictate the correct word or phrase in the text box at the top of the Spell dialog, and then select train, both the correction and the misrecognition will show in the training dialog box. The procedure that I generally use and recommend is to assume that what you said originally was correct (your spoken form), and simply select or type the correct option and click OK without training. This is usually the most appropriate, effective, and optimal choice.
Lastly, never, and I repeat never, use the correction function to change a word or phrase that is correct (i.e., not a misrecognition). Instead, select it and redictate what you wanted to say (Select-and-Say). Using the correction function on correctly recognized text can result in entering contradictory and conflicting links between the Acoustic Model and Language Models. The result can be total misrecognition of the words or phrases involved, and is likely one reason why accuracy degrades significantly for some users.
Especially when the exact phrase I wanted is number one in the list, I figure that I pronounced it pretty well.
#1 is the word or phrase that was displayed during the original dictation. If it is correct, then you are not correcting a misrecognition; you are correcting a word or phrase that was initially correctly recognized. If you look carefully at the options in the Spell dialog or Correction window, you will notice that selection #1 has no words in bold. This means that these were the words that were displayed when you originally dictated that word or phrase. If they are incorrect, then the correct word or phrase should be between #2 and #9. These selection options will have at least one word in bold, which indicates the correction that will be made to the original word or phrase. In short, if #1 in the selections is the correct word or phrase, then it was never misrecognized to begin with. However, this is again another reason why the playback option is important in the proper use of the correction function.
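To make the bolding idea concrete, here is a small Python sketch that marks which words of an alternative differ from the originally displayed text. It is an illustration only: it uses asterisks in place of bold, the function name is invented, and the word-level diff from Python's standard difflib module is my own choice, not a description of how DNS builds its list.

```python
from difflib import SequenceMatcher


def mark_changes(displayed: str, alternative: str) -> str:
    """Wrap in asterisks the words of `alternative` that differ from
    `displayed`, imitating the bold words in the correction list.

    Hypothetical helper; the diff strategy is an assumption.
    """
    shown = displayed.split()
    option = alternative.split()
    marked = []
    matcher = SequenceMatcher(a=shown, b=option, autojunk=False)
    for tag, _a_start, _a_end, b_start, b_end in matcher.get_opcodes():
        for word in option[b_start:b_end]:
            # "equal" runs were displayed as dictated; everything else
            # is a word the correction would change or add.
            marked.append(word if tag == "equal" else f"*{word}*")
    return " ".join(marked)


if __name__ == "__main__":
    displayed = "I fed the pigeons to the flag"
    print(mark_changes(displayed, displayed))
    print(mark_changes(displayed, "I pledge allegiance to the flag"))
```

An option identical to the displayed text comes back with no marks at all, which corresponds to selection #1 in the list.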
There are numerous other training and optimization techniques that augment these and improve the accuracy of DNS, but these are better left to a subsequent post.
Chuck Runquist
Former DNS SDK & Senior Technical Solutions PM for DNS with Lernout & Hauspie (L&H)
By "correct a phrase", do you mean include the words on either side? Or 4 words? or 5?
I've noticed that I have greater hope of getting what I want in the correction window, the smaller the phrase.
If I have a large phrase with no valid alternatives, I'm forced to enter spell mode, and then spell out or edit an entire phrase. It is much faster to correct individual words. In cases where I suspect that I'm going to get a bad phrase (mostly because I'm using new technical language), I find it useful to say words one at a time and correct them one at a time. This is because a phrase could easily have several adjoining errors, which makes for a nightmare of correction.
mossey wrote:
By "correct a phrase", do you mean include the words on either side? Or 4 words? or 5?
A phrase can be several words. We generally only make corrections to to words and sometimes three words. For example, in the misrecognition in the previous sentence, we would only select, "to words." The word "and" after the word "words" is extraneous.
We've been using and training people in speech recognition for over 12 years. We've found that the people who succeed most quickly are users who try different things to see what works. I suggest your question is simple enough to have answered itself, or was easily verifiable by trying it several times in different ways.
--
Martin Markoe, eMicrophones, Inc.
The best microphones for Speech Recognition
See us at: http://www.eMicrophones.com/index.asp
Read, "Key Steps to High Speech Recognition Accuracy" at:
http://www.emicrophones.com/docDetails.asp?Documen...
Martin Markoe wrote:
A phrase can be several words. We generally only make corrections to to words and sometimes three words. For example, in the misrecognition in the previous sentence, we would only select, "to words." The word "and" after the word "words" is extraneous.
We've been using and training people in speech recognition for over 12 years. We've found that the people who succeed most quickly are users who try different things to see what works. I suggest your question is simple enough to have answered itself, or was easily verifiable by trying it several times in different ways.
I'm going to make a dangerous assumption and correct Marty on one point. If I'm not mistaken, Marty is making an incorrect assumption about bigram and trigram models.
The bigram and trigram algorithms can be graphically represented similarly to the tree structure that you see in the folder tree in Windows Explorer. However, unlike the folder tree, the tree structure in bigram and trigram algorithms is right side up and top down, and each has a left context and a right context. That is, the search through these algorithms starts with a word and progresses downward simultaneously through the branches on each side of the tree, which is composed of the probability matrix for two word (bigram) and three word (trigram) context combinations (i.e., the current possible combinations of word contexts coded in these Language Models based on how and whether these LM's have been modified via document analysis and/or via making corrections as noted in my previous post).
In the bigram model this search is conditioned on two words. However, a bigram search consists of the one word context that includes the target word AND the word on the left as well as the word on the right of the target word. As I have said, this left/right search is performed simultaneously until the top 9 most likely matches are found (BestMatchIII technology used in DNS). The match with the highest probability is either displayed during dictation, or shown as the #2 option in the Spell dialog/Correction window.
The trigram model works identically, except that it contains the tree of the possible two word combinations to the left and right of the target word.
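As a very rough illustration of ranking candidates by their left and right context, here is a toy Python sketch. Everything in it is an assumption made for the example: the counts are invented, the add-one scoring is a textbook simplification, and the function name is hypothetical. It is not the BestMatch algorithm, and a real recognizer works on acoustic/phonetic data rather than on written words.

```python
from collections import Counter


def n_best(left, right, candidates, bigram_counts, n=9):
    """Rank candidate words for the slot between `left` and `right`.

    Toy scoring: (count of "left candidate" + 1) * (count of
    "candidate right" + 1). The +1 keeps unseen pairs from scoring zero.
    """
    scored = []
    for word in candidates:
        left_score = bigram_counts[(left, word)] + 1
        right_score = bigram_counts[(word, right)] + 1
        scored.append((left_score * right_score, word))
    # Highest score first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]


if __name__ == "__main__":
    # Invented counts, for illustration only.
    counts = Counter({
        ("to", "two"): 3,
        ("two", "words"): 8,
        ("to", "to"): 1,
        ("to", "words"): 1,
        ("to", "too"): 1,
    })
    print(n_best("to", "words", ["to", "two", "too"], counts))
```

The default of n=9 mirrors the nine matches mentioned above; the trigram case would condition on two words on each side instead of one.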
With this in mind, the best selections for making corrections are actually 3 words (bigram) and 5 words (trigram) done so as to include the word to the left and the word to the right of the target word (bigram), or the two words to the left and two words to the right of the target word(s) (trigram).
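The selection rule just described (target word plus one word on each side, or two on each side) can be sketched in Python as follows. The function name and its context parameter are hypothetical, and clipping at the start or end of the text is my own assumption about what to do when there are not enough neighbouring words.

```python
def correction_window(words, target_index, context=1):
    """Return the target word plus `context` words on each side.

    context=1 gives the 3-word (bigram) selection and context=2 gives
    the 5-word (trigram) selection described above. Hypothetical helper;
    the window is clipped at the edges of the dictated text.
    """
    if not 0 <= target_index < len(words):
        raise IndexError("target_index is outside the dictated text")
    start = max(0, target_index - context)
    end = min(len(words), target_index + context + 1)
    return words[start:end]


if __name__ == "__main__":
    dictated = "I pledge allegiance to the flag".split()
    target = dictated.index("allegiance")
    print(" ".join(correction_window(dictated, target, context=1)))
    print(" ".join(correction_window(dictated, target, context=2)))
```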
In addition, although the time required to search for best matches using these algorithms is minimal and depends on the Speed vs. Accuracy setting in the DNS Options|Miscellaneous tab, it does take slightly longer to search using the trigram algorithm than it does for the bigram algorithm. Also, keep in mind that DNS searches the Monogram (vocabulary) model first to see if it can find an exact match without invoking the bigram and trigram models. This is the fastest search. For example, some words, notably your custom vocabulary, proper names and unique words/phrases, such as acronyms or words/phrases that are not context essential (William Jefferson Clinton), are generally not misrecognized by themselves unless they are not in the Active Vocabulary.
Further, there are three general types of misrecognitions that determine how you should select words/phrases for correction.
1. Single word misrecognitions (an vs. and, or vs. are, etc.)
2. Substitutions (one or more words substituted for a single word spoken; such as, "and the" for "and")
3. Multiple word misrecognitions (I fed the pigeons to the flag vs. I pledge allegiance to the flag - an actual reported case of this type)
Each of these requires a different approach:
For #1 the best approach is the bigram selection (target plus word to left and word to right - 3 words)
For #2 the best approach is the trigram selection (target substitution misrecognition plus two words to the right and two words to the left). This is generally 6 words or less depending on the number of words substituted, but works best using the trigram model selection approach.
For #3 the best approach is the trigram selection, but depends on the number of words to be corrected. In the example above, which is the most common form or type of multiple word misrecognition, the best selection would be "I fed the pigeons to".
Lastly, let's recap the general rules.
1. Always correct phrases, not words, unless the word misrecognition is a glaring and significant one (example - fed vs. pledge). In this case both correcting and training the difference are viable and important because the problem lies in both the misrecognition and the Acoustic Model. That is, both words are in the vocabulary; they are just not being correctly represented in the speaker's phonetic speech patterns (Acoustic Model).
2. Generally use 3 word or 5 word selections for correction (target word plus 1 or 2 words to the left and right of the target (misrecognized) word). In 90% of correction issues the bigram (3 word) selection is sufficient and recommended. However, the trigram model selection (5 words) doesn't hurt, and may even help to improve accuracy in some cases (see below).
3. Select the number of words to correct based on the complexity of the context. That is, how dependent is the target (misrecognized) word/phrase on the one or two word left and right context words?
4. The type of misrecognition involved (i.e., 1, 2, or 3 above) should determine the number of words selected for correction.
5. If a misrecognition involves a single word and the misrecognition is not even close to the spoken/dictated word (fed vs. pledge or pigeons vs. allegiance), it is a good rule of thumb to first check the vocabulary to see if the spoken/dictated word is present. If not, add it to the vocabulary and train it there. The reason for this is that users report extraneous words being added to their custom vocabulary without their awareness and/or wish to do so. This is caused by correcting for and/or training unknown (i.e., not in the backup vocabulary) words during correction. If you want to avoid this, correct by voice (Select-and-Say) rather than use the correction function. Larry Allen made a good point when he suggested some time ago that the first step should be to select the misrecognition (using the same criteria for selecting misrecognitions for correction) and redictate it. If it is correctly recognized on redictation, then using the correction function is superfluous and unnecessary unless or until you experience repeated occurrences of the same misrecognition. In many cases, apparent misrecognitions are enunciation errors rather than true misrecognitions. It is important for optimal user accuracy to make this distinction and respond accordingly.
Lastly, keep in mind that when dictating DNS uses the following flow to search for best matches:
1. Vocabulary is checked first because the vocabulary (i.e., Active Vocabulary) contains both words and their pronunciations. The speech models work together to establish the phonetic context first and then compare the analysis to the vocabulary to transcribe the phonetic strings to words.
2. The bigram and trigram models are phonetically based, not word based. They are used to determine the best match between what the speaker said (acoustic/phonetic strings) and the best phonetic match.
3. Once the best match has been determined, it is then recompared to the vocabulary for transcription to the displayed text.
Remember that this entire process is done acoustically/phonetically. Until the final transcription display, the process does not involve words, like looking a word up in the dictionary. It involves many complex mathematical algorithms all based on binary phonetic comparisons and probability matrices. That is, analog speech to digital strings, digital strings to binary (digital) phonetic representation, comparison to the equivalents in the language models, and finally transcription of the binary phonetic results into word equivalents based on the best match found using the vocabulary. This is overly simplified, but as close an approximation as possible without getting into the complexities of statistics and Hidden Markov Models.
Chuck Runquist
Former DNS SDK and Senior Technical Solutions PM for DNS with L&H
Oh my goodness! This one is going straight into my little (actually now rather middle-sized) "Sayings of Chairman Chuck" collection.
This seems to me to be a comparatively comprehensible summary of the recognition process, and it is well-tied to the practical implications that derive from it.
Bruce
Bruce,
I think you should make your own contribution, by reading the article linked below regarding Hidden Markov models and summarizing it for the rest of us!
http://www.cs.brown.edu/research/ai/dynamics/tutor...
Well OK! SpeechCompute should hold its collective breath because this shouldn't take long.
Bruce
PS: Maybe you could let your breath go once or twice -- I'm trying to distill it down to one dynamic equation and accompanying animated graph for the sake of brevity.
Good luck. Just keep in mind that this is a general explanation and DNS uses variations on the general theme. That is, the way they are applied in DNS is not exactly identical to the information in the article.
Chuck
Thanks for the warning, but I think we moved out of the literal zone two bumps or so back.
Bruce
Matt Chambers wrote:
Bruce,
I think you should make your own contribution, by reading the article linked below regarding Hidden Markov models and summarizing it for the rest of us!
http://www.cs.brown.edu/research/ai/dynamics/tutor...
I think it says that if you're hiding Markov and Viterbi finds out, you are in deep Baum-Welch juice. But don't quote me on that; I could need another gallon of Scotch to make sure.
Gosh Darn it, Skip! That's EXACTLY what I was going to say -- to a phoneme! Not fair when the admin jumps on contributors' turf.
Oh well, guess it's time to get back to work. Now who took my Racing Form? I mean, stock market listings?
Bruce