Vista Speech - Updated Comments (Vista build 5384)
General
Microsoft speech recognition software is much improved compared with the previous versions. It still fails, however, to meet the minimum requirements for professional dictation including adequate speech recognition accuracy and the availability of specialized vocabularies. The following is a specific critique of Vista Speech. It is understood that this is still beta software; however, this beta (build 5384) is supposed to contain the basic feature set of the production speech software.
Audio Hardware
Weaknesses
1. NT/Windows Server 2003 device drivers are typically incompatible with Vista. Very few Vista drivers for sound converters are available at this time.
Audio Input Window
Strengths
1. Ability to select sound adapter.
2. Ability to select input type, that is: Microphone, line in, or digital.
3. Option for setting audio level manually.
Weaknesses
1. It is unclear from the VU (volume unit) display where the optimum audio setting should be; for example, at the middle or upper limit of the green display.
2. What would appear to be the optimum audio setting on this display doesn't agree with that of the microphone level window reached through Control Panel, Advanced Speech Options, Speech Recognition.
3. The low end sensitivity of the VU is inadequate to show ambient electrical and acoustic noise levels.
4. There is apparently no automatic volume control.
5. There is no frequency spectrum display to provide relative indications of signal/noise amplitudes.
Recognition Engine (Microsoft Speech Recognizer 8.0)
Strengths
1. Recognition accuracy is significantly improved compared with previous Microsoft speech recognition engines.
Weaknesses
1. Recognition accuracy is still far behind that of the current leading speech recognition software, NaturallySpeaking.
2. Speech recognition processing is slow, even on a high performance computer system.
3. Recognition accuracy is unusually sensitive to audio volume settings.
4. New words that are trained or existing words that are re-trained are frequently not then recognized correctly. This is one test of recognition accuracy. The Microsoft Speech Recognizer does poorly in this test compared with NaturallySpeaking.
Speech Recognition Training Window
Weaknesses
1. Text to be dictated is displayed in short sentences or sentence fragments rather than paragraphs. This disrupts the normal pacing of dictation.
2. There is no indication of progression of the dictation; for example, highlighting or graying out of text as successive words are recognized.
3. There is no indication of the success of the dictation. You can dictate phrases that are completely different from the displayed text and the program proceeds to the next display without any indication of there having been a recognition problem.
4. There is no ability to back up, repeat or skip mis-recognized text.
5. There is no VU display in the training window.
6. There is no user selectable list of choices for additional training after the introductory training.
7. There are no user selectable specialized training texts; for example, business letters or medical reports.
Dictation
Strengths
1. Full-capability dictation into many application programs.
2. Ability to pop-up the correction window by speaking "correct" followed by the phrase to be corrected or by highlighting it and commanding "correct that", meaning correct the highlighted text.
3. "Scratch that" meaning to delete the most recently dictated phrase.
5. Various commands for navigating through text.
Weaknesses
1. There are limitations of dictation into some "Windows standard" textbox controls.
2. There is no control key selection of command or dictation modes.
3. No microphone on/off by control key press.
4. No control key press for selection of post dictation spelling and grammar checks.
5. No option for vocabulary switching.
6. No user selectable, context sensitive control of abbreviations and number formatting.
Correction Window
Strengths
1. Correction window can be displayed by highlighting or voice selecting the text to be corrected.
2. Errant phrases are numbered if more than one instance appears in the text, facilitating selection of a specific phrase to be corrected.
3. The lists of alternate phrases contain generally appropriate possibilities.
4. Additional alternates can be displayed by re-dictating the errant phrase.
5. Voice spelling of a new term is well designed and usually works properly.
Weaknesses
1. No user option to re-train a mis-recognized word that is in the main vocabulary.
2. No way to type in a new word - it must be voice spelled.
3. No way to re-check the accuracy of a mis-recognized word that has just been re-trained.
4. No way to train a phrase (as opposed to a single word).
5. No way to re-train both the corrected word or phrase and the original mis-recognized phrase.
6. No way to specify and train both actual spelling and "spoken as" representations of words or phrases.
Vocabulary
Weaknesses
1. Entries are limited to single words.
2. No way to specify and train both actual spelling and "spoken as" representations of words or phrases (as above).
3. No capability to display, search, sort, edit, add, delete and train any word or phrase in the main vocabularies. There is limited editing capability for the user vocabulary only.
4. No current availability of specialized vocabularies; for example, legal or medical.
5. No user option for adding specialized vocabularies.
Utilities
Weaknesses
1. No option for backup and restore of user (training, options, etc.) and vocabulary files.
2. No option to add, delete, edit and execute user developed macros.
3. The option for processing "typical" user documents is limited to files stored in My Documents. The user has no control over the choice of directories or the specific files to be screened.
4. Typical document screening lacks the important functions of identifying and listing by frequency of occurrence the new words that are located in the documents. There are no user options for adding and training the new words.
5. Testing of the document screening from a custom SDK function did not result in any improvement in recognition accuracy.
SDK/SAPI 5.3
Strengths
1. Extensive set of APIs.
Weaknesses
1. No backward compatibility with SAPI 4.
2. SAPI 5.3 is only available for the Vista platform. Microsoft has decided, at least as of this date, not to supply a NT/Windows Server 2003 SAPI 5.3 based SDK.
3. The automation versions of many of the SAPI 5.3 functions are still not available. Some may never be implemented.
4. There are still significant bugs in multiple Microsoft provided SAPI 5.3 based automation functions and utilities.


Thanks for the detailed report, Robbie.
I wanted to post a couple of comments on your comments...
There is a system-wide hotkey to turn speech recognition's microphone on and off. It's .
You can use the Easy Transfer Wizard to back up and restore your speech profile, although it's a bit cumbersome to do that. We'll likely have a power toy that you can download from microsoft.com to back up and restore your profile sometime shortly after Vista is widely available.
If you include more directories in your search index, those files will also be included in the system's "harvesting" of your vocabulary and language models...
I'm curious what function you tried to use in your "custom SDK function did not result in any improvement in recognition accuracy"?
It's true, SAPI 5 is not backwards compatible with SAPI 4, but that break in app compat happened back when we shipped XP. Is there a specific thing you liked about SAPI 4 that you haven't yet found a way to accomplish using SAPI 5?
SAPI 5.3 is only a slight superset to SAPI 5.1 (which was included in Windows XP), so the thinking is that it's fairly straightforward to support both XP and Vista at the same time using the COM interfaces.
Is there a specific interface that we haven't yet implemented automation support for in SAPI 5.3 that you'd like to see?
Could you send me a list of the bugs that you've encountered with SAPI 5.3 in Vista? I'd be happy to make sure that they get logged in our bug database, and I'll see what I can do to get them addressed if we have sufficient time.
-----------
Rob Chambers
Architect - Microsoft Corporation
Windows Speech Recognition ... "We're Listening..."
This posting is AS-IS and provides no warranty and confers no rights.
I think that one of the other features that needs to be made more explicit is that a correction of a misrecognition should target the misrecognized word rather than the phrase. At least, this is what I was advised in the beta user group, and I think it important for those who are transitioning to Vista from DNS to be aware of this difference. It is hard, at first, to get used to calling only for the word rather than the phrase, but it does seem to cause Vista to get the word right afterwards, whereas correcting it as part of a phrase does not.
Rob and the others have explained for some time that they are seeking the more general market, and I am finding that as I use MS Speech more, I am getting results that are, at least, comparable to my DNS. Robbie is right that the microphone setting process is the least intensive of any I can remember. It is, essentially, click a check box and dictate two sentences with no feedback other than a finish button at the end. Yet it does seem to work, and it seems to work with lesser quality microphones as well. This, too, would seem unsettling to the veteran user but may be more acceptable to a broader audience.
MS Speech is, in some ways, more friendly to the user who needs almost total voice control, and for the general user who doesn't want to purchase DNS Pro and build macros, it may be a better choice. However, I do wish it had a macro building capacity (something we have been promised at a later date) and I would really like something like the Dictation Box that would allow MS Speech to be used in the non-MS API programs.
Frank Abbott
Hey Frank,
Correcting either the word or the phrase will “teach” the MS SR engine to do a better job in the future. If you have to resort to spelling, though, it’s best if you’re doing it one word at a time; that way, the UI will give you the chance to add the words to the vocabulary. Sometimes the engine can do a better job of learning if you correct a single word at a time, though, it’s true.
I’ve heard from many people that the accuracy is pretty similar between DNS and Microsoft’s latest SR engine included in Windows Vista. I have heard from a minority of people that the accuracy isn’t quite as good in Vista, and from a similar number of people that the accuracy is slightly better in Vista. It’ll be interesting to see the feedback from the broader beta that’s now underway with Windows Vista Beta 2.
I’m personally in charge of designing the underpinnings of the new macro facility that we should be able to have out to customers shortly after Vista is widely available. One such macro that I’ve created is something similar to the dictation box you’re referring to. I can say “Insert text”, and up pops a dialog, with full SR control over the dialog and the edit control. Once I’m happy with what I’ve said, I can simply say “OK”, and it’s inserted into the document…
However, if you wanted a slightly different macro, one that also allows you to say “Insert text blah blah blah”, you could even pre-populate the dialog with what you want to insert. Then you just make the corrections (if any are necessary), and then say “OK”. This will all be relatively straightforward to build using our new macro UI, or you can even just pop open Notepad, and create a simple little XML file (once you learn the XML schema), and you’ll have more options available to you than you would with the UI.
One of my favorite macros we have working as a proof of concept right now is “Replace [textInDocument] with [random dictation]”. This is something someone at the Boston Voice Users group (I think it was Kim Patch, but I might be misattributing the request) wanted badly and had on the BVUG top 10 list. We’ve taken that feedback, and lots of other feedback into consideration when designing the new macro facility. It should end up being quite flexible.
--
Rob Chambers [MSFT]
http://blogs.msdn.com/robch/default.aspx
Architect - Windows Speech Recognition - We're Listening...
This posting is provided "AS IS" with no warranties, and confers no rights.
Rob--
Thanks for the answer. I am glad that we can still correct in phrases because, try as I might, I still found it difficult to try to do it a word at a time. Old habits do die hard. The dictation box macro seems to be just what I am looking for. I use a note program for short notes that I want to file--Evernote--that would be much more useful if I could file my notes by voice. Will you let some of us have a crack at the macro when it's available?
Several betas ago, we had to turn off the document harvesting because it would cause a crash when booting up the speech recognition program. Has the beta2 corrected that issue? I have been hesitant to test the harvesting because of the earlier problems, but it did seem to provide a much quicker way to a good dictionary.
Thanks again.
Frank Abbott
Hi Frank,
As soon as the macro tool is ready for beta testing, I'll definitely post something here on this board to recruit beta testers. We're not 100% sure when that'll be, but it should be toward the end of this year.
I believe we've fixed all known issues with the crashes in the document harvester. I'd love for you to turn it back on, and help us find out if that's true.
If you do get a crash, be sure and click the "Send report" button so we can see it back here in Redmond. We do look at those crash reports, and it drives us fixing bugs that we haven't reproduced locally.
--
Rob Chambers [MSFT]
http://blogs.msdn.com/robch/default.aspx
Architect - Windows Speech Recognition - We're Listening...
This posting is provided "AS IS" with no warranties, and confers no rights.
Hi Frank,
The issue of correcting a word rather than a phrase is an interesting one. Current speech recognition technology is based on a combination of recognizing phonemes and the statistical likelihood that a group of words will occur in a specific pattern. Word speech training is intended to improve phoneme recognition, but, in itself, does not alter the statistical model. Dragon's approach of phrase training has multiple benefits. It corrects the phoneme recognition and it provides a means of updating the statistical model. This approach works very well in practice and makes it possible to avoid repeating many errors. A simple example might be a name like "Neil Smith" that, in a given user's typical documents, should be "Neill Smith". It is easy to train Dragon to display "Neill" instead of "Neil" if the next word is "Smith".
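The "Neill Smith" case can be illustrated with a toy word-pair count. This is only a minimal Python sketch under stated assumptions: the function names and the tiny corpus are hypothetical, and neither Dragon nor Vista Speech is claimed to work this way internally.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent word pairs found in the user's own documents."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

def pick_spelling(candidates, next_word, bigrams):
    """Choose the homophone seen most often before next_word.

    Ties (including "never seen") fall back to the first candidate,
    which stands in for the engine's default spelling.
    """
    return max(candidates, key=lambda w: bigrams[(w, next_word)])

# Hypothetical user documents
corpus = [
    "Dear Neill Smith",
    "I spoke with Neill Smith today",
    "Neil Armstrong walked on the moon",
]
bigrams = train_bigrams(corpus)
print(pick_spelling(["Neil", "Neill"], "Smith", bigrams))      # Neill
print(pick_spelling(["Neil", "Neill"], "Armstrong", bigrams))  # Neil
```

Word-only training would leave these counts untouched, which is the point being made about phrase training updating the statistical model.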
We have invested a considerable amount of time in Vista Speech, primarily working with SAPI 5.3. There are several reasons for the interest including the fact that the development of NaturallySpeaking appears to have slowed and the cost of the software, especially the SDK, is increasing. It would be nice to have alternate speech recognition software, but Vista Speech isn't even close to having the recognition accuracy and general capability of NaturallySpeaking. The Vista Speech problems that are of primary concern are poor recognition accuracy and the unfortunate, and unnecessary, limitations of the vocabulary. The latter include the lack of phrase capability, the lack of "sounds like but spelled as" functionality, the absence of any means to edit the main vocabulary and the lack of legal, medical or other specialized vocabularies. One would not expect Microsoft to create the specialized vocabularies, but third parties will be reluctant to attempt this unless the overall performance of Vista Speech is satisfactory. I should note that NaturallySpeaking has a great method for building a specialized vocabulary that is completely missing from Vista Speech. Processing one's typical document base with NaturallySpeaking results in a listing of new words that are sorted by frequency of occurrence. The users then have the option of training and adding the new words to their user vocabulary. They can thus easily create a specialized vocabulary that is based on the content of their typical documents. This step occurs before the same document database is used for updating the statistical model.
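The document screening step described here (list the words that are not yet in the vocabulary, sorted by frequency of occurrence) can be sketched in a few lines of Python. The function name, the simple tokenizer and the sample data are assumptions for illustration only; this is not NaturallySpeaking's implementation.

```python
import re
from collections import Counter

def new_words_by_frequency(documents, vocabulary):
    """Return (word, count) pairs for words missing from the vocabulary,
    most frequent first."""
    known = {w.lower() for w in vocabulary}
    counts = Counter()
    for text in documents:
        # Very simple tokenizer: runs of letters and apostrophes
        for word in re.findall(r"[A-Za-z']+", text.lower()):
            if word not in known:
                counts[word] += 1
    return counts.most_common()

# Hypothetical typical documents and existing vocabulary
docs = [
    "Patient reports dyspnea and arrhythmia.",
    "Dyspnea worsened overnight.",
]
vocab = ["patient", "reports", "and", "worsened", "overnight"]
print(new_words_by_frequency(docs, vocab))
# [('dyspnea', 2), ('arrhythmia', 1)]
```

The resulting list is what a user would review before choosing which words to train and add.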
I should mention that we created a Vista dictation control that can be attached to a standard Windows textbox control. It has functionality similar to that of Dragon's DgnDictEdit module. We abandoned this project because the overall problems with Vista Speech cannot be resolved at this time.
Robbie
Hi Robbie,
I'm afraid you might not have spoken with the right people, then, if you think that any of the shortcomings can't be overcome. Because Vista SR has such an extensive and capable API layer, I do believe you can overcome any perceived shortcomings in the built-in software. It's just a matter of time, will, and effort.
If you're interested, I'd love to chat more about what you could do as an ISV to overcome whatever limitations you see in the platform.
I look forward to hearing from you ...
--
Rob Chambers [MSFT]
http://blogs.msdn.com/robch/default.aspx
Architect - Windows Speech Recognition - We're Listening...
This posting is provided "AS IS" with no warranties, and confers no rights.
Hi Rob,
We have communicated with each other in the past and I certainly thank you for your excellent assistance. We haven't had much success with the correction of bugs or with the implementation of new features, but the assistance from the Speech Group has been very useful in working with SAPI 5.3, much of which is still undocumented.
I do agree that SAPI 5+ is a great speech interface and that it provides developers with an extensive range of capabilities.
The principal problems that we are having with Vista Speech are poor recognition accuracy and a vocabulary system that lacks some essential features as have been described above. These problems are separate from those of SAPI5+.
Recognition accuracy depends on many factors, an incomplete list of which includes: audio quality, audio levels, ambient acoustic and electrical noise, computer system processing capability, word pacing, diction, the content of the dictated material, the speech recognition software, the extent to which the statistical model matches the word patterns of a user's typical documents, and the content of the speech vocabularies and statistical models. Vista Speech recognition accuracy is equivalent to NaturallySpeaking or ViaVoice for simple testing like dictating "This is a test of speech recognition" or "The quick brown fox jumped over the lazy dog's back". The problem with testing professional applications is that, in addition to all the variability that is introduced by the manner in which one dictates, it is really not possible to compare the performance of a recognition system that contains a specialty vocabulary and its associated statistical model with that of a system that is designed for general speech recognition.
It is obviously possible to dictate the same text into two speech recognition systems and to add new words and to re-train mis-recognized words as may be required. This process can provide some information on the recognition accuracy of the software although it isn't a fair test of optimized statistical models.
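One common way to put a number on such a side-by-side dictation test is word error rate: the word-level edit distance between the reference text and the recognized text, divided by the number of reference words. The Python sketch below is a generic illustration with hypothetical sample sentences, not the procedure actually used in the tests described in this thread.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions, insertions, deletions)
    divided by the number of words in the reference text."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    # prev[j] = edit distance between the reference words processed
    # so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # word dropped
                           cur[j - 1] + 1,           # word inserted
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)

dictated = "the patient has had dyspnea"
print(word_error_rate(dictated, "the patient is had dyspnea"))  # 0.2
print(word_error_rate(dictated, dictated))                      # 0.0
```

Running the same reference text through both recognizers and comparing the two rates gives a rough comparison, with the caveat already noted that it is not a fair test of optimized statistical models.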
A second useful indicator is whether or not a re-trained word or phrase is subsequently consistently correctly recognized. I thought that Vista Speech did pretty well with the recognition of new or re-trained words in build #5308. For reasons that are unclear at this time, this speech recognition has been poor under build #5384. It is my understanding that there have been some changes in the audio processing software and it is possible that the difference is related to this. We are using the same hardware platform and the performance of NaturallySpeaking is unchanged.
Regarding SAPI5+, there are multiple issues.
I have to start with the SpeechAddRemoveWord function. There is a problem with the fourth parameter in the calling arguments. This bug prevents any usage of the function. It has been known for several months and it is still uncorrected. Specifically,
RC.Recognizer.DisplayUI(Me.Handle.ToInt32, "My App", SpeechStringConstants.SpeechAddRemoveWord, "someword")
fails with the same error message "Value does not fall within expected range".
Other problems of concern are that the two SAPI5 speech libraries contain different functions with different capabilities, many functions that are available for VC have not been implemented for VB, and, for some odd reason, VB functions are frequently missing some of the VC calling arguments or return data. A typical example is "SetAdaptationData(strBuff)" which lacks the critical return information about what happened when the function was executed.
Finally, I would like to comment on post recognition engine data search and replace operations. We have used this approach with NaturallySpeaking for many years. It is helpful for reducing repetitive errors that occur in well defined, unique circumstances. The famous example is correcting NaturallySpeaking's tendency to substitute "is had" for "has had". Real world data screening involves a fair amount of logic because it is typically necessary to examine what precedes and follows an errant phrase.
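A minimal Python sketch of this kind of context-sensitive search and replace follows. The rule table, the function name and the "by" exception are assumptions chosen for illustration; a real screening pass would carry many more rules and more logic about what precedes and follows each phrase.

```python
import re

# Each rule pairs a pattern with its replacement. Lookahead and
# lookbehind assertions inspect the surrounding words without
# consuming them, so only the errant phrase itself is rewritten.
RULES = [
    # "is had" -> "has had", except in passives such as "is had by all"
    (re.compile(r"\bis had\b(?!\s+by\b)"), "has had"),
    # "Neil" -> "Neill", but only when the next word is "Smith"
    (re.compile(r"\bNeil\b(?=\s+Smith\b)"), "Neill"),
]

def post_process(text):
    """Apply every rule, in order, to the recognized text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(post_process("The patient is had chest pain."))
# The patient has had chest pain.
print(post_process("A good time is had by all."))
# A good time is had by all.
```

Keeping the rules in a table makes it easy to add a new one each time a repetitive error with a well defined context turns up.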
Robbie
Hi Robbie,
Thanks for your reply. Sorry this reply is a bit late ...
If you could send more details about what bugs you've encountered and logged (send them to lis...@microsoft.com), I'll try to take a look and see where we are with them. Likely we've fixed many of them, but they didn't get fixed in time to be included in Beta 2. If they're new features, though, (and it's a fuzzy line sometimes) I doubt we've made much progress. Since Beta 2, very little has changed in the feature set until now (in current code in the Vista project). The whole division has been heads down increasing the quality and performance of the product.
I believe most of those "limitations" in the vocabulary system can be addressed with the APIs and additional utilities or "features" shipped by 3rd parties. Perhaps you or someone you know could extend the capabilities of Vista SR in just those ways you indicate. If you have questions on how to implement those types of features using SAPI 5.3, again, you can send email to lis...@microsoft.com. We'll try to respond quickly...
There have been some reports that accuracy declined in 5384, and I can assure you that the core technology team is looking into those reports. I also saw some decrease in accuracy personally. I think that's been addressed, and in current builds the accuracy is back up to where it was at 5308 (and I suspect it'll get a tad better before RTM).
Your specific issue with the DisplayUI call or AddRemoveWords has been addressed (post Beta 2). You should see that in the next official release (if you have access to CTPs, it'll be in the next one).
I do know that the core tech team did not port all the new APIs back to automation interfaces (capable of being called by automation clients, such as VB). However, if you wanted to, you (or someone on your team that's familiar with COM development) could bridge between the COM interfaces and the automation interfaces, and provide that functionality in your own custom automation interfaces. Again, if you have any questions on how to do that, I'd be happy to help. Just send me an email (lis...@microsoft.com) and I'll get your question addressed.
I'm curious about the search and replace comments. Maybe we can follow up offline? I'd like to better understand what you'd like to do that you can't seem to do with Vista. I think we have a way for you to do just what you're looking for, but I'm not 100% sure I understand your scenarios.
--
Rob Chambers [MSFT]
http://blogs.msdn.com/robch/default.aspx
Architect - Windows Speech Recognition - We're Listening...
This posting is provided "AS IS" with no warranties, and confers no rights.
Hi Rob,
I have tried to check the mis-recognitions in a little more detail. Recognition of words in the main vocabulary seems to be about as accurate as it was with the previous beta build. Recognition of new words that were added to the user vocabulary previously was excellent. The words were consistently recognized properly after the initial training, which was very impressive performance. Current tests with two of the same words, dyspnea and arrhythmia, show dyspnea to be correctly recognized about 30% of the time, and arrhythmia is never correctly recognized. This is a very major deterioration in performance.
The hardware is unchanged, NaturallySpeaking runs fine on the same system and the audio sounds nice and clear.
There are some differences in audio behavior. Vista previously required considerably more audio gain to obtain a mid VU reading compared with NaturallySpeaking. Vista gain settings are now similar to those of NS. One point of confusion is that there are major differences between the Vista microphone icon VU display and Vista's other two VU displays.
I would be interested to know what has changed in the audio signal handling. Also, Vista Speech apparently doesn't set any sample rate and word size defaults.
There is an interesting bug in the user vocabulary. I wanted to start with fresh training and deleted all the words in the vocabulary. I exited the program, re-executed it and to my surprise the original word list re-appeared. Deleting only one word from a list that contains more than one word doesn't have this problem.
Finally, I would again like to express my concern that there is no way to re-train a word in the main vocabulary. Words in the user vocabulary cannot be re-trained from the pop-up correction window. You have to go back to the vocabulary editor which is a very cumbersome process.
Robbie
Hi Robbie,
Are you trying to add a large number of user vocabulary words? How many? You may actually be better suited making a new vocabulary. That's not something that's possible yet, but should be by next year...
We didn't actually change how much gain was required. We did, however, change how we displayed our VU meter. It's now a bit more logarithmic.
I say a bit, because it's not exactly that. The core engine is the same, and its required gain settings haven't changed, but we did change what the UI looked like so people wouldn't try to "over adjust" the gain from what the MS SR engine wants.
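For readers wondering what a logarithmic meter means in practice, here is a generic Python sketch that maps a linear signal level to a meter position on a decibel scale. It is not the Vista mapping (which, as noted, is only roughly logarithmic); the function name and the -60 dB floor are assumptions for illustration.

```python
import math

def meter_fraction(rms, floor_db=-60.0):
    """Map a linear RMS level (0.0 to 1.0 of full scale) to a meter
    position between 0.0 (empty) and 1.0 (full) on a decibel scale."""
    if rms <= 0.0:
        return 0.0
    db = 20.0 * math.log10(rms)        # level in dB relative to full scale
    db = max(floor_db, min(0.0, db))   # clamp to the displayed range
    return (db - floor_db) / -floor_db

for level in (0.01, 0.1, 0.5, 1.0):
    print(level, round(meter_fraction(level), 2))
```

On a scale like this, a signal at half of full amplitude already sits around 90% of the meter, which is why the same gain setting can look quite different on a linear display and a logarithmic one.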
I haven't personally heard of the vocabulary bug you mention. I'll send that along to the dev team right away.
Once we ship our macro system (hopefully in early 2007), you could have a macro (or perhaps we'll ship one by default with the macro system) to allow you to re-train from the selection in your document. I've written a similar macro (but it's a Word macro) and posted it on the Yahoo speech group. A similar type of macro could be written easily in our new speech macro language.
--
Rob Chambers [MSFT]
http://blogs.msdn.com/robch/default.aspx
Architect - Windows Speech Recognition - We're Listening...
This posting is provided "AS IS" with no warranties, and confers no rights.