Performance improvement in Nuance SDK 9.0
Hello Everybody,
I am new to the Nuance SDK and am trying to learn the ropes. We have the Nuance SDK 9.0 medical software installed. I am trying to create profiles (enroll users) and improve the recognition rate of the speech engine. I am using the standard tools provided with the Nuance SDK. The installation is a "Thin Client Version", meaning I cannot use DragonCorrectionObject.
To improve the performance of the speech engine, we can adapt the acoustic model and the language model. I created a profile using 5 hours of speech from a speaker. This was the initial profile. The documentation says that as more data (speech/text) is supplied, the engine will learn and improve its performance.
Now to improve the acoustic model this is what I did:
I have speech files and their corresponding transcriptions. I used the profile to transcribe all the speech files and generated Nuance .DRA files (binary files containing the transcription and the WAV data). Since I already had the correct transcription (nearly correct; it is not a verbatim record of what is in the speech file), I used it to correct the acoustic model with the "batch correction" utility. MY UNDERSTANDING is that the "batch correction" utility corrects the .DRA files and then updates the speaker profile. I did this for a large number of files (about 300, approximately 75,000 words). This, I thought, was supervised learning for the engine. Hoping that the profile was now better, I tested it again on the same set of files used originally. I found no difference in the number of words recognized correctly.
Nuance also provides the "efenroll" utility to train a profile. I took the initial base profile and trained it with up to 10 more hours of data (speech + text pairs). In this case too, I found no improvement in the performance of the profile.
Why is there no improvement in the performance?
Is the above procedure wrong?
Is the tool used wrong?
What should I do to improve the acoustic profile?
Any suggestions will be very helpful.



an_k3 wrote: Hello
First of all, this is an end-user forum. You will not get any support or help on this or any of the other end-user forums with regard to using the SDK. Generally, when you purchase the SDK, you should also be purchasing a support package. Only Nuance developers can help you specifically with problems related to the SDK itself. I did the conversion of L&H Voice Xpress and Dragon NaturallySpeaking SDKs for versions 5 and 6, but I have not worked directly with the SDK for some time. Although I understand the features that you're addressing, I need a clearer explanation of what you're doing in order to assist you.
Second, you need to be clear on exactly what you're doing. I may be able to help you, but your descriptions are far too general for me to completely understand what it is that you're doing. I need a step-by-step explanation in technical terms.
Third, you're overtraining. Overtraining will not improve the performance (accuracy) of DNS. In fact, in many cases it has exactly the opposite effect. If you are adding files to be analyzed, you are not training the Acoustic Model; you're training the Language Models.
If you can be more specific (step-by-step) in terms of describing exactly what you're doing technically, I may be able to help you understand what's going on and why you're not getting any performance improvement. However, I would still say that you're overtraining.
Chuck Runquist
Former Dragon NaturallySpeaking SDK & Senior Technical Solutions PM for DNS
If the answer is wrong it is because the question was wrong. Unknown - ancient Chinese saying
Chuck, I am new to the SDK
Chuck,
I am new to the SDK team. The SDK support contract has expired and we have not yet renewed it, hence the post here. Sorry for the trouble and the long post.
I will try to be more clear in the description of the problem.
Objective: To create a profile of a user and improve its accuracy to the maximum possible extent.
Sorry if I am repeating some of the basic steps that you may know already. This is just to clarify the thought process behind the approach adopted.
Tools used: DNS 9.0 tools supplied along with the installation.
1. Creating a profile:
Tool used: efenroll.exe
About 5 hours of TXT & WAV file pairs are required to create a profile. The TXT files I have are not exact verbatim transcriptions of the speech files. The efenroll utility does not require them to be. By some internal mechanism, it creates a first-pass transcription of each speech file and uses that as the transcription of the speech file supplied. The text files supplied during training are used to update the language model.
I collected the required data (5 hours) and supplied it to the efenroll.exe tool. The profile was created.
I tested the profile with a test set of about 25 speech files. The performance of the profile generated in this first step is poor. The goal now is to improve the performance of the profile. Two options exist: improve the acoustic model, or improve the language model.
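To put a number on "poor" that can be compared before and after each adaptation run, word accuracy can be computed from a word-level edit distance between the reference TXT and the recognized text. The following is a minimal sketch in Python, not part of the Nuance SDK; the function names are hypothetical, and it assumes both transcripts are available as plain text. Because the reference TXT files here are not verbatim, the absolute figure will be pessimistic, but the change between runs is still meaningful.

```python
def tokenize(text):
    """Lowercase and split on whitespace; punctuation handling is left out."""
    return text.lower().split()


def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_word_count) at the word level.

    The edit distance counts substitutions, insertions and deletions
    needed to turn the reference word sequence into the hypothesis.
    """
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1], len(ref)


def word_accuracy(pairs):
    """Aggregate accuracy over (reference, hypothesis) text pairs."""
    errors = total = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        total += n
    return 1.0 - errors / total if total else 0.0
```

Running `word_accuracy` over the same 25 test files after every profile update gives one figure per run, which makes "no change" versus "small change" easier to tell apart.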
2. Improving the Acoustic Model.
Tools used: batchcorrection.exe, complex.exe
Acoustic model performance can be improved using either the thin client or the thick client version of the DNS installation. In the thick client version, DragonCorrectionObject is used to correct any mistakes in the transcription. In the thin client version, some external editor is used to correct the transcription, so the corrections made to the transcription are not fed back to the engine. I use a thin client version.
2.(a) Adaptation process: for the thin client
Documentation:
"you can use the corrected transcript to update the speaker's user profile. Updating the speaker's profile increases Dragon's accuracy for that speaker's subsequent transcriptions."
"your application can update a user's profile by re-running EFEnroll."
"After the transcriptionist has corrected the transcription and the speaker has approved it, you update the speaker's user profile using the EFEnroll methods and the new .WAV/.TXT file pair that was returned by the correction client"
So I understand that to adapt the profile we just need to rerun efenroll.exe with the corrected transcript and the WAV file. Since the WAV & TXT file pairs were correct in the first place, I can just supply them again for retraining. IS THAT CORRECT?
Alternatively, I PRESUMED that the requirement is a corrected WAV & TXT pair of files. So instead of supplying the same old test set of WAV & TXT pairs, I collected a new set of WAV & TXT pairs and gave it to the efenroll.exe utility together with the profile generated in step 1. I expected improved performance (accuracy). IS THIS ASSUMPTION CORRECT?
The performance of this updated profile on the same set of 25 test files was exactly the same. I retrained the profile incrementally with 1 hour of WAV & TXT pairs at a time, going up to 10 hours and more. Still there was no improvement in accuracy. SO WHAT MISTAKE AM I MAKING?
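For the incremental retraining described above, it helps to build the 1-hour batches reproducibly, so that each retraining run and each accuracy measurement can be tied to a known set of files. The sketch below only groups WAV & TXT pairs by audio duration; it does not call efenroll.exe or any Nuance API. The function names are hypothetical, and it assumes uncompressed PCM WAV files (which Python's `wave` module can read) with the TXT file sharing the WAV file's base name and lowercase extensions.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def paired_files(folder):
    """Return sorted (wav, txt) pairs; WAV files without a TXT are skipped."""
    pairs = []
    for wav in sorted(Path(folder).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            pairs.append((wav, txt))
    return pairs


def batches_by_duration(pairs, batch_seconds=3600.0):
    """Group pairs into consecutive batches of at least batch_seconds of audio.

    The last batch may be shorter if the data runs out.
    """
    batches, current, elapsed = [], [], 0.0
    for wav, txt in pairs:
        current.append((wav, txt))
        elapsed += wav_seconds(wav)
        if elapsed >= batch_seconds:
            batches.append(current)
            current, elapsed = [], 0.0
    if current:
        batches.append(current)
    return batches
```

Keeping the 25 test files out of every batch ensures the accuracy figure is always measured on audio the profile has not been adapted on.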
I thought that the engine was not learning where the errors are and was making the same mistakes again.
Issue: How do I let the engine know where it has made a mistake, so that it can learn? That is, how do I tell the engine, for each time segment of the speech file, the correct word that it corresponds to?
Procedure adopted.
2.(b) Adaptation process: like a thick client
Documentation:
"Using a thick correction client allows you to correct draft transcripts and adapt user profiles with those corrections easily. However, using a thick client means that your transcription will be contained in Dragon's concordance file, called a .DRA file,"
"After the transcriptionist has corrected the draft transcription and the speaker has reviewed and accepted the corrected transcript, you can use the corrected transcript to update the speaker's user profile. Updating the speaker's profile increases Dragon's accuracy for that speaker's subsequent transcriptions."
"In this step of the transcription workflow, the workflow component of your transcription workflow application sends the corrected text, the speaker's user profile, the original wave data, and timing information to the adaptation server. The adaptation component takes this information and updates the speaker's profile. The necessary .WAV data, text, and timing information is contained in a .DRA file if your system uses thick correction clients, and in separate .WAV, .TXT, and .IDX files if your system uses thin correction clients."
From this I understand that the .DRA file holds the transcription, the speech and the timing information. Since I use a thin client, I do not have a .DRA file and need to create one. So I used the "complex.exe" utility supplied.
Description of the tool "complex -- The complex sample shows how to use the DgnDictCustom object to transcribe text from a .WAV file. It also demonstrates how to play back the speech data that is captured during dictation, and how to save and load the speech data from different dictation sessions"
The utility gives a .TXT file and a .DRA file as output.
I used this tool to create .DRA files for 1 hour of speech. So I now had the speech file, the TXT file and the .DRA file. Since I have the .DRA file, it is as if I am now working with a thick client using DragonCorrectionObject. Whatever correction I make will automatically be written into the .DRA file. The .DRA file can then be used for adaptation.
Problem: I still do not have a tool implementing DragonCorrectionObject. So how do I correct the transcription at the right places?
A tool called batchcorrection.exe is supplied.
Documentation:
"With the Batch Correction command-line utility, you can compare transcribed text to the .DRA file created during transcription. The Batch Correction utility automatically corrects the dictated text stored in a .DRA file by comparing it to a reference text stored in a plain text file. If corrections or new words (words not in the current vocabulary) are found, Batch Correction will update the speaker's profile (the speaker's vocabulary and the acoustics.) For more information on .DRA files, see Adaptation Overview.
The Batch Correction utility does the following:
1. Loads a speaker (or uses the currently loaded speaker).
2. Loads a specified .DRA file.
3. Loads the corrected reference text.
4. Compares the text contained in the .DRA file with the specified reference text and finds any differences between the two.
5. When a difference is found, it replaces the selected string in the .DRA file using the QuickCorrect method. Revisions are not updated.
6. When all error regions have been handled, Batch Correction updates the speaker's profile (the vocabulary and acoustics) with the corrections and any new words it finds."
THE POINT I CONSIDER IMPORTANT IS POINT NO 6. "updates the speaker's profile (the vocabulary and acoustics) with the corrections"
WHAT DOES "Revisions are not updated" IN POINT 5 ABOVE MEAN?
Strategy: The speech file is available. Run the complex.exe tool on it. This creates a .DRA file and a TXT file. Ignore that TXT file, since I already have a nearly correct (not verbatim) TXT file for the speech file. Now run the batchcorrection.exe tool, supplying it the .DRA file and the nearly correct TXT file. Batch correction will correct the .DRA file and also update the speaker's vocabulary and acoustics, by which I understand that the acoustic profile will be updated. IS THAT CORRECT?
I performed the above procedure with many files. I assume that since the supplied TXT file is not verbatim, there may be some mistakes, but many corrections will still be made. The tool does report something like "40 corrections found and corrected, 5 failed".
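To gauge how many differences a comparison between the recognized text and the non-verbatim reference is likely to turn up, the two texts can be diffed word by word outside the SDK. This is only an illustration of what a difference region looks like, using Python's standard `difflib`; it is not how Batch Correction is implemented, and `error_regions` is a hypothetical helper name.

```python
import difflib


def error_regions(recognized, reference):
    """List word-level differences between recognized and reference text.

    Each entry is (kind, recognized_words, reference_words), where kind is
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    rec = recognized.split()
    ref = reference.split()
    matcher = difflib.SequenceMatcher(None, rec, ref, autojunk=False)
    regions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            regions.append((tag, " ".join(rec[i1:i2]), " ".join(ref[j1:j2])))
    return regions
```

A large share of long `insert` or `delete` regions would suggest that the reference text departs from what was actually spoken (edited or reformatted reports, for example), rather than that the engine misrecognized those words.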
Happy, now I retest the newly updated profile with the same set of 25 files.
ALAS! no change in word accuracy!
So this is in detail what I tried to do using the tools supplied.
Is the logic used correct?
Is there somewhere I am going wrong?
If this is not the correct way to do it, is there some other procedure that can be adopted?
Thanks in advance for your suggestions and assistance.