Speech recognition is a method of converting voice data into text. It is also commonly called speech-to-text or, more precisely, speech transcription.
Normally, an audio recording is fed to a speech recognition engine, which then tries to transcribe the audio into its text representation.
Humans reach a speech recognition rate of around 95%. This is also the measure against which all speech recognition engines are compared. Two terms are commonly used:
WRR (word recognition rate) and
WER (word error rate).
WRR shows how accurate the recognition is (remember, humans reach around 95%).
WER shows how many errors the engine makes; the two are complementary:
WER = 1 - WRR.
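As an illustration (a common convention, not necessarily the exact scoring used at M-AILABS), WER is typically computed as the word-level Levenshtein distance between the reference transcript and the hypothesis, divided by the number of reference words:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming; works on any sequences
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[m][n]

def wer(reference, hypothesis):
    # word error rate: word-level edit distance / number of reference words
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

# Example using one of the training samples shown further below:
print(wer("You may thus study them at your leisure",
          "You may thes thudy them at yer leisure"))  # 3 substitutions / 8 words = 0.375
```

WRR then follows directly as 1 - WER (0.625 in this example).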
Most traditional speech recognition engines are based on word-recognition, i.e., they try to recognize complete words.
At M-AILABS, we have developed a different engine: it is based on character-recognition instead of word recognition and composes the words by adding characters together.
This has the benefit of having a higher granularity and thus higher probability of correct recognition.
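Composing words from per-frame character predictions can be sketched with a CTC-style greedy decoder, a common approach for character-level models (an illustrative assumption, not necessarily the exact M-AILABS decoder): repeated symbols are collapsed and blank symbols removed, so the remaining characters spell out the words.

```python
def greedy_decode(frame_outputs, blank="_"):
    # CTC-style greedy decoding: collapse repeated symbols, then drop blanks,
    # so per-frame character predictions compose into words
    decoded = []
    prev = None
    for ch in frame_outputs:
        if ch != prev and ch != blank:
            decoded.append(ch)
        prev = ch
    return "".join(decoded)

# Per-frame outputs "hh_ee_lll_llo__" collapse to the word "hello"
print(greedy_decode("hh_ee_lll_llo__"))  # hello
```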
It also has the major benefit of allowing spell-checking (along with other methods such as stemming) to be applied after initial recognition as the final recognition step.
Traditional engines are word-based, and a spell-checker only verifies whether a word is written correctly. Since every word a traditional engine outputs is already a correctly spelled word, a spell-checker adds no benefit there.
In our case, applying a spell-checker as a final step allowed us to increase the recognition rate massively.
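A minimal sketch of such post-recognition spell-checking, assuming a toy vocabulary and nearest-word correction by character edit distance (the actual M-AILABS pipeline is more elaborate):

```python
def edit_distance(a, b):
    # character-level Levenshtein distance
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)
    return dp[m][n]

# Hypothetical toy vocabulary; a real system would use a full dictionary.
VOCAB = ["you", "may", "thus", "study", "them", "at", "your", "leisure"]

def spell_correct(sentence, vocab=VOCAB):
    # replace each out-of-vocabulary word by its nearest dictionary word
    corrected = []
    for word in sentence.lower().split():
        if word in vocab:
            corrected.append(word)
        else:
            corrected.append(min(vocab, key=lambda w: edit_distance(w, word)))
    return " ".join(corrected)

print(spell_correct("you may studdy them"))  # "you may study them"
```

Because the engine emits characters, its mistakes look like misspellings (small character-level distortions), which is exactly what a spell-checker is good at repairing.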
Currently, initial experiments show a
WRR of close to 98% (2-2.5%
WER) when spell-checking is performed. The raw recognition rate is around 90-92%, which we increase using the spell-checker.
M-AILABS is currently experimenting with grammar analysis in order to provide variations of the transcribed text, each with a variation-specific confidence level.
Our engine is based on the Deep Speech 2 paper by Baidu Research's Silicon Valley AI Lab, which is (as of now) state-of-the-art technology.
Here are some samples taken during training of the English-language speech recognition engine (very early stages):
Original Utterance (66): He was quite peremptory both in look and voice The chill of Missis
Decoded Text (62): I was quite e ontery both en book can vos the chill in Missis

Original Utterance (83): I knew I was in a small room and in a narrow bed To that bed I seemed to have grown
Decoded Text (82): I knew I was in a small room and in a narrow bed to that bed I seeme to have grown

Original Utterance (39): You may thus study them at your leisure
Decoded Text (38): You may thes thudy them at yer leisure

Original Utterance (81): Hoo doesnt comprehend the Union for all that Its a great power its our only power
Decoded Text (81): Hho doesnt comprehend the Union for all that Its a great power its our only power

Original Utterance (78): No God be praised he slowly answered Alice lives to bless some good man's life
Decoded Text (78): No God be praised he slowly answered Alice Lives to bless some good man's life

Original Utterance (16): Chapter eighteen
Decoded Text (16): Chapter eighteen

Original Utterance (18): I'm not a polliwog
Decoded Text (17): I'm not a pollyog

Original Utterance (38): I'm chief cook for that old horror Zog
Decoded Text (35): I' chief cook o that oll horrers az

Original Utterance (132): In an instant there was a flash of light within and then the dimly outlined shadows of a woman moving from behind the linen curtains
Decoded Text (132): In an instant there was a flash of light within and then the dimly outlineshadows of a woman moving from behind the linneng curtains

Original Utterance (109): Margaret could not sit still It was a relief to her to aid Dixon in all her preparations for Master Frederick
Decoded Text (109): Margaret could not sit still it was a relief to her to aid Dixon in all her preparations for Master Frederick
As you can see, some entries contain major errors. This is because some of the test data has heavy background noise and some less so, and the engine was still training. The really bad results come from audio with so much background noise that even a human has trouble understanding what is said.
In such a case, human
WER is similar to what the engine shows in these early stages. Since the engine gets only one chance to "listen" to the audio, we can only compare it to a human result where the human, too, has listened to the audio file just once.
Please contact us if you need further information.