Speech Recognition

Speech Recognition is the process of converting spoken audio into text. It is also very often called Speech-to-Text or, more precisely, Speech Transcription.

Typically, an audio recording is fed to a speech recognition engine, which then tries to transcribe the audio into its text representation.

Humans reach a speech recognition rate of around 95%, and this is the benchmark against which all speech recognition engines are measured. Two common metrics are the WRR (Word Recognition Rate) and the WER (Word Error Rate).

The WRR shows how accurate the recognition is (remember, humans reach around 95%).

The WER shows how many errors the engine makes (WER = 1 − WRR).
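Concretely, the WER is usually computed as the word-level edit distance (insertions, deletions, and substitutions) between the reference transcript and the engine's output, divided by the number of reference words. A minimal sketch (the function name is ours, for illustration only), applied to one of the decoded samples shown later in this document:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# "thus"->"thes", "study"->"thudy", "your"->"yer": 3 substitutions in 8 words.
wer = word_error_rate("you may thus study them at your leisure",
                      "you may thes thudy them at yer leisure")
wrr = 1 - wer  # WER = 0.375, so WRR = 0.625 for this single utterance
```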

Most traditional speech recognition engines are based on word-recognition, i.e., they try to recognize complete words.

At M-AILABS, we have developed a different engine: it is based on character recognition instead of word recognition and composes words character by character.

This has the benefit of higher granularity and thus a higher probability of correct recognition.

It also has the major benefit of allowing spell-checking (along with other methods such as stemming) to be applied after initial recognition as the final recognition step.

Traditional engines are word-based, and spell-checkers only verify whether a word is written correctly. Because every word a traditional engine emits is already spelled correctly, a spell-checker adds no benefit there.

In our case, by using (as a final step) a spell-checker we were able to massively increase the recognition rate.

Currently, initial experiments show a WRR of close to 98% (2–2.5% WER) when spell-checking is performed. The raw recognition rate is around 90–92%, which we then increase using the spell-checker.

M-AILABS is currently experimenting with analyzing grammar and providing variations of the transcribed text, each with a variation-specific confidence level.

Our engine is based on the Deep Speech 2 paper by Baidu Research – Silicon Valley AI Lab, which is (as of now) the state of the art.

Here are some samples from training of English-language Speech Recognition (very early stages):

Original Utterance (66): He was quite peremptory both in look and voice The chill of Missis
      Decoded Text (62): I was quite e ontery both en book can vos the chill in Missis

Original Utterance (83): I knew I was in a small room and in a narrow bed To that bed I seemed to have grown
      Decoded Text (82): I knew I was in a small room and in a narrow bed to that bed I seeme to have grown

Original Utterance (39): You may thus study them at your leisure
      Decoded Text (38): You may thes thudy them at yer leisure

Original Utterance (81): Hoo doesnt comprehend the Union for all that Its a great power its our only power
      Decoded Text (81): Hho doesnt comprehend the Union for all that Its a great power its our only power

Original Utterance (78): No God be praised he slowly answered Alice lives to bless some good man's life
      Decoded Text (78): No God be praised he slowly answered Alice Lives to bless some good man's life

Original Utterance (16): Chapter eighteen
      Decoded Text (16): Chapter eighteen

Original Utterance (18): I'm not a polliwog
      Decoded Text (17): I'm not a pollyog

Original Utterance (38): I'm chief cook for that old horror Zog
      Decoded Text (35): I' chief cook o that oll horrers az

Original Utterance (132): In an instant there was a flash of light within and then the dimly outlined shadows of a woman moving from behind the linen curtains
      Decoded Text (132): In an instant there was a flash of light within and then the dimly outlineshadows of a woman moving from behind the linneng curtains

Original Utterance (109): Margaret could not sit still It was a relief to her to aid Dixon in all her preparations for Master Frederick
      Decoded Text (109): Margaret could not sit still it was a relief to her to aid Dixon in all her preparations for Master Frederick

As you can see, some entries contain major errors. This is because some of the test data has heavy background noise while other parts have less, and the engine was still training. The really bad results come from audio with so much background noise that even a human has problems understanding what is being said.

In such cases, human WER is similar to what the engine shows at these early stages. Since the engine gets only one chance to “listen” to the audio, it can fairly be compared only to a human who has likewise listened to the audio file just once.

Please contact us if you need further information.