Preparing Training Data

One of the major tasks in machine learning (if not the major task) is preparing the training data. It is tedious work that requires people who like repetitive tasks. Tedious as it is, it is (in our opinion) even more important than knowing which model to use or what software to write to train the system.

M-AILABS has drawn a diagram about the process of machine learning/artificial intelligence. We believe that the Data Labeler and Data Scientist will be more important in the future than the actual Algorithm Designer.

In fact, the Data Labeler will be crucial for the next 20-50 years.

This task requires concentration, understanding and the ability to turn knowledge into information and then into data.

When we started generating large amounts of training data for our speech recognition and speech synthesis software, we realized that it is not just a matter of getting the audio files and “somehow preparing them”.

The task involved:

  1. Removing any jingles
  2. Splitting the large audio files into smaller segments of at most 20 seconds: this requires knowing where to split (specifically, at pauses)
  3. Retrieving the actual text of the audio file (usually as a large RTF or .TXT)
  4. Assigning the actual text-snippets to the appropriate audio files.
  5. QA
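
As an illustration of step 2, here is a minimal sketch of splitting at pauses. It is only an example, not the splitter described in this post: it assumes mono audio that is already decoded into a list of float samples between -1 and 1, and the function names, the loudness threshold and the minimum pause length are all hypothetical choices made for the example.

```python
def find_pauses(samples, rate, threshold=0.02, min_pause=0.3):
    """Return (start, end) sample indices of quiet runs.

    A run counts as a pause if every sample is below `threshold` (assumed
    value) and the run lasts at least `min_pause` seconds (assumed value).
    """
    pauses = []
    run_start = None
    min_len = int(min_pause * rate)
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                pauses.append((run_start, i))
            run_start = None
    # A pause may run until the very end of the recording.
    if run_start is not None and len(samples) - run_start >= min_len:
        pauses.append((run_start, len(samples)))
    return pauses


def split_at_pauses(samples, rate, max_seconds=20.0, **pause_args):
    """Return (start, end) segments, each at most `max_seconds` long.

    Cuts are placed in the middle of the last pause that still fits into
    the current segment. If there is no pause, the cut is forced.
    """
    max_len = int(max_seconds * rate)
    cut_points = [(a + b) // 2 for a, b in find_pauses(samples, rate, **pause_args)]
    segments = []
    start = 0
    total = len(samples)
    while total - start > max_len:
        candidates = [c for c in cut_points if start < c <= start + max_len]
        cut = candidates[-1] if candidates else start + max_len
        segments.append((start, cut))
        start = cut
    if start < total:
        segments.append((start, total))
    return segments
```

Real recordings usually need a smoothed energy measure rather than raw sample values, but the idea of cutting at the last pause before the 20-second limit stays the same.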

Now, how do we assign the text snippets to the correct audio files? There are two options, either:

  1. listen to each audio file, copy the snippet from the text and assign it (in something like a CSV file), or
  2. “somehow” transcribe the audio files to text snippets (requires a transcription engine)

We decided to take the second path. We first transcribed the audio files to text files using a system that we already have. The result is really, really bad, but it is good enough for our QA tool to then select the correct text from the original RTF or TXT file.
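
Selecting the right text from the original for a bad transcription is a similarity problem. The sketch below shows one possible way to do it and is not TRQA’s actual algorithm: it uses SequenceMatcher from Python’s standard difflib module, and the names normalize and best_match as well as the slack parameter are assumptions made for this example.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text, drop punctuation and return the list of words."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return cleaned.split()


def best_match(noisy, original_words, start, slack=3):
    """Find the span of the original text that best matches `noisy`.

    Tries spans that begin at word index `start` and are up to `slack`
    words shorter or longer than the noisy transcription (hypothetical
    parameter). Returns (end, score), where original_words[start:end] is
    the best span and score is the similarity ratio between 0 and 1.
    """
    noisy_words = normalize(noisy)
    noisy_norm = " ".join(noisy_words)
    best_end, best_score = start, 0.0
    shortest = max(1, len(noisy_words) - slack)
    longest = len(noisy_words) + slack
    for length in range(shortest, longest + 1):
        end = start + length
        if end > len(original_words):
            break
        candidate = " ".join(normalize(" ".join(original_words[start:end])))
        score = SequenceMatcher(None, noisy_norm, candidate).ratio()
        if score > best_score:
            best_end, best_score = end, score
    return best_end, best_score
```

A low score is a natural signal for “not sure”: above some threshold the match could be accepted automatically, below it the decision goes to the Data Labeler.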

With each correct transcription we can increase the accuracy of our initial transcription engine and automate the QA tool more and more.

Tools, tools, tools

Over time we have written so many tools for data preparation and conversion that I have lost count. It will happen to you, too. The problem is that these tools also require an understanding of machine learning techniques, so you can’t do without a broad knowledge of all types of machine learning. Especially for speech recognition, you will need to understand NLP, including similarity analysis and more.

Our tool, TRQA (short for “TRanscription Quality Assurance”), helps the Data Labeler. It is written only for the console (GUIs are too slow and cumbersome for this kind of work) and tries to QA transcriptions automatically as much as possible. If it is not sure, it asks the user.

You can see it is really easy to use.

The actual work happens in the bottom lines of the screen. When the tool is not sure about a transcription, it shows the automatically transcribed text in blue, a separator (|), and then the next text in the transcription list (in yellow). On the next line, the Data Labeler sees the recommendation from the original text (in green), a separator (|), and then how the original text continues afterwards.

The Data Labeler can then include more words in the green text by pressing “+” or exclude words by pressing “-”. When the Data Labeler thinks that the selected (green) text matches the transcription, he/she hits “ENTER” and the tool moves on to the next transcription.
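
The “+” and “-” behavior boils down to moving the end of a word span. Here is a minimal sketch of that idea, under the assumption that the original text is held as a list of words. The class name Selection and its methods are hypothetical and only model the logic, not the colored curses display.

```python
class Selection:
    """The recommended (green) span of words within the original text."""

    def __init__(self, words, start, end):
        self.words = words
        self.start = start  # first word of the recommendation
        self.end = end      # one past the last word of the recommendation

    def include(self):
        """The "+" key: pull the next word of the original into the span."""
        if self.end < len(self.words):
            self.end += 1

    def exclude(self):
        """The "-" key: push the last word of the span back out."""
        if self.end > self.start:
            self.end -= 1

    def render(self, context=3):
        """Selected text, the separator, then how the original continues."""
        selected = " ".join(self.words[self.start:self.end])
        following = " ".join(self.words[self.end:self.end + context])
        return f"{selected} | {following}"
```

For example, with the words “it was the best of times it was the worst” and a selection covering the first five words, render() gives “it was the best of | times it was”, and one call to include() moves the separator one word to the right.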

Sometimes there are transcriptions that don’t show up in the original text (such as an intro/outro, or comments that are in the audio but not in the original TXT/RTF). In these cases the Data Labeler can just keep the transcription as-is. Conversely, sometimes there is text in the original TXT/RTF that does not show up in the audio (mostly in interviews). In these cases the Data Labeler can just skip those words.

The Data Labeler can quit the tool at any time. If the current transcription QA task was not completed and the tool is started again, the tool asks the Data Labeler whether he/she wants to continue the last task or start a new QA task. Choosing to start a new QA task doesn’t mean everything is lost for the previous one: if the Data Labeler switches back to the previous task, the tool automatically recognizes that it was not completed and continues it instead of starting from scratch.
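
One simple way to get this kind of resume behavior is to store the progress per task. The following is a sketch under assumptions, not a description of how TRQA stores its state: the state file name, the JSON layout and the function names are all made up for the example.

```python
import json
import os


def state_path(task_file):
    """Hypothetical convention: the state file lives next to the task file."""
    return task_file + ".trqa-state.json"


def save_progress(task_file, next_index, accepted):
    """Store the index of the next transcription and the accepted texts."""
    tmp = state_path(task_file) + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"next_index": next_index, "accepted": accepted}, f)
    # Write to a temporary file first and then rename it, so that quitting
    # in the middle of a save does not leave a half-written state file.
    os.replace(tmp, state_path(task_file))


def load_progress(task_file):
    """Return (next_index, accepted), or a fresh start if nothing is saved."""
    try:
        with open(state_path(task_file), encoding="utf-8") as f:
            state = json.load(f)
    except FileNotFoundError:
        return 0, []
    return state["next_index"], state["accepted"]
```

Because every task has its own state file, switching to another task and back later picks up exactly where the Data Labeler left off.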

If the Data Labeler makes a mistake, he/she can always undo it. Undo goes all the way back to the beginning of the transcription QA task…
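
Unlimited undo is cheap if every decision records the state it started from. A minimal sketch, assuming the session only needs to track the position in the original text and the list of accepted texts (the class and method names are hypothetical):

```python
class QASession:
    """Tracks the QA decisions of one task and lets all of them be undone."""

    def __init__(self):
        self.position = 0   # index of the next unassigned word in the original
        self.results = []   # texts accepted so far, one per audio file
        self._undo = []     # (position, number of results) before each step

    def _remember(self):
        self._undo.append((self.position, len(self.results)))

    def accept(self, new_position, text):
        """Accept `text` for the current audio file and advance."""
        self._remember()
        self.results.append(text)
        self.position = new_position

    def skip(self, new_position):
        """Skip words of the original that have no audio."""
        self._remember()
        self.position = new_position

    def undo(self):
        """Revert the last step. Returns False if nothing is left to undo."""
        if not self._undo:
            return False
        self.position, result_count = self._undo.pop()
        del self.results[result_count:]
        return True
```

Since nothing is ever dropped from the undo list, the Data Labeler can step back to the very first decision of the task.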


We had to make the tool as user-friendly as possible. About 60% of the programming effort for this tool was an investment in user-friendliness. It begins with showing as much information as required but as little as possible. Overloading the screen would confuse the Data Labeler.

Additionally, all commands are single keystrokes. There is no need for “CTRL”, “Command” or the like. Everything is a single keystroke.

Another important lesson: choose your keys carefully. “Dangerous” keys should be far away from each other on the keyboard. It is OK to use “+” and “-” (they are close to each other), because these are not dangerous. But we chose “K” for keeping the auto-transcription and “S” for skipping words from the book, rather than “I” (for ignore) for skipping. “K” and “I” are just too close to each other on German/US keyboards.
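
A rule like this can even be checked automatically. The sketch below is purely illustrative: it assumes the letter rows of a US QWERTY keyboard with the usual stagger, where each row sits slightly to the right of the row above it, and the function names are hypothetical.

```python
# Letter rows of a US QWERTY keyboard, top to bottom. Assumption: standard
# staggered layout, so a key touches the key directly above it and the key
# above and to the right of it.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def position(key):
    """Return (row, column) of a single letter key."""
    if len(key) != 1:
        raise ValueError(f"expected a single key, got {key!r}")
    for row, letters in enumerate(ROWS):
        col = letters.find(key.lower())
        if col != -1:
            return row, col
    raise ValueError(f"not a letter key: {key!r}")


def are_adjacent(a, b):
    """True if the two letter keys are direct neighbors on the keyboard."""
    (row_a, col_a), (row_b, col_b) = position(a), position(b)
    if row_a == row_b:
        return abs(col_a - col_b) == 1
    if abs(row_a - row_b) == 1:
        upper, lower = (col_a, col_b) if row_a < row_b else (col_b, col_a)
        return upper - lower in (0, 1)
    return False


def adjacent_pairs(dangerous_keys):
    """Return all pairs of dangerous keys that sit next to each other."""
    keys = sorted(dangerous_keys)
    return [(a, b) for i, a in enumerate(keys)
            for b in keys[i + 1:] if are_adjacent(a, b)]
```

With this model, “K” and “I” are reported as neighbors while “K” and “S” are not, which matches the choice described above. The German layout swaps “Y” and “Z”, which does not affect these three keys.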

Every keystroke has an immediate visual effect. When the user hits “+” or “-”, he/she sees the marker (|) move immediately. The Data Labeler always has an overview of the last n texts he/she transcribed in order to look up the current context.

Lastly, we show a progress bar (always good psychologically) and the file name of the audio file currently in QA. In the worst case, the Data Labeler can even listen to the audio if the transcription is too horrible to know what to choose. This has never happened so far, but you never know.

Of course, there always needs to be a Help window.

We also invested a lot of time (using curses) to make the UI as nice as it can be on a text-based screen. The QA task is selected with the OS’s standard GUI file selector.

There are more, smaller lessons, but they are all about making the life of the Data Labeler as easy as possible. The actual logic in the tool didn’t change much after initial development. But the UI and UX got better and better. And we are still investing in the UI and UX…


The Data Labeler actually has better future prospects than the Engine Designer/Developer. Since the training data needs to be generated and labeled, this is a job for the next 50-100 years (at least).

But it is important to make his/her life as easy as possible. Otherwise the resulting data will contain bugs and errors. Training a system with erroneous data results in … wrong training.

So, please: invest in your tools for your Data Labelers. Invest in these tools more than in your engines. That is where the know-how is…