In ASR, familiarity with vocabulary matters
The worlds of Automatic Speech Recognition (ASR) and human language learning are strikingly similar. How effectively an ASR system transcribes a conversation depends strongly on how familiar that system is with the words being used.
Try to think back to a time when you were 8 or 9 years old, sitting at the table of a large family gathering listening to the adults talk about adult stuff. More than likely, you listened for a while, then began to get restless, and decided to pick on your kid brother, or tug at someone’s sleeve. In my case, I broke stuff.
If you have kids, you don’t need to think back to this. Your kids likely begin carrying on 7 minutes after the adults get to talking about real issues: politics, science and Bertrand Russell’s take on metaphysics.
Why do children get bored at these conversations? After all, humans are, by definition, the best speech recognition systems we have. And 7-, 8-, and 9-year-old children certainly are native speakers of their own language(s). Are kids simply not interested in adult stuff? No. In fact, kids just want to be adults, and fast!
It’s not the grammar or language model; it’s just knowing the words!
The answer in great part has to do with the fact that humans, like ASR systems, need to be familiar with the content words of a conversation to understand it. One of the biggest differences between the language of children and that of adults is not found in the grammar, but in the technical (specialized) nature of the words adults use. Children, even precocious ones, just don’t know these words; they can’t even remember having heard them until years later when they read them in high school or college. At that point, there’s a chance they still wouldn’t know what they meant.
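One rough way to put a number on "familiarity with the words" is an out-of-vocabulary (OOV) rate: the share of words in a transcript that a given vocabulary does not contain. The sketch below is only an illustration. The function names, the tiny `general_vocab` set, and the sample sentence are all hypothetical, and a real ASR vocabulary would be far larger.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def oov_rate(transcript, vocabulary):
    """Return the fraction of tokens in `transcript` missing from `vocabulary`."""
    tokens = tokenize(transcript)
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocabulary]
    return len(unknown) / len(tokens)


# Hypothetical general-purpose vocabulary, kept tiny for illustration.
general_vocab = {"the", "patient", "has", "a", "history", "of", "and", "was", "given"}

# Hypothetical sentence containing two specialized (medical) content words.
sentence = "The patient has a history of tachycardia and was given metoprolol"

print(f"OOV rate: {oov_rate(sentence, general_vocab):.2%}")
```

Here the two unfamiliar words are exactly the specialized content words, which is the situation the 8-year-old at the dinner table is in.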
Expecting an “off-the-shelf” ASR system to provide accurate transcription of your audio data is about as logical as expecting an 8-year-old to do the same. In fact, we can carry the analogy further. A quick internet search for interpreters and human transcriptionists will show that these professionals have specializations that allow them to work with particular industries and professions: medical, legal, insurance and so on. These transcription and translation professionals work hard to keep up to date with the jargon and language of their specialization areas. They also know that their specialized knowledge is a unique skill set, and charge accordingly.
If you can, go custom
The beauty of getting a custom ASR model trained on your audio data is that you can get significant increases in accuracy over the standard models available today. In terms of our analogy, this is like hiring a college student with a major in your field to do your transcription instead of the 8-year-old.
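If you want to check an accuracy claim on your own audio, a common yardstick is word error rate (WER): the word-level edit distance between a trusted reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch, not any provider's method. The function name and the two sample outputs are made up for illustration and are not results from a real system.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented purely to show the calculation.
reference = "the patient was given metoprolol"
generic_output = "the patient was given metro pole all"
custom_output = "the patient was given metoprolol"

print(f"generic model WER: {word_error_rate(reference, generic_output):.2f}")
print(f"custom model WER:  {word_error_rate(reference, custom_output):.2f}")
```

Lower is better. In this invented example a single unfamiliar drug name accounts for every error in the generic output.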
When considering an ASR provider, find out what data they’ve used to train their speech models and what customization steps they offer. While a large sample of the world’s data might sound impressive, it’s not nearly as helpful as teaching an ASR system to specialize in your data. Also, don’t hire 8-year-olds to do transcription.