Selecting the best automatic speech recognition (ASR) option for your company is an important decision. Most of this article explains how to test ASR accuracy effectively, but the first step in any important buying decision is identifying your priorities:
- What do you want?
- What do you need?
- What doesn’t matter?
Typical considerations include weighing each provider's strengths and weaknesses in:
- Ability to support custom models and vocabulary
- Multi-channel support
- Speaker separation
- Deep search
- And more
Getting a sense of what features your company might need before starting talks with providers will help you avoid the common trap of relying purely on accuracy rate. Otherwise, you’ll likely find yourself having this conversation:
Buyer: “We’ve been looking at a couple of ASR providers... What’s your accuracy rate?”
ASR Provider: “Fantastic. On an academic data set that is publicly known, we claim a 95% accuracy rate.”
Buyer: “That sounds great! But how does that relate to our audio data?”
ASR Provider: “Trust me, we’ll do great on that too!”
Hiding behind the numbers
For a long time, ASR companies have avoided doing real comparisons on company-specific audio data by focusing marketing dollars and sales narratives on impressive outcomes from public datasets, such as new advances in word error rate.
By distracting companies from the fact that gamed success statistics don’t translate to real-world applications, ASR providers have been able to trick companies into buying a car without test driving it first.
So, what is the best way to actually test drive and walk away with a great deal?
Getting the truth
With the goal of getting the truth while investing as little effort as possible, here are guidelines for testing speech recognition providers in an apples-to-apples accuracy comparison:
Select 50 randomly sampled audio files that are representative of the audio your company encounters.
Do:
- Use meeting recordings if your goal is to transcribe meetings
- Use voicemails if the goal is to transcribe voicemails
- Use audio with accents if your audio will have speakers with accents
Don't:
- Record yourself talking into your computer
- Use a random YouTube video
- Test out your favorite podcast or broadcast audio
- Use a song
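The random sampling in this step can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed tool: the function name `sample_audio_files`, the folder layout, and the list of file extensions are all assumptions made for illustration.

```python
import random
from pathlib import Path

def sample_audio_files(audio_dir, n=50, seed=0, extensions=(".wav", ".mp3", ".flac")):
    """Return n randomly chosen audio files found under audio_dir."""
    # Sort first so the candidate order does not depend on the filesystem.
    candidates = sorted(
        p for p in Path(audio_dir).rglob("*") if p.suffix.lower() in extensions
    )
    if len(candidates) < n:
        raise ValueError(f"Only {len(candidates)} audio files found; need {n}.")
    # A fixed seed makes the sample reproducible, so every vendor is
    # tested on exactly the same files.
    return random.Random(seed).sample(candidates, n)
```

Sampling from your real archive, rather than hand-picking files, keeps easy or flattering recordings from being over-represented.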
Pay humans to transcribe one minute from each of these files. This effort should cost $100 or less and will serve as the truth of all truths for all the ASR providers you'll be comparing.
(You can easily find transcriptionists using Rev.com or Upwork)
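If your recordings are uncompressed WAV files, cutting a one-minute clip can be done with Python's standard `wave` module. The sketch below assumes WAV input and, by default, takes the first minute; the function name `cut_clip` and its parameters are hypothetical, and compressed formats such as MP3 would need a different tool.

```python
import wave

def cut_clip(src_path, dst_path, start_seconds=0, seconds=60):
    """Copy `seconds` of audio from an uncompressed WAV file, starting at start_seconds."""
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start so we never seek past the end of the file.
        src.setpos(min(start_seconds * rate, src.getnframes()))
        frames = src.readframes(rate * seconds)
    with wave.open(str(dst_path), "wb") as dst:
        # Reuse channels, sample width, and sample rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
```

Send the same clips to the human transcriptionists and to every vendor, so the reference transcript and the ASR outputs cover identical audio.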
Send the same 50 one-minute clips to each of the speech vendors that you are considering to test the output of their APIs.
(Take note of what each provider deems an acceptable audio format and how it fits into your list of considerations from earlier.)
Receive the text outputs and normalize them to account for the formatting “choices” that each ASR company makes in its out-of-the-box transcripts.
- How are phone numbers transcribed?
  - nine zero five six seven eight one two three four?
  - 9 0 5 6 7 8 1 2 3 4?
- Are outputs punctuated and capitalized?
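A normalization pass along these lines can be sketched as follows. It is a minimal example under stated assumptions: English text, digits spelled out one at a time (so "23" becomes "two three", which will not match a transcript that says "twenty three"), and punctuation and capitalization discarded. The `normalize` function and `DIGIT_WORDS` table are illustrative names, not part of any vendor's tooling.

```python
import re

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def normalize(text):
    """Turn a transcript into a list of comparable lowercase words."""
    text = text.lower().replace("\u2019", "'")  # curly apostrophe -> straight
    # Spell out each digit individually: "905" -> " nine  zero  five ".
    text = re.sub(r"\d", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Replace punctuation with spaces, keeping apostrophes ("we'll").
    text = re.sub(r"[^\w\s']+", " ", text)
    return text.split()
```

With this pass, `normalize("Call 905-678-1234.")` and `normalize("call nine zero five six seven eight one two three four")` produce the same word list, so a vendor is not penalized for a formatting choice. Apply the same function to the human reference and to every vendor's output.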
Do a Word Error Rate (WER) comparison on the files. Don’t just look at the number; look at where the output was wrong and why. This includes which words were incorrectly added, omitted, or simply misinterpreted.
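WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the human reference. The sketch below is one way to compute it with a standard minimum-edit-distance alignment while also keeping the individual errors for inspection. The function names `align` and `wer` are illustrative, and the inputs are assumed to be lists of already-normalized words.

```python
def align(ref, hyp):
    """Return (op, ref_word, hyp_word) tuples from a minimum-edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner to recover the operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], None))  # word the ASR omitted
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1]))  # word the ASR added
            j -= 1
    return ops[::-1]

def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("Reference transcript is empty.")
    errors = sum(op != "ok" for op, _, _ in align(ref, hyp))
    return errors / len(ref)
```

For example, with the reference `"the quick brown fox".split()` and the output `"the quack brown fox jumps".split()`, the alignment contains one substitution (quick, quack) and one insertion (jumps), giving a WER of 2 / 4 = 0.5. Note that WER can exceed 1.0 when a vendor inserts many extra words.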
Make a visual representation of what was wrong.
- Who is getting the important words right vs. wrong?
- Whose outputs are the most legible?
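One lightweight way to build such a view is a marked-up transcript plus a score for the words you care about most. The sketch below uses Python's standard `difflib` module for the markup, which produces a readable diff but is not guaranteed to be the minimum-edit alignment used for WER. The function names and the bracket notation are assumptions chosen for illustration, and the keyword list is whatever your business considers important.

```python
import difflib
from collections import Counter

def mark_errors(ref, hyp):
    """Return one string with differences between reference and ASR output bracketed."""
    out = []
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        r, h = " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])
        if tag == "equal":
            out.append(r)
        elif tag == "replace":
            out.append(f"[{r} -> {h}]")   # misinterpreted
        elif tag == "delete":
            out.append(f"[-{r}-]")        # omitted by the ASR
        else:  # "insert"
            out.append(f"[+{h}+]")        # added by the ASR
    return " ".join(out)

def keyword_recall(ref, hyp, keywords):
    """Share of keyword occurrences in the reference that the ASR output also contains."""
    ref_counts, hyp_counts = Counter(ref), Counter(hyp)
    expected = sum(ref_counts[k] for k in keywords)
    found = sum(min(ref_counts[k], hyp_counts[k]) for k in keywords)
    return found / expected if expected else None

ref = "please refund my premium subscription".split()
hyp = "please refund my premiere subscription today".split()
print(mark_errors(ref, hyp))
# please refund my [premium -> premiere] subscription [+today+]
print(keyword_recall(ref, hyp, {"refund", "premium"}))
# 0.5
```

Running this per vendor on the same clips makes it easy to see who gets the important words right, even when overall WER figures are close.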
Make your move
At this point, you will know where each ASR provider stands from an accuracy perspective on audio representative of your use case. Next, consider the pricing structure and any capabilities you need beyond baseline accuracy.
With a good handle on where each competitor stands in terms of accuracy, you can confidently go into pricing conversations and make better decisions for your business.
If you're ready to test out Deepgram's ASR solution, contact firstname.lastname@example.org.