Selecting the best ASR option for your company is an important decision. While the bulk of this article is an educational piece on how to effectively test ASR providers, the first step in any important buying decision is identifying your priorities:
What do you want?
What do you need?
What doesn't matter?
Besides what was discussed in the tip sheet How to Vet an Automatic Speech Recognition Solution Provider?, here are some other features you should consider:
Speed - How fast is your batch transcription? How fast is your real-time transcription?
Multi-channel support - Do you support multi-channel audio or only single channel?
Speaker separation - Can you separate the speakers (e.g., speaker 1 vs. speaker 2), even for non-stereo applications?
Deep search - Can you search the audio itself to find specific words or phrases to listen to and review, rather than only searching the transcription?
Getting a sense of what features your company might need before starting talks with providers will help you avoid the common trap of relying purely on accuracy rate. Otherwise, you'll likely find yourself having this conversation:
Buyer: "We've been looking at a couple of ASR providers... what's your accuracy rate?"
ASR Provider: "Fantastic. On an academic data set that is publicly known, we claim a 95% accuracy rate."
Buyer: "That sounds great! But how does that relate to our audio data?"
ASR Provider: "Trust me, we'll do great on that too!"
Buyer: "Hmm..."
For a long time, ASR companies have avoided doing real comparisons on company-specific audio data by focusing marketing dollars and sales narratives on impressive outcomes from public datasets, like new advances on word error rates. By distracting companies from the fact that gamed success statistics don't translate to real-world applications, ASR providers have been able to trick companies into buying a car without test driving it first. So, what is the best way to actually test drive and walk away with a great deal?
How to test an ASR Solution?
With the goal of getting the truth while investing as little effort as possible, here are optimal guidelines for testing speech recognition providers in an apples-to-apples accuracy comparison:
Step 1: Select 50 randomly sampled audio files that are representative of the audio your company encounters
Do:
Use meeting recordings if your goal is to transcribe meetings
Use voicemails if the goal is to transcribe voicemails
Use audio with accents if your audio will have speakers with accents
Don't:
Record yourself talking into your computer
Use a random YouTube video
Test out your favorite podcast or broadcast audio
Use a song
Pay humans to transcribe one minute from each of these files. This effort should cost $100 or less and will serve as the truth of all truths for all the ASR providers you'll be comparing. You can easily find transcriptionists using Rev.com or Upwork.
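Assuming your recordings sit in a single directory, the sampling part of this step can be sketched in a few lines of Python (the directory layout, file extension, and function name here are illustrative, not prescribed):

```python
import random
from pathlib import Path

def sample_audio_files(audio_dir, n=50, seed=0):
    """Randomly sample n audio files from a directory of recordings."""
    files = sorted(Path(audio_dir).glob("*.wav"))  # sort first so the seed is reproducible
    rng = random.Random(seed)
    return rng.sample(files, min(n, len(files)))
```

Fixing the seed means every vendor gets tested on the exact same clips, even if you rerun the selection later.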
Step 2: Send the same 50 one-minute clips to each of the speech vendors that you are considering to test the output of their APIs
Take note of what each provider deems an acceptable audio format and how it fits into your list of considerations from earlier.
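Every vendor's SDK and request format is different, so one way to keep the comparison orderly is to wrap each vendor's API call behind a callable of the same shape and collect the outputs keyed by vendor and clip. The function and the vendor callables below are hypothetical placeholders for whatever each provider's real SDK exposes:

```python
def collect_transcripts(clips, vendors):
    """Run every clip through every vendor's transcribe function.

    clips   - list of audio clip paths
    vendors - dict mapping vendor name -> transcribe(clip_path) -> text
    Results are keyed by (vendor, clip) so outputs line up for comparison.
    """
    results = {}
    for name, transcribe in vendors.items():
        for clip in clips:
            results[(name, clip)] = transcribe(clip)
    return results
```

With this shape, adding a new provider to the bake-off is just one more entry in the `vendors` dict.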
Step 3: Receive the text outputs and normalize them for the "choices" that an ASR company makes with their out-of-the-box transcripts
How are phone numbers transcribed?
nine zero five six seven eight one two three four?
9 0 5 6 7 8 1 2 3 4?
Are outputs punctuated and capitalized?
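A minimal normalization pass, assuming the only "choices" you need to fold away are case, punctuation, and digit style (extend it for anything else your providers disagree on):

```python
import re

DIGITS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
          "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize(text):
    """Fold away formatting choices so WER measures recognition, not style."""
    text = text.lower()
    # spell out digits so "9 0 5" and "nine zero five" compare equal
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return " ".join(text.split())        # collapse whitespace
```

Run the same function over the human reference transcripts and every vendor's output before scoring, so no provider is penalized for a formatting convention.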
Step 4: Do a Word Error Rate (WER) comparison on the files
Word Error Rate (WER) will give you a sense of the overall accuracy of the transcripts, but don't just look at the number. You also want to look at where the output was wrong and why the output was wrong. This includes what words were incorrectly added, omitted, or simply misinterpreted.
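WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words; a self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate via word-level edit distance (Levenshtein)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)
```

A perfect transcript scores 0.0; one wrong word in a six-word reference scores about 0.17.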
Step 5: Make a visual representation of what was wrong
Who is getting the important words right vs. wrong?
Whose outputs are the most legible?
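One lightweight way to build that visual is Python's standard-library difflib, marking words the provider missed with "-" and words it inserted with "+" (the marker format is just one possible convention):

```python
import difflib

def show_errors(reference, hypothesis):
    """Mark each word as correct, missed (-), or inserted (+)."""
    ref, hyp = reference.split(), hypothesis.split()
    out = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, ref, hyp).get_opcodes():
        if op == "equal":
            out.extend(ref[i1:i2])
        else:
            out.extend("-" + w for w in ref[i1:i2])   # words the provider missed
            out.extend("+" + w for w in hyp[j1:j2])   # words it added or substituted
    return " ".join(out)
```

Scanning this output across providers makes it obvious who is dropping the important words versus merely fumbling filler.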
At this point, you will know where each ASR provider stands from a general accuracy perspective on audio representative of your use case. Next, consider the other variables in How to Vet an Automatic Speech Recognition Solution Provider? and drill down on accuracy improvements with keywords, libraries, reprogramming, and custom training. With a good handle on where each competitor stands in terms of all these variables, you can confidently go into pricing conversations and make better decisions for your business. If you're ready to compare Deepgram's AI Speech Platform with other ASR providers, contact us.
If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.