
Benchmarking OpenAI Whisper for non-English ASR

Dan Shafer November 4, 2022 in Linguistics

You may be wondering: Can a single deep learning model—granted, a large one—really achieve accurate and robust automatic speech recognition (ASR) across many languages? Sure, why not? In this post, we will discuss benchmarking OpenAI Whisper models for non-English ASR.

We'll first go over some basics. How does one measure the accuracy of an ASR model? We'll work through a simple example. We will then discuss some of the challenges of accurate benchmarks, especially for non-English languages. Finally, we'll choose a fun mix of languages we are familiar with here at Deepgram—Spanish, French, German, Hindi, and Turkish—and benchmark Whisper for those, using curated publicly available data we have labeled in-house.

Measuring the Accuracy of an ASR Model

Benchmarking ASR, for English or otherwise, is simple in concept but tricky in practice. Why? Well, to start: a lot comes down to how you normalize (standardize) the text, which otherwise will differ between your model, another model, and whatever labels you consider to be the ground truth.

Let's look at a simple example in English. Say your ground truth is:

My favorite city is Paris, France. I’ve been there 12 times.

And perhaps an ASR model predicts:

my favorite city is paris france ive been there twelve times

This might be the sort of output you would get if you never intended your model to predict capitalization or punctuation, just words.

Considering that, the model does really well, right? In fact, it's perfect! To verify this, we will compute the word error rate (WER), defined as the total number of mistakes (insertions, deletions, or replacements of words) divided by the total number of words in the ground truth. A value of zero means the prediction was spot on. Typically, the WER is less than 1, though you would find a value of exactly 1 if no words were predicted and an arbitrarily large value if many more words are predicted than are in the ground truth.

Let's directly compute the WER for our example. To get set up, we make sure we have installed the editdistance package. It will enable us to compute the minimum number of insertions, deletions, and replacements needed to make any two sequences match.

pip install editdistance

import re
import editdistance

# Define a helper function to compute the WER, given an arbitrary function
# that converts the text to word tokens

def _wer(truth, pred, tokenizer):
    truth_tokens = tokenizer(truth)
    pred_tokens = tokenizer(pred)
    return editdistance.eval(truth_tokens, pred_tokens) / len(truth_tokens)

# Store a hypothetical ground truth and a model's prediction

truth = "My favorite city is Paris, France. I've been there 12 times."
pred = "my favorite city is paris france ive been there twelve times"

# Compute and display the WER obtained after simply splitting the text on whitespace

wer = _wer(truth, pred, str.split)
print(f'WER: {wer}')

You will find that the WER is 0.55! Hey! What happened? The word tokens must be identical strings to not count as an error: "I've" is not the same as "ive", and so on. Clearly, we need to ignore punctuation and capitalization when computing the WER. So we can normalize by removing both. Let's try it.

# Define a helper function to lowercase the text and strip common punctuation characters, using a regex substitution, before splitting on whitespace

def _normalize(text):
    text = text.lower()
    text = re.sub(r'[\.\?\!\,\']', '', text)
    return text.split()

wer = _wer(truth, pred, _normalize)
print(f'WER: {wer}')

Now we find a WER value of 0.09, which is much better! If you think through the example, you will notice one remaining issue: the numeral "12" vs. the word "twelve". In some applications, we may actually want to consider that an error (e.g. maybe you really need your model to produce a numeral over a word when it hears a spoken number). Here, assume we had no intention of penalizing the model, since it did get the right number. Let's install the handy num2words package, which converts numbers in digit form to their word form, and define a modified tokenizer.

pip install num2words

from num2words import num2words

# Define a helper function that takes as input a regex match object, assumed to be a (string) integer, and replaces it with the corresponding word(s) from num2words

def _to_words(match):
    return f' {num2words(int(match.group()))} '

# Same as before, but now we also normalize numbers

def _normalize_num(text):
    text = text.lower()
    text = re.sub(r'[\.\?\!\,\']', '', text)
    text = re.sub(r'\s([0-9]+)\s', _to_words, text)
    return text.split()

wer = _wer(truth, pred, _normalize_num)
print(f'WER: {wer}')

Finally! We have zero WER. 

Before getting into non-English languages, let’s recap:

It just so happens that our text normalization leaves the model's output unchanged, but that won't generally be the case; the normalization function is applied to both the ground truth and the prediction. And of course, this is a simple example. There is a lot more one could do to handle numbers correctly: what about times, currencies, years, and addresses? And what about other English-specific conventions? Is "Dr." the same as "doctor"? The more flexible your normalization needs to be, the more complicated it gets. But even what we have done above may get us pretty far in some cases.

Benchmarking ASR for Non-English Languages

So...languages are hard. There are lots of them, and the way they are written can be totally different. Here is merely a sample of what you might encounter:

  • While some languages have a more well-defined, limited vocabulary, others are fundamentally agglutinative, meaning morphemes (word parts with some meaning) can be mashed together in many ways to form many unique words. 

  • Most languages are read left to right, but some are read right to left. 

  • Some only have a couple dozen characters in their alphabet and others have thousands. 

  • Punctuation differs, and some don't really have any. 

  • Some don't really have words, just sequences of characters. 

  • Sometimes individual Unicode characters are stacked on top of one another to display a word, and the rendering depends on the order in which they appear.

  • Sometimes there are totally different scripts one could use to write a language. 

It's almost like different cultures just came up with stuff over tens of thousands of years, sharing ideas here and there by fighting wars and trading with one another. I'll bet they didn't even consult statisticians to determine what is efficient and optimal, or computer scientists to advise what would simplify code and display nicely in terminals. But hey, this is what makes the world an interesting place!

To benchmark non-English ASR, we not only have to pay attention to basic text normalization like the lowercasing and punctuation we considered above, but we also probably have to know at least a thing or two, and maybe a lot, about the language itself. We have to decide when two different words (or characters) should be considered the same/equivalent and when we should count an error.

Consider French, where despite their best efforts, people still write certain numbers in word form variably with and without dashes between the words:

truth = "cent-quarante-huit-mille-neuf-cent-vingt"
pred = "cent quarante huit mille neuf cent vingt"

I believe that number is 148,920, but I would be more certain had my 3 semesters of high school French not ended approximately 148,920 hours ago. Regardless, if you're not careful, you'll add 7 errors to your WER calculation from that one example alone!

Consider also Hindi, where frequent code-switching with English occurs in most modern conversation. There are also two scripts: the traditional Devanagari script and an increasingly common romanized script. For instance:

Devanagari: आपका स्वागत है
Romanized: aapka swagat hai
English translation: you are welcome

One can attempt to convert one script to the other automatically, but a rules-based approach will be imperfect, and the romanization itself is not really standardized. For instance, one of many possible rules-based romanizations for the above example might be aapaka svaagat hai, where two out of three words are misspelled relative to the generally accepted scheme. Your ground truth and, separately, your model, could either contain all Devanagari (with transliterated English), all Latin script (with romanized Hindi words), or a mix of both scripts. Sounds like fun, right? There are things like this to consider for every language.
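To quantify the cost of leaving the romanization mismatch unhandled, we can compare the two romanizations directly. The tokens happen to align one-to-one here, so counting position-wise mismatches gives the same answer as a full edit-distance computation:

```python
truth = "aapka swagat hai".split()   # the generally accepted romanization
pred = "aapaka svaagat hai".split()  # a plausible rules-based romanization

# The sequences have equal length and align token by token, so errors
# reduce to position-wise mismatches (substitutions only)
errors = sum(t != p for t, p in zip(truth, pred))
print(f"WER: {errors / len(truth):.2f}")  # WER: 0.67
```

Two thirds of the words count as errors, even though a Hindi speaker would read both lines the same way.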

Benchmarking Whisper

Now, let's cut to the chase and do some benchmarking of Whisper for several languages we have studied and developed ASR models for at Deepgram. We'll focus on Spanish, German, French, Hindi, and Turkish.

OpenAI presents some very impressive-looking benchmarks for the Whisper large model across several languages. For other languages, the accuracy is lower, and for some it's effectively zero (WER near or greater than 1). Still, it's quite an impressive list. Check out the plot, which can be found right in their GitHub README.

The thing is, these benchmarks are based on the FLEURS dataset, which consists of short utterances (i.e. sentences) that are read carefully by people in relatively low-noise, clean acoustic environments (you can listen to examples for yourself right on the Hugging Face page – pretty nifty). And why not, right? But this is not going to tell you much about real-world performance. So here we would like to test Whisper on more realistic data: longer audio files found "in the wild" – that is, the internet – curated by native speakers of the languages and, in some cases, selected for a variety of accents and speech domains.

Though much of our focus in this post has been on text normalization, there is another thing that is important for any sort of ASR benchmarking: accurate and consistent labels! All of the data we will test has been labeled in-house, and our teams develop their own style guides unique to each language to ensure consistent labeling. We will also use Deepgram's internal text normalization tools, which are language-specific and, while nothing is perfect, they handle many of the issues discussed above.

Below is a summary of the data we used to benchmark each language. We aimed for enough hours (more than 10) and files (roughly 100 or more) per language so that the aggregate WER statistics are meaningful.

We will focus on the large Whisper model for our tests. This facilitates a meaningful comparison with the FLEURS benchmarks, which also used the large model. Also, we've found that, in general, model size doesn't make a big difference for English ASR, but it does for other languages.

Below, we compare the distribution of WER results at the file level for each language. We also compare the overall WER (total errors / total truth words) for each language with the value reported for FLEURS data.
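Note that the overall WER is not the same thing as the average of the per-file WERs: pooling errors weights long files more heavily. A minimal sketch with hypothetical (error count, truth word count) pairs makes the difference visible:

```python
# Hypothetical per-file (errors, truth_words) pairs; not real benchmark data
files = [(5, 100), (40, 400), (2, 10)]

# Corpus-level WER: pool all errors over all truth words
overall_wer = sum(e for e, _ in files) / sum(n for _, n in files)

# Mean of per-file WERs: every file counts equally, including short ones
mean_file_wer = sum(e / n for e, n in files) / len(files)

print(f"overall WER:   {overall_wer:.3f}")   # 0.092
print(f"mean file WER: {mean_file_wer:.3f}") # 0.117
```

The short, high-error file pulls the per-file mean up while barely moving the pooled number, which is why we report both the file-level distribution and the overall WER.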

One thing we notice across the board is that the model does not perform nearly as well on our real-world curated data compared to the academic FLEURS dataset. This result is fairly typical of open-source models. They will generally be optimized to do well on academic datasets, but with messier real-world audio, the models struggle. That said, with the exception of Hindi, the accuracy is still reasonable. The Spanish and German results are a bit worse than the French and Turkish results; not only is the file-level WER higher on average, there's also higher variance. This is not unexpected, given that the Spanish and German audio were selected to contain a variety of accents, and the Spanish data especially has some challenging audio in the mix. It might be reasonable to conclude that Whisper does roughly equally well on all but Hindi.

Whisper Wrap-Up

So what have we learned? Hopefully you've seen why benchmarking ASR performance is a tricky business, especially for non-English languages, where each language has its own considerations for text normalization. That makes it a challenge for a researcher who doesn't speak the language, and even for one who does! Accurate benchmarks also require accurate and consistent ground truth labels, and the results should always be contextualized by an understanding of the type of data the model is evaluated on.

We have also learned that Whisper, at least with its larger models, achieves fairly robust and accurate ASR for at least a few languages. However, we argue that the curated public dataset WER results above are much more indicative of Whisper's performance on real-world data than benchmarks on datasets like FLEURS. And this is certainly not the end of the story. Our results on curated public audio probably do not generalize to the type of audio one would expect from many enterprise ASR customers—phone conversations, meetings, etc. We'd expect Whisper's error rate on that type of data to be even higher. If you have tested the performance of Whisper for ASR on any non-English languages, we'd love to hear about it!

Header image credit: DALL-E by OpenAI; prompt by Ross O'Connell

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
