So, you want to run the OpenAI Whisper tool on your machine? You can install it straight from the OpenAI GitHub repository to get up and running!

Setup

You'll need Python on your machine, at least version 3.7. If you want to isolate these experiments from other work, set up a virtual environment with venv (or conda or the like).

mkdir whisper
cd whisper
python3 -m venv venv
source venv/bin/activate

# always a good idea to make sure pip is up-to-date
pip3 install --upgrade pip

Next, install the Whisper package and its dependencies (torch, numpy, transformers, tqdm, more-itertools, and ffmpeg-python) into your Python environment directly from the GitHub repository.

pip3 install git+https://github.com/openai/whisper.git

Especially if it's pulling torch for the first time, this may take a little while. The repository documentation advises that if you see errors building the wheel for tokenizers, you may also need to install Rust. You'll also need the ffmpeg binary itself - installation depends on your platform. Here are some examples:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Using the Tool

Great! You're ready to transcribe! In this example, we're working with Nikola Tesla's vision of a wireless future - you can get this audio file from the LibriVox archive of public-domain audiobooks and bring it to your local machine if you don't have something queued up and ready to go.

The OpenAI Whisper tool offers a variety of models, both English-only and multilingual, in a range of sizes that trade off speed against accuracy. You can learn more about this in the repository's README. We, the researchers at Deepgram, have found that the small model provides a good balance.
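If you're curious exactly which checkpoints your install offers, the package will list them for you. Here's a quick check from Python, assuming the pip install above succeeded:

import whisper

# Every checkpoint name the installed package knows about,
# e.g. "tiny.en", "tiny", "base", "small", "medium", "large"
print(whisper.available_models())

With the small model chosen, run the transcription: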

whisper "snf025_nikolateslawirelessvision_anonymous_gu.mp3" --model small --language English
[00:00.000 --> 00:09.880]  Nikola Tesla sees a wireless vision by Anonymous, the New York Times, 3rd October, 1915.
[00:09.880 --> 00:11.920]  This is a LibriVox recording.
[00:11.920 --> 00:14.920]  All LibriVox recordings are in the public domain.
[00:14.920 --> 00:20.200]  For more information or to volunteer, please visit LibriVox.org.
[00:20.200 --> 00:23.760]  Nikola Tesla sees a wireless vision.
[00:23.760 --> 00:29.080]  Things his world system will allow hundreds to talk at once through the Earth.
[00:29.080 --> 00:31.760]  Trans-static disturbance.
[00:31.760 --> 00:37.480]  Inventor hopes also to transmit pictures by the same medium which carries the voice.
[00:37.480 --> 00:43.880]  Nikola Tesla announced to the Times last night that he had received a patent on an invention
[00:43.880 --> 00:50.320]  which would not only eliminate static interference, the present bugaboo of wireless telephony,
[00:50.320 --> 00:55.520]  but would enable thousands of persons to talk at once between wireless stations and make
[00:55.520 --> 01:01.920]  it possible for those talking to see one another by wireless regardless of the distance separating

...

[07:25.160 --> 07:30.520]  Wireless is coming to mankind in its full meaning like a hurricane some of these days.
[07:30.520 --> 07:36.120]  Some day there will be, say, six great wireless telephone stations in a world system connecting
[07:36.120 --> 07:42.080]  all the inhabitants on this earth to one another, not only by voice, but by sight.
[07:42.080 --> 07:45.240]  Its surely coming.
[07:45.240 --> 07:50.940]  End of Nikola Tesla sees a wireless vision by Anonymous, The New York Times, 3rd October
[07:50.940 --> 08:13.840]  1915.
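The command line isn't the only way in: the whisper package also works as a Python library, which is handy if you want to post-process results in code. Here's a minimal sketch that mirrors the CLI invocation above:

import whisper

# Load the small checkpoint (weights are downloaded on first use)
model = whisper.load_model("small")

# Transcribe the same file; naming the language skips auto-detection
result = model.transcribe(
    "snf025_nikolateslawirelessvision_anonymous_gu.mp3",
    language="English",
)

# The full transcript as one string
print(result["text"])

# Or walk the timestamped segments, like the CLI output above
for segment in result["segments"]:
    print(f"[{segment['start']:7.2f} --> {segment['end']:7.2f}] {segment['text']}")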

Deepgram's Whisper API Endpoint

Getting the Whisper tool working on your machine may require some fiddly work with dependencies, especially for torch and whatever software is already driving your GPU. Our OpenAI Whisper API endpoint is easy to work with on the command line - you can use curl to quickly send audio to our API.

This call sends your file to the API (substitute your own Deepgram API key) and saves the response to a local JSON file called n_tesla.json:

curl --request POST \
  --header 'Authorization: Token YOUR_DEEPGRAM_API_KEY' \
  --upload-file snf025_nikolateslawirelessvision_anonymous_gu.mp3 \
  --url 'https://api.deepgram.com/v1/listen?model=whisper' \
  --output n_tesla.json
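If you'd rather script this than shell out to curl, the same request takes a few lines of Python. Here's a minimal sketch using the third-party requests package, assuming your key lives in a DEEPGRAM_API_KEY environment variable:

import json
import os

import requests

# Stream the raw audio bytes as the request body
with open("snf025_nikolateslawirelessvision_anonymous_gu.mp3", "rb") as audio:
    response = requests.post(
        "https://api.deepgram.com/v1/listen?model=whisper",
        headers={
            "Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}",
            "Content-Type": "audio/mpeg",
        },
        data=audio,
    )
response.raise_for_status()

# Save the full response, just like curl's --output
with open("n_tesla.json", "w") as out:
    json.dump(response.json(), out)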

The response comes back in Deepgram's JSON format, which includes the transcript as well as metadata about your transcription request. A quick way to view the transcript is with the jq tool:

jq '.results.channels[0].alternatives[0].transcript' n_tesla.json

...and here's your transcript!

"Nikola Tesla sees a wireless vision by anonymous the New York Times 3rd October 1915. This is a LeapRvox recording. All LeapRvox recordings are in the public domain. For more information or to volunteer, please visit LeapRvox.org. Nikola Tesla sees a wireless vision. Things his world system will allow hundreds to talk at once through the earth. Nikola Tesla responds static disturbance. Inventors hopes also to transmit pictures by the same medium which carries the voice. Nikola Tesla announced to the Times last night that he had received a patent on an invention which would not only eliminate static interference, the present bugaboo of wireless telephony, but would enable thousands of persons to talk at once between wireless stations and make it possible for those talking to see one another by wireless, regardless of the distance separating them. 

...

Wireless is coming to mankind, and it's full meaning like a hurricane some of these days. Some day there will be, say, six great wireless telephone stations in a world system, connecting all the inhabitants on this earth to one another, not only by voice, but by sight. It's surely coming. End of Nikola Tesla sees a wireless vision by anonymous, the New York Times 3rd October 1915."

But Wait. Those Transcripts Aren't the Same.

Excellent observation! The local run transcribed "LibriVox" correctly, while the API call returned "LeapRvox." This is an artifact of this kind of model: its results are not deterministic. Some optimizations for working with large quantities of audio depend on overall system state and do not produce precisely the same output between runs. In our observations, the resulting difference is typically on the order of 1% (absolute) fluctuation in word-error rate.
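If you want to put a number on the difference yourself, word-error rate is just a word-level edit distance divided by the reference length. Here's a minimal sketch (a standard Levenshtein computation, not Deepgram's internal tooling):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words, divided by reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, hyp_word in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                      # deletion
                dp[j - 1] + 1,                  # insertion
                prev + (ref_word != hyp_word),  # substitution
            )
            prev = cur
    return dp[-1] / len(ref)

# One differing word in an eight-word sentence -> 0.125
print(word_error_rate(
    "all librivox recordings are in the public domain",
    "all leaprvox recordings are in the public domain",
))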

If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions.
