
Are you a podcaster or a vlogger? Or perhaps you’re simply an audio nerd, like us friendly folks over here at Deepgram. Well, if you’re looking for a simple way to transcribe any YouTube video programmatically, stick around! The code below is just for you.

You can automate transcriptions that are more accurate and better formatted than the default YouTube transcriptions. You can summarize long lectures and podcasts. And you can even translate your transcripts into different languages!

This article will cover the first part: getting beautiful, accurate, and easy-to-read transcripts with timestamps. Where you take those transcripts is up to you—you're limited only by your own creativity!

Let’s get started!

First things first: we're going to need some way to download the audio from a YouTube video locally. Thankfully, there are libraries that already do that work for us!

pip install youtube_dl

Once you've got that, you're good to go! The code snippet below takes as input a list of URLs to download. The output is a series of .mp3 files, each with the same title as the video it came from.

import youtube_dl
vids = ['URL to desired video here', ... ]
ydl_opts = {
   'format': 'bestaudio/best',
   'postprocessors': [{
       'key': 'FFmpegExtractAudio',
       'preferredcodec': 'mp3',
       'preferredquality': '192',
   }],
   # change this to change where you download it to
   'outtmpl': './outputs/%(title)s.mp3',
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
   ydl.download(vids)

Great! We have our audio files! The only thing left is to transcribe them. You just need a few lines of code:

from deepgram import Deepgram
import asyncio, json

DEEPGRAM_API_KEY = 'YOUR_API_KEY_HERE' 
FILENAME = "Filename here" 
PARAMS = {'punctuate': True,'tier': 'enhanced'} 

async def main(): 

  # Initialization 
  deepgram = Deepgram(DEEPGRAM_API_KEY) 
  print("Currently transcribing ", FILENAME) 
  
  # start transcribing 
  with open(FILENAME, 'rb') as audio: 
    source = {'buffer': audio, 'mimetype': 'audio/mp3'} 
    response = await deepgram.transcription.prerecorded(source, PARAMS) 
    json_object = json.dumps(response, indent=4) 
    
    # write results 
    transcription_file = './' + FILENAME[:-4] + '.json'

    with open(transcription_file, "w") as outfile:
      outfile.write(json_object)
      print("Bag secured")
    
asyncio.run(main())

Alright, let’s take this one chunk at a time:

  • The first two lines are imports. Nothing out of the ordinary there.

  • The next three lines are the setup parameters. Fill them in as follows:

    • DEEPGRAM_API_KEY should be set to the API key you created upon signing up for Deepgram here.

    • FILENAME should be set to the path of the file you wish to transcribe, written as a string.

    • PARAMS is fine as is, but if you'd like to change the look of your output (whether that's diarizing it, filtering profanity, or using numerals), check out the docs here! A sketch of what those options might look like follows this list.

  • The main() function does the following:

    • Opens the audio file

    • Calls the transcription API

    • Outputs the transcript to a JSON file
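
As promised above, here's a rough sketch of what a more heavily customized PARAMS dictionary might look like. The option names shown here (diarize, profanity_filter, numerals) are described in the docs linked above; treat this as an illustration rather than an exhaustive list:

# Illustrative only: a PARAMS dictionary with a few extra options switched on.
# See the Deepgram docs for the full list of options and their values.
PARAMS = {
    'punctuate': True,         # add punctuation and capitalization
    'tier': 'enhanced',        # use the enhanced model tier
    'diarize': True,           # label which speaker said each word
    'profanity_filter': True,  # mask profanity in the transcript
    'numerals': True,          # write numbers as digits ("42" instead of "forty two")
}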

Or, to transcribe multiple files at once, run the following block of code instead:

from deepgram import Deepgram
import asyncio, json
import time as t
DEEPGRAM_API_KEY = 'YOUR_API_KEY_HERE'
PREFIXES = ['title_1', 'title_2', ... 'title_n']
PARAMS = {'punctuate': True,'tier': 'enhanced'}
async def main(prefixes):
   results = {}
   for prefix in prefixes:
       filename = "../audios/" + prefix + ".mp3"
       # Initialization
       deepgram = Deepgram(DEEPGRAM_API_KEY)
       print("Currently transcribing ", filename)
       # start transcribing
       with open(filename, 'rb') as audio:
           source = {'buffer': audio, 'mimetype': 'audio/mp3'}
           start = t.time()
           response = await deepgram.transcription.prerecorded(source, PARAMS)
           end = t.time()
           print('Time: ' + str(end - start))
       # write results
       num_words_transcribed = len(response['results']['channels'][0]['alternatives'][0]['transcript'].split())
       speed = end - start
       time_per_word = speed / num_words_transcribed
       stats = {
           'num_words' : num_words_transcribed,
           'time_to_transcribe': speed,
           'time_per_word': time_per_word
       }
       results[prefix] = stats
  
   with open('dg_times.txt', 'w') as f:
       for title, stats in results.items():
           f.write(title + ':\n')
           for name, number in stats.items():
               f.write(name + ' : ' + str(number) + '\n')
           f.write('\n\n')
   print('FINISH LINE REACHED')
asyncio.run(main(PREFIXES))

Here, instead of filling in a single FILENAME, you fill in the names of multiple files in the list named PREFIXES. (Note that we do not include the file extension or mimetype in the prefixes list. That is, if you have an audio file named huberman_podcast.mp3, the prefix you'd enter into the PREFIXES list would be huberman_podcast.)
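
By the way, if you downloaded your audio with the youtube_dl snippet earlier, you don't have to type the prefixes out by hand. Here's a small sketch that collects them with pathlib; just point AUDIO_DIR at wherever your .mp3 files actually live:

# Build PREFIXES from whatever .mp3 files are sitting in your audio folder.
# Adjust AUDIO_DIR to match where you downloaded them (e.g. './outputs').
from pathlib import Path

AUDIO_DIR = Path('../audios')

# .stem drops the .mp3 extension, which is exactly what PREFIXES expects
PREFIXES = [path.stem for path in AUDIO_DIR.glob('*.mp3')]
print(PREFIXES)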

One caveat: because each request is awaited inside the for loop, the files are transcribed one after another rather than all at once. If you'd like the requests to run concurrently, you can hand them to asyncio.gather(), as in the sketch below.
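
Here's a minimal sketch of that concurrent version, using the same Deepgram client calls as above. (transcribe_one is just a helper name made up for this example, not part of the SDK.)

from deepgram import Deepgram
import asyncio

DEEPGRAM_API_KEY = 'YOUR_API_KEY_HERE'
PARAMS = {'punctuate': True, 'tier': 'enhanced'}

async def transcribe_one(deepgram, filename):
    # Send a single prerecorded transcription request for one file
    with open(filename, 'rb') as audio:
        source = {'buffer': audio, 'mimetype': 'audio/mp3'}
        return await deepgram.transcription.prerecorded(source, PARAMS)

async def main(prefixes):
    deepgram = Deepgram(DEEPGRAM_API_KEY)
    filenames = ["../audios/" + prefix + ".mp3" for prefix in prefixes]
    # gather() schedules all the requests at once instead of awaiting them one by one
    responses = await asyncio.gather(
        *(transcribe_one(deepgram, filename) for filename in filenames)
    )
    return dict(zip(prefixes, responses))

results = asyncio.run(main(['title_1', 'title_2']))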

And boom! By the end of all this transcription, you should have a bunch of JSON files that look like this:

{
   "metadata": {
       "transaction_key": "deprecated",
       "request_id": "d096919b-d443-4024-b209-15e17a428c39",
       "sha256": "4e38ba6f8b9adb70460ce99bf2c210e131966bb7346db491524a98b347fbd99e",
       "created": "2023-03-21T20:56:35.196Z",
       "duration": 1297.1102,
       "channels": 1,
       "models": [
           "125125fb-e391-458e-a227-a60d6426f5d6"
       ],
        [...]
    },
   "results": {
       "channels": [
           {
               "alternatives": [
                   {
                       "transcript": "This is a Liberty VOXX recording. All Liberty VOXX recordings are in the public domain. For more information, please visit libertyvox dot org. Record it by Sherry Kraulder. Emma by Jane Austin. Chapter one. Emma would host handsome, clever, and rich with a comfortable home and happy disposition seemed to unite some of the best blessings of existence and had lived nearly twenty one years in the world with very little to distress or vex her [...] End of chapter one.",
                       "confidence": 0.99851304,
                       "words": [
                           {
                               "word": "this",
                               "start": 1.5592201,
                               "end": 1.9190401,
                               "confidence": 0.988739,
                               "punctuated_word": "This"
                           }, 
                                [...]

Note that the JSON above is abridged for the sake of brevity. Nevertheless, we have word-level timestamps alongside a full transcript. 
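
If you'd like to pull those pieces back out of the JSON programmatically, a few lines will do it. This sketch assumes the structure shown above; huberman_podcast.json is a hypothetical filename, so use whatever you saved:

# Pull the full transcript and the per-word timestamps back out of a results file.
import json

with open('./huberman_podcast.json') as f:
    data = json.load(f)

alternative = data['results']['channels'][0]['alternatives'][0]

print(alternative['transcript'])

for word in alternative['words']:
    print(f"{word['punctuated_word']}: {word['start']:.2f}s - {word['end']:.2f}s")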

And if you’d like sentence-level or paragraph-level timestamps, those are readily available as well. Just modify the parameters appropriately. Check out more details here!
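
For instance, a request for paragraph- and utterance-level structure might look roughly like the options below; the exact option names and behavior are covered in the docs linked above:

# Illustrative only: ask for paragraph- and utterance-level groupings in the
# response, on top of the word-level timestamps you already get.
PARAMS = {
    'punctuate': True,
    'tier': 'enhanced',
    'paragraphs': True,   # group words into paragraphs with their own timestamps
    'utterances': True,   # group words into sentence-sized utterances
}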

And that’s it! By this point, you should have a full transcription, metadata, and word-level timestamps for each of the YouTube videos you downloaded.

So, you've got your videos transcribed, and you're probably wondering what to do next. Well, might I suggest:

  • Use our diarize tool to parse conversations. You may even analyze which late-night TV talk show hosts allow their guests to talk the most! (By the way, the diarize tool also helps you analyze podcast conversations.)

  • Run your transcript through Google Translate API to create a translation of your original video.

  • Build your own closed-captioning tool! (There's a rough sketch of one approach after this list.)

  • Or, if you end up using our live-transcription feature, you can create closed-captions live as well!
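
On the closed-captioning idea, here's a rough sketch of one way to turn the word-level timestamps into an .srt file. The seven-words-per-caption rule and the filenames are purely illustrative choices:

# A rough sketch: turn Deepgram word-level timestamps into a simple .srt file.
import json

def to_srt_time(seconds):
    # SRT timestamps look like HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

with open('./huberman_podcast.json') as f:
    words = json.load(f)['results']['channels'][0]['alternatives'][0]['words']

captions = []
for i in range(0, len(words), 7):  # seven words per caption, just as an example
    chunk = words[i:i + 7]
    text = ' '.join(w['punctuated_word'] for w in chunk)
    captions.append(
        f"{i // 7 + 1}\n{to_srt_time(chunk[0]['start'])} --> {to_srt_time(chunk[-1]['end'])}\n{text}\n"
    )

with open('./huberman_podcast.srt', 'w') as f:
    f.write('\n'.join(captions))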
