Detecting speaker changes in an audio recording — Speaker Diarization with the Deepgram API

Phone calls, meetings, and other audio recordings frequently feature multiple speakers on a single channel. To automatically distinguish between two people speaking on the phone, or between members of an engineering team going through their daily standup, Deepgram’s API offers the diarize parameter.

Setting diarize to true in a request assigns each word of the resulting transcript a speaker number, starting with 0. Diarization is available in both batch and real-time modes.

For example, to diarize an audio file through the /listen endpoint with cURL, you’d set diarize=true and swap in your username and password. In this example, the request defaults to the Deepgram General model. To use a different model, specify it as model={your-model}:

Diarizing a Hosted file

    curl \
    -X POST \
    -u USERNAME:PASSWORD \
    -H "Content-Type: application/json" \
    -d '{"url": "https://www.deepgram.com/examples/interview_speech-analytics.wav"}' \
    "https://brain.deepgram.com/v2/listen?diarize=true"

Diarizing a Local file

    curl \
    -X POST \
    -u USERNAME:PASSWORD \
    -H "Content-Type: audio/wav" \
    --data-binary @path/to/myaudio.wav \
    "https://brain.deepgram.com/v2/listen?diarize=true"

Shortly after, you’ll receive a response that looks like the following:

{
  ...
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "hi jan hey sam how are you",
            "confidence": 0.86951274,
            "words": [
              {
                "word": "hi",
                "start": 0.29923075,
                "end": 0.41892305,
                "confidence": 0.99851257,
                "speaker": 0
              },
              {
                "word": "jan",
                "start": 0.41892305,
                "end": 0.5785128,
                "confidence": 0.9897183,
                "speaker": 0
              },
              {
                "word": "hey",
                "start": 0.5785128,
                "end": 0.8178974,
                "confidence": 0.9984775,
                "speaker": 1
              },
              {
                "word": "sam",
                "start": 0.8178974,
                "end": 1.057282,
                "confidence": 0.97590375,
                "speaker": 1
              },
              {
                "word": "how",
                "start": 1.057282,
                "end": 1.557282,
                "confidence": 0.9286296,
                "speaker": 1
              },
              {
                "word": "are",
                "start": 1.6557435,
                "end": 2.1557436,
                "confidence": 0.957287,
                "speaker": 1
              },
              {
                "word": "you",
                "start": 2.17441,
                "end": 2.4137948,
                "confidence": 0.9817473,
                "speaker": 1
              }
            ]
          }
        ]
      }
    ]
  }
}
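As a rough illustration of how the nesting in this response can be navigated, the sketch below walks results → channels → alternatives → words and sums the time each speaker is talking. The helper name speaker_talk_time is hypothetical, and the sample data is a trimmed copy of the response above; it assumes the response has already been parsed into a Python dict.

```python
from collections import defaultdict


def speaker_talk_time(response, channel=0):
    """Sum (end - start) in seconds per speaker for the first alternative of one channel."""
    words = response["results"]["channels"][channel]["alternatives"][0]["words"]
    totals = defaultdict(float)
    for w in words:
        totals[w["speaker"]] += w["end"] - w["start"]
    return dict(totals)


# Trimmed copy of the example response (first three words only).
sample = {"results": {"channels": [{"alternatives": [{"words": [
    {"word": "hi", "start": 0.29923075, "end": 0.41892305, "speaker": 0},
    {"word": "jan", "start": 0.41892305, "end": 0.5785128, "speaker": 0},
    {"word": "hey", "start": 0.5785128, "end": 0.8178974, "speaker": 1},
]}]}]}}

print(speaker_talk_time(sample))
```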

You’ll notice that the speaker labels are assigned at the word level. To piece together a transcript from this output, we’ve created a Python script for you to use.
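One way such stitching could work is sketched here. This is not the script mentioned above, only a minimal illustration: it groups consecutive words that share a speaker number into turns. The function names words_to_turns and format_transcript are assumptions made for this example, and the input is the words list from the response shown earlier (reduced to the two fields used).

```python
from itertools import groupby


def words_to_turns(words):
    """Collapse consecutive words with the same speaker into (speaker, text) turns."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in group)))
    return turns


def format_transcript(words):
    """Render one line per speaker turn, e.g. 'Speaker 0: hi jan'."""
    return "\n".join(f"Speaker {s}: {text}" for s, text in words_to_turns(words))


# Words from the example response, keeping only the fields used here.
words = [
    {"word": "hi", "speaker": 0},
    {"word": "jan", "speaker": 0},
    {"word": "hey", "speaker": 1},
    {"word": "sam", "speaker": 1},
    {"word": "how", "speaker": 1},
    {"word": "are", "speaker": 1},
    {"word": "you", "speaker": 1},
]

print(format_transcript(words))
# Speaker 0: hi jan
# Speaker 1: hey sam how are you
```

Grouping only consecutive words means a speaker who talks again later gets a new turn, which keeps the transcript in chronological order.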

Speaker Diarization can identify a maximum of 10 speakers, so max_speakers defaults to 10. If you know there are fewer speakers and would like to experiment, you can set the max_speakers parameter alongside diarize to reduce the number of possible speakers. This is optional, and it may (but is not guaranteed to) improve performance.
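As a small sketch of how the query string might be assembled when combining these parameters, the helper below builds the request URL with Python’s standard library. The function name build_listen_url is hypothetical; the endpoint and the diarize, max_speakers, and model parameter names are taken from the text above. It only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Endpoint as used in the cURL examples above.
LISTEN_URL = "https://brain.deepgram.com/v2/listen"


def build_listen_url(max_speakers=None, model=None):
    """Build a /listen URL with diarize=true and any optional parameters supplied."""
    params = {"diarize": "true"}
    if max_speakers is not None:
        if not 1 <= max_speakers <= 10:
            # The text above states that diarization identifies at most 10 speakers.
            raise ValueError("max_speakers must be between 1 and 10")
        params["max_speakers"] = max_speakers
    if model is not None:
        params["model"] = model
    return f"{LISTEN_URL}?{urlencode(params)}"


print(build_listen_url(max_speakers=2))
# https://brain.deepgram.com/v2/listen?diarize=true&max_speakers=2
```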

In addition to designating speaker changes with diarize, Deepgram’s API users can take advantage of the multichannel parameter to separate the resulting transcripts by channel. This feature is also detailed in our documentation.