Speaker diarization

Last updated: 2023-01-20

Speaker diarization is the process of partitioning an audio stream into homogeneous segments according to speaker identity; it answers the question "who spoke when." This API splits an audio clip into speech segments and tags each segment with a speaker ID. It can also identify enrolled speakers by their speaker IDs if they were previously enrolled using the Speaker Enrollment API.

Using the Diarization API

For the best results, we recommend following these guidelines.

  • The audioType parameter gives the system a hint about the nature of the audio, which helps improve accuracy. We recommend setting it to CallCenter when 2-3 speakers are expected and Meeting when 4-6 speakers are expected.

  • Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We suggest setting it to True in most circumstances.

  • Setting the source parameter helps optimize the diarization process by letting the service use a specialized acoustic model built for the corresponding audio source.

  • For proper speaker identification, make sure you have previously enrolled every speaker in the media file and include them in the speakerIds parameter, as shown in the sketch after this list.
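
For illustration, here is a hypothetical request body that follows these guidelines for a two-speaker support call. The media URL and the enrolled speaker IDs (agentId, customerId) are placeholders, not real identifiers.

# A hypothetical payload following the guidelines above; replace the
# placeholder URL and speaker IDs with your own values.
payload = {
    "contentUri":                   "https://example.com/support-call.mp3",  # placeholder URL
    "encoding":                     "Mpeg",
    "languageCode":                 "en-US",
    "source":                       "Phone",
    "audioType":                    "CallCenter",               # 2-3 speakers expected
    "enableVoiceActivityDetection": True,                       # drop silence and noise segments
    "speakerCount":                 2,
    "speakerIds":                   ["agentId", "customerId"]   # previously enrolled speakers
}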

Request parameters

| Parameter | Type | Description |
|---|---|---|
| encoding | String | Encoding of the audio file, e.g. MP3 or WAV. |
| languageCode | String | Language spoken in the audio file. Default: "en-US". |
| separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Default: False. |
| speakerCount | Number | Number of speakers in the file. Optional. |
| audioType | String | Type of the audio based on the number of speakers. Optional. Permitted values: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference. |
| speakerIds | List[String] | Set of enrolled speakers to be identified in the call. Optional. |
| contentUri | String | Publicly accessible URL of the audio file. |
| source | String | Source of the audio file, e.g. Phone, RingCentral, GoogleMeet, Zoom. Optional. |
| enableVoiceActivityDetection | Boolean | Apply voice activity detection. Optional. Default: False. |
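
Only a few of these fields are needed to start a job; the sketch below relies on the documented defaults for everything else. The media URL, webhook, and access token are placeholders, and omitting the optional parameters is an assumption based on the defaults listed above.

import requests

# Minimal diarization request relying on the documented defaults;
# replace the placeholder URL, webhook, and token with real values.
resp = requests.post(
    "https://platform.ringcentral.com/ai/audio/v1/async/speaker-diarize",
    params={"webhook": "<webhookUrl>"},
    json={
        "contentUri": "https://example.com/recording.mp3",  # placeholder URL
        "encoding":   "Mpeg"
    },
    headers={"Authorization": "Bearer <INSERT YOUR ACCESS TOKEN>"}
)
print(resp.status_code)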

Example code

After you have set up a simple web server to process the response, copy and paste the code below into index.js, making sure to edit the variables in ALL CAPS so the code runs properly.

const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL   = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and Platform
const rcsdk = new RC({
    'server':       process.env.RC_SERVER_URL,
    'clientId':     process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});

const platform = rcsdk.platform();

// Authenticate with the RingCentral Developer Platform using the developer's JWT credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Speaker Diarization API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
    detectSpeaker();
})

async function detectSpeaker() {
    try {
        console.log("Calling RingCentral Speaker Diarization API");
        let resp = await platform.post("/ai/audio/v1/async/speaker-diarize?webhook=" + WEBHOOK_URL, {
            "contentUri":                   MEDIA_URL,
            "encoding":                     "Mpeg",
            "languageCode":                 "en-US",
            "source":                       "RingCentral",
            "audioType":                    "Meeting",
            "separateSpeakerPerChannel":    false,
            "speakerCount":                 0,
            "enableVoiceActivityDetection": true
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    } 
    catch (e) {
        console.log("An error occurred : " + e.message);
    }
}

Run your sample code.

$ node index.js
Alternatively, you can make the same request in Python:

import requests
import base64

url = "https://platform.ringcentral.com/ai/audio/v1/async/speaker-diarize"

querystring = {"webhook":"<webhookUrl>"}

payload = {
    "encoding": "Mpeg",
    "languageCode": "en-US",
    "source": "RingCentral",
    "audioType": "Meeting",
    "separateSpeakerPerChannel": False,
    "speakerCount": 2,
    "speakerIds": [
        "speakerId1",
        "speakerId2"
    ],
    "enableVoiceActivityDetection": True,
}

# The API accepts audio either as a publicly accessible URL or as
# base64-encoded content; provide one of "contentUri" or "content".
# Option 1: pass the audio by URL:
payload["contentUri"] = "https://publicly-facing-url.mp3"

# Option 2: pass the audio as base64-encoded content instead:
# audioFileName = "<PATH TO YOUR AUDIO FILE>"
# with open(audioFileName, 'rb') as fin:
#     audioContent = fin.read()
# payload["content"] = base64.b64encode(audioContent).decode('utf-8')

headers = {
    'Content-Type':  "application/json",
    'Authorization': "Bearer <INSERT YOUR ACCESS TOKEN>"
}

response = requests.post(url, json=payload, headers=headers, params=querystring)
print(response.status_code)

Example response

{
    "status": "Success",
    "response": {
      "speakerCount": 2,
      "utterances": [
        {
          "speakerId": "JohnDoe",
          "start": 0.3,
          "end": 5.1,
          "confidence": 0.97
        },
        ...
      ] 
    }
}
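
Because the job runs asynchronously, the result above is delivered to your webhook URL rather than returned in the HTTP response. Below is a minimal sketch of a webhook receiver; Flask and the /webhook route are assumptions for illustration, and the field names come from the example response above.

from flask import Flask, request

app = Flask(__name__)

# Hypothetical route; it must match the path of the webhook URL
# you passed to the Diarization API.
@app.route('/webhook', methods=['POST'])
def diarization_webhook():
    result = request.get_json()
    if result and result.get('status') == 'Success':
        # Print who spoke, and when, for each diarized segment
        for u in result['response']['utterances']:
            print(f"{u['speakerId']}: {u['start']}s - {u['end']}s "
                  f"(confidence {u['confidence']})")
    return '', 200

if __name__ == '__main__':
    app.run(port=8080)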