Speaker identification

Last updated: 2022-12-08Contributors
Edit this page

The Speaker Identification API is used in order to determine "who speaks when" for a given media file. It is intended to be used to identify speakers who have been previously enrolled. The Speaker Identification API segments an audio clip into a sequence of utterances, each corresponding to a unique speaker. It then attempts to determine the identity of each speaker based upon their voice print. In cases where the speaker is ambiguous or unknown, utterances are marked with a status of UserNotIdentified.

Identifying speakers

Request parameters

To identify speakers using the Speaker Identification API, one must formulate a request using the following request parameters:

Parameter Type Description
encoding String Encoding of audio file like MP3, WAV etc.
languageCode String Language spoken in the audio file. Default: en-US
audioType String Type of the audio based on number of speakers. Allowed values are: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference. Optional, but is useful as a hint to aid in the identification process.
speakerIds List[String] List of previously enrolled speakers to identify in the media file.
contentUri String Publicly facing URL where the media file can be accessed.
source String Source of the audio file, e.g.: Phone, RingCentral, GoogleMeet, Zoom etc

To properly identify speakers in a media file, speakers must have been previously enrolled so that their voice print can be compared to the speakers in the audio file. It also relies on the developer having some knowledge of who the likely speakers are in the media file being processed, and providing a list of those speakers in their request.

Sample code

After you have setup a simple web server to process the response, copy and paste the code from below in index.js and makesure to edit the variables in ALL CAPS to ensure your code runs properly.

const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

MEDIA_URL   = '<INSERT URL TO MEDIA FILE>';
WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';
SPEAKERS = [
    "ringcentral_test",
    "speakerId2"
]

// Initialize the RingCentral SDK and Platform
const rcsdk = new RC({
    'server':       process.env.RC_SERVER_URL,
    'clientId':     process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});

const platform = rcsdk.platform();

// Authenticate with RingCentral Developer Platdorm using Developer's JWT Credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Speaker Enrollment API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
    identifySpeakers();
})

async function identifySpeakers() {
    try {
        console.log("Enrolling speaker using RingCentral Enrollment API");
        let resp = await platform.post("/ai/audio/v1/enrollments?webhook=" + WEBHOOK_URL, {
            "contentUri":   MEDIA_URL,
            "encoding":     "Mpeg",
            "languageCode": "en-US",
            "source":       "RingCentral",
            "audioType":    "Meeting",
            "speakerIds":   SPEAKERS
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    } 
    catch (e) {
        console.log("An error occurred : " + e.message);
    }
}

Example response

{
    "utterances": [
        {
            "start": 0,
            "end": 6,
            "speakerId": "ringcentral_test",
            "confidence": 35.42294
        },
        {
            "start": 6,
            "end": 12,
            "speakerId": "ringcentral_test",
            "confidence": 36.98796
        },
        {
            "start": 12,
            "end": 18.0,
            "speakerId": "ringcentral_test",
            "confidence": 25.51731
        }
    ]
}

Each utterance contains the following information and parameters:

Parameter Type Description
speakerId String speaker id of the identified speaker.
start Float Start of the audio segment.
end Float end of the audio segment.
confidence Number Confidence of speaker identification.