Speaker diarization
Speaker diarization is the process of partitioning an audio stream into homogeneous segments according to speaker identity. It answers the question "who spoke when?" This API splits an audio clip into speech segments and tags each segment with a speaker ID. It also supports speaker identification for speakers who were previously enrolled using the Speaker Enrollment API.
Using the Diarization API
For the best results we recommend following these guidelines.

- The `audioType` parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to `CallCenter` when 2-3 speakers are expected and to `Meeting` when 4-6 speakers are expected.
- Set the `enableVoiceActivityDetection` parameter to `True` if you want silence and noise segments removed from the diarization output. We suggest setting it to `True` in most circumstances.
- Setting the `source` parameter helps optimize the diarization process by selecting a specialized acoustic model built for the corresponding audio source.
- For proper speaker identification, make sure all speakers in the media file have been previously enrolled, and include them in the `speakerIds` parameter.
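As an illustration, the guidelines above can be folded into a small helper that assembles a request payload. This is a sketch only; the helper name `build_diarization_payload` and its defaults are ours, not part of any RingCentral SDK:

```python
def build_diarization_payload(content_uri, expected_speakers,
                              source="RingCentral", language="en-US"):
    """Assemble a diarization request payload following the guidelines above.

    Illustrative helper; adjust fields to your audio source.
    """
    # Hint the audio type from the expected speaker count:
    # CallCenter for 2-3 speakers, Meeting for 4-6.
    audio_type = "CallCenter" if expected_speakers <= 3 else "Meeting"
    return {
        "contentUri": content_uri,
        "encoding": "Mpeg",
        "languageCode": language,
        "source": source,
        "audioType": audio_type,
        "speakerCount": expected_speakers,
        # Recommended in most circumstances: drop silence/noise segments
        "enableVoiceActivityDetection": True,
    }
```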
Request parameters
Parameter | Type | Description |
---|---|---|
`encoding` | String | Encoding of the audio file, e.g. MP3, WAV. |
`languageCode` | String | Language spoken in the audio file. Default: "en-US". |
`separateSpeakerPerChannel` | Boolean | Set to `True` if the input audio is multi-channel and each channel has a separate speaker. Optional. Default: `False`. |
`speakerCount` | Number | Number of speakers in the file. Optional. |
`audioType` | String | Type of the audio based on the number of speakers. Optional. Permitted values: `CallCenter` (default), `Meeting`, `EarningsCalls`, `Interview`, `PressConference`. |
`speakerIds` | List[String] | Set of enrolled speakers to be identified in the call. Optional. |
`contentUri` | String | Publicly accessible URL of the audio file. |
`source` | String | Source of the audio file, e.g. `Phone`, `RingCentral`, `GoogleMeet`, `Zoom`. Optional. |
`enableVoiceActivityDetection` | Boolean | Apply voice activity detection. Optional. Default: `False`. |
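Before sending a request, it can be helpful to sanity-check the payload against the table above. The sketch below is purely illustrative (the `validate_payload` helper and its rules are ours, not part of the API):

```python
# Permitted audioType values from the table above
PERMITTED_AUDIO_TYPES = {
    "CallCenter", "Meeting", "EarningsCalls", "Interview", "PressConference"
}

def validate_payload(payload):
    """Raise ValueError for obviously malformed payloads before sending."""
    # The API accepts audio either as a public URL or as base64 content
    if "contentUri" not in payload and "content" not in payload:
        raise ValueError("Provide either contentUri or base64-encoded content")
    audio_type = payload.get("audioType", "CallCenter")
    if audio_type not in PERMITTED_AUDIO_TYPES:
        raise ValueError(f"Unsupported audioType: {audio_type}")
    return True
```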
Example code
After you have set up a simple web server to process the response, copy and paste the code below into `index.js`, and make sure to edit the variables in ALL CAPS so that your code runs properly.
```javascript
const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and platform
const rcsdk = new RC({
  'server': process.env.RC_SERVER_URL,
  'clientId': process.env.RC_CLIENT_ID,
  'clientSecret': process.env.RC_CLIENT_SECRET
});
const platform = rcsdk.platform();

// Authenticate with the RingCentral Developer Platform using the developer's JWT credential
platform.login({
  'jwt': process.env.RC_JWT
});

// Call the Speaker Diarization API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
  detectSpeaker();
});

async function detectSpeaker() {
  try {
    console.log("Calling RingCentral Speaker Diarization API");
    let resp = await platform.post("/ai/audio/v1/async/speaker-diarize?webhook=" + WEBHOOK_URL, {
      "contentUri": MEDIA_URL,
      "encoding": "Mpeg",
      "languageCode": "en-US",
      "source": "RingCentral",
      "audioType": "Meeting",
      "separateSpeakerPerChannel": false,
      "speakerCount": 0,
      "enableVoiceActivityDetection": true
    });
    console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
  }
  catch (e) {
    console.log("An error occurred: " + e.message);
  }
}
```
Run your sample code.
```shell
$ node index.js
```
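The simple web server mentioned above receives the diarization result as an HTTP POST when the job completes. A minimal receiver sketch using only the Python standard library might look like this (the class name, port, and processing are illustrative; adapt them to your setup):

```python
# Minimal webhook receiver sketch for the diarization result.
# Illustrative only; not part of the RingCentral SDK.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class DiarizationWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body posted to the webhook URL
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        if body.get("status") == "Success":
            count = body["response"].get("speakerCount")
            print(f"Diarization complete: {count} speaker(s) detected")
        # Acknowledge receipt with a 200 so the platform does not retry
        self.send_response(200)
        self.end_headers()

# To run: HTTPServer(("", 8080), DiarizationWebhook).serve_forever()
```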
The same request can also be made in Python:

```python
import requests
import base64

url = "https://platform.ringcentral.com/ai/audio/v1/async/speaker-diarize"
querystring = {"webhook": "<webhookUrl>"}

payload = {
    "encoding": "Mpeg",
    "languageCode": "en-US",
    "source": "RingCentral",
    "audioType": "Meeting",
    "separateSpeakerPerChannel": False,
    "speakerCount": 2,
    "speakerIds": [
        "speakerId1",
        "speakerId2"
    ],
    "enableVoiceActivityDetection": True,
}

# The API accepts audio either as a URL or as base64-encoded content.
# Passing the audio as a contentUri:
payload["contentUri"] = "https://publicly-facing-url.mp3"

# Alternatively, passing the audio inline as content:
audioFileName = "<INSERT YOUR AUDIO FILE NAME>"
with open(audioFileName, 'rb') as fin:
    audioContent = fin.read()
payload["content"] = base64.b64encode(audioContent).decode('utf-8')

headers = {
    'Content-Type': "application/json",
}

response = requests.post(url, json=payload, headers=headers, params=querystring)
print(response.status_code)
```
Example response
```json
{
  "status": "Success",
  "response": {
    "speakerCount": 2,
    "utterances": [
      {
        "speakerId": "JohnDoe",
        "start": 0.3,
        "end": 5.1,
        "confidence": 0.97
      },
      ...
    ]
  }
}
```
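The `utterances` list in the response can be rendered as a simple "who spoke when" timeline. A minimal sketch (the speaker names and the `format_timeline` helper are illustrative):

```python
# Sample utterances mirroring the shape of the example response above
utterances = [
    {"speakerId": "JohnDoe", "start": 0.3, "end": 5.1, "confidence": 0.97},
    {"speakerId": "JaneRoe", "start": 5.4, "end": 9.8, "confidence": 0.92},
]

def format_timeline(utterances):
    """Format diarization utterances as one timeline line per segment."""
    return [
        f'{u["start"]:6.1f}s - {u["end"]:6.1f}s  {u["speakerId"]}'
        for u in utterances
    ]

for line in format_timeline(utterances):
    print(line)
```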