Extract interaction analytics from a media file
Interaction analytics is used to understand a conversation between two or more people in a meeting and to extract meaningful insights from it at scale. This API is comprehensive: in addition to its unique capabilities, it bundles functionality found in our other APIs. When processing a media file, this API provides multiple levels of insight, including:
- Conversation insights
    - transcription with smart punctuation
    - content summaries
    - keywords and conversation metrics
- Speaker-level insights
- Utterance-level insights
    - emotion recognition
Let's say we want to analyze a twenty-minute meeting between a sales rep and a customer. Here are some of the insights we can extract using this API:
- Speaker contribution, e.g. the sales rep spoke for twelve minutes and the customer spoke for eight minutes.
- Speaker pace, e.g. words spoken per minute (see the sketch after this list).
- Speaker emotions, e.g. the tone or emotional context of every utterance.
- Auto-generated meeting summary
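To make the first two metrics concrete, here is a back-of-the-envelope sketch (not an API call) of how speaker contribution and pace fall out of utterance timings. The numbers are invented sample data for illustration only:

```python
# Illustration only: deriving speaker contribution and pace from
# utterance timings. The sample data below is invented.
utterances = [
    {"speakerId": "rep",      "start": 0.0,   "end": 720.0,  "words": 1800},
    {"speakerId": "customer", "start": 720.0, "end": 1200.0, "words": 960},
]

total = sum(u["end"] - u["start"] for u in utterances)
for u in utterances:
    talk = u["end"] - u["start"]
    print(u["speakerId"],
          f"contribution={talk / total:.0%}",          # e.g. rep spoke 60% of the time
          f"pace={u['words'] / (talk / 60):.0f} wpm")  # words spoken per minute
```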
Extracting interaction analytics
For the best results we recommend following these guidelines.

- The `audioType` parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to `CallCenter` when 2-3 speakers are expected to be identified and to `Meeting` when 4-6 speakers are expected.
- Set the `enableVoiceActivityDetection` parameter to `True` if you want silence and noise segments removed from the diarization output. We suggest setting it to `True` in most circumstances.
- Setting the `source` parameter helps optimize the diarization process by selecting a specialized acoustic model built specifically for the corresponding audio source.
- For proper speaker identification, make sure you have previously enrolled all speakers in the media file and include them in the `speakerIds` parameter.
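Putting these guidelines together, a request body might look like the following sketch. The `contentUri` and speaker IDs are placeholders, and `speakerIds` assumes those speakers were enrolled beforehand:

```python
# A sketch of a request body that follows the guidelines above.
# The contentUri and speaker IDs are placeholders, not real resources.
payload = {
    "contentUri": "https://example.com/recordings/team-meeting.wav",
    "encoding": "Wav",
    "languageCode": "en-US",
    "source": "RingCentral",               # lets the service pick a matching acoustic model
    "audioType": "Meeting",                # 4-6 speakers expected; use "CallCenter" for 2-3
    "enableVoiceActivityDetection": True,  # drop silence/noise from the diarization output
    "speakerIds": ["enrolled-speaker-1", "enrolled-speaker-2"],  # previously enrolled speakers
    "insights": ["All"],
}
```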
Request parameters
Parameter | Type | Description |
---|---|---|
`encoding` | String | Encoding of the audio file, e.g. MP3, WAV. |
`sampleRate` | Number | Sample rate of the audio file. Optional. |
`languageCode` | String | Language spoken in the audio file. Default: "en-US". |
`separateSpeakerPerChannel` | Boolean | Set to `True` if the input audio is multi-channel and each channel has a separate speaker. Optional. Default: `False`. |
`speakerCount` | Number | Number of speakers in the file. Optional. |
`audioType` | String | Type of the audio, based on the number of speakers. Optional. Permitted values: `CallCenter`, `Meeting`, `EarningsCalls`, `Interview`, `PressConference`. |
`speakerIds` | List[String] | Set of speakers to be identified from the call. Optional. |
`enableVoiceActivityDetection` | Boolean | Apply voice activity detection. Optional. Default: `False`. |
`contentUri` | String | Publicly accessible URL of the media file. |
`source` | String | Source of the audio file, e.g. `Phone`, `RingCentral`, `GoogleMeet`, `Zoom`. Optional. |
`insights` | List[String] | List of insights to be computed. Send `['All']` to extract all analytics. Permitted values: `All`, `KeyPhrases`, `Emotion`, `AbstractiveSummaryLong`, `AbstractiveSummaryShort`, `ExtractiveSummary`, `TalkToListenRatio`, `Energy`, `Pace`, `QuestionsAsked`, `Title`, `Tasks`. |
Example code
After you have set up a simple web server to process the response (a minimal receiver sketch appears after the Python example below), copy and paste the code below into index.js
and make sure to edit the variables in ALL CAPS so that your code runs properly.
const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and platform
const rcsdk = new RC({
    'server': process.env.RC_SERVER_URL,
    'clientId': process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});
const platform = rcsdk.platform();

// Log in to the developer platform using the developer's JWT credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Interaction Analysis API right after login, asynchronously
platform.on(platform.events.loginSuccess, () => {
    analyzeInteraction();
});

async function analyzeInteraction() {
    try {
        // URL-encode the webhook address so it survives as a query parameter
        let resp = await platform.post("/ai/insights/v1/async/analyze-interaction?webhook=" + encodeURIComponent(WEBHOOK_URL), {
            "contentUri": MEDIA_URL,
            "encoding": "Wav",
            "languageCode": "en-US",
            "source": "RingCentral",
            "audioType": "Meeting",
            "insights": [ "All" ],
            "enableVoiceActivityDetection": true,
            "enablePunctuation": true,
            "enableSpeakerDiarization": false
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    }
    catch (e) {
        console.log("An error occurred: " + e.message);
    }
}
You are almost done. Now run your script to make the request and receive the response.
$ node index.js
The same request in Python. Save the code below to app.py and make sure the same environment variables are set.

import os
from ringcentral import SDK
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Invoke the Interaction Analysis API
def analyzeInteractions():
    # Endpoint of the interaction analysis API
    endpoint = os.getenv('RC_SERVER_URL') + "/ai/insights/v1/async/analyze-interaction"
    # Webhook passed as a query string
    querystring = {"webhook": os.getenv('WEBHOOK_ADDRESS')}
    # Payload
    payload = {
        "contentUri": "https://github.com/suyashjoshi/ringcentral-ai-demo/blob/master/public/audio/sample1.wav?raw=true",
        "encoding": "Wav",
        "languageCode": "en-US",
        "source": "RingCentral",
        "audioType": "Meeting",
        "insights": ["All"],
        "enableVoiceActivityDetection": True,
        "enablePunctuation": True,
        "enableSpeakerDiarization": False
    }
    try:
        # Instantiate the RingCentral SDK
        rcsdk = SDK(os.getenv('RC_CLIENT_ID'), os.getenv('RC_CLIENT_SECRET'), os.getenv('RC_SERVER_URL'))
        platform = rcsdk.platform()
        # Log in using a JWT credential
        platform.login(jwt=os.getenv('RC_JWT'))
        # POST to the interaction analysis endpoint with the query string and payload
        response = platform.post(endpoint, payload, querystring)
        print(response.json())
    except Exception as e:
        print(e)

try:
    analyzeInteractions()
except Exception as e:
    print(e)
Run Your Code
You are almost done. Now run your script to make the request and receive the response.
$ python3 app.py
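Both examples assume a web server is already listening at the webhook URL to receive the result. Below is a minimal sketch of such a receiver using only the Python standard library; the port and path are assumptions, and a production receiver would need validation and error handling:

```python
# Minimal webhook receiver sketch using only the standard library.
# The port (8080) is an assumption; point WEBHOOK_ADDRESS at this server.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body delivered by the API
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        print(json.dumps(body, indent=2))  # inspect the analysis result
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```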
Example response
{
"status": "Success",
"response": {
"utteranceInsights": [
{
"start": 2.52,
"end": 6.53,
"text": "Could produce large hail isolated tornadoes and heavy rain.",
"confidence": 0.93,
"speakerId": "1",
"insights": [
{
"name": "Emotion",
"value": "Neutral",
"confidence": 0.7
}
]
}
],
"speakerInsights": {
"speakerCount": 2,
"insights": [
{
"name": "Energy",
"values": [
{
"speakerId": "0",
"value": 86.64
},
{
"speakerId": "1",
"value": 62.69
}
]
},
{
"name": "TalkToListenRatio",
"values": [
{
"speakerId": "0",
"value": "32:68"
},
{
"speakerId": "1",
"value": "68:32"
}
]
},
{
"name": "QuestionsAsked",
"values": [
{
"speakerId": "0",
"value": 0,
"questions": []
},
{
"speakerId": "1",
"value": 0,
"questions": []
}
]
}
]
},
"conversationalInsights": [
{
"name": "KeyPhrases",
"values": []
},
{
"name": "ExtractiveSummary",
"values": [
{
"value": "Could produce large hail isolated tornadoes and heavy rain.",
"start": 2.52,
"end": 6.53,
"speakerId": "1",
"confidence": 0.51
}
]
},
{
"name": "Topics",
"values": []
},
{
"name": "Tasks",
"values": []
},
{
"name": "AbstractiveSummaryLong",
"values": []
},
{
"name": "AbstractiveSummaryShort",
"values": []
}
]
}
}
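Once a payload like the one above arrives at your webhook, pulling insights out of it is plain dictionary traversal. A short sketch, assuming the JSON has been parsed into a `result` dict shaped like the example:

```python
# Sketch: walk the example response, assuming it is parsed into `result`.
def summarize(result):
    resp = result["response"]
    # Utterance-level insights: transcript text plus per-utterance emotion
    for u in resp["utteranceInsights"]:
        emotions = [i["value"] for i in u["insights"] if i["name"] == "Emotion"]
        print(f'[{u["start"]:.2f}-{u["end"]:.2f}] speaker {u["speakerId"]}: '
              f'{u["text"]} (emotion: {", ".join(emotions) or "n/a"})')
    # Speaker-level insights: one value per speaker per insight
    for insight in resp["speakerInsights"]["insights"]:
        for v in insight["values"]:
            print(f'{insight["name"]} for speaker {v["speakerId"]}: {v["value"]}')
```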
NOTES:
- In the case of `ExtractiveSummary`, the start and end times refer to the exact time of the segment.
- In the case of `AbstractiveSummaryLong` and `AbstractiveSummaryShort`, the start and end times refer to the time span of the text blob that was abstracted.
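Since both summary types carry start and end offsets in seconds, a small helper can turn them into clock positions for jumping around the recording. A sketch, using the extractive summary entry from the example response:

```python
# Sketch: format start/end offsets (seconds) as H:MM:SS positions.
def clock(seconds):
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h}:{m:02d}:{s:02d}"

values = [{"start": 2.52, "end": 6.53,
           "value": "Could produce large hail isolated tornadoes and heavy rain."}]
for item in values:
    print(f'{clock(item["start"])} - {clock(item["end"])}: {item["value"]}')
```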
Interaction-Analytics-Object
Parameter | Type | Description |
---|---|---|
`utteranceInsights` | List[Utterance-Insights-Object] | List of utterances and the insights computed for each utterance. |
`speakerInsights` | Object | The set of insights computed for each speaker separately. |
`conversationalInsights` | List[Conversational-Insights-Object] | List of insights computed by analyzing the conversation as a whole. |
Utterance-Insights-Object
Parameter | Type | Description |
---|---|---|
`speakerId` | String | The speaker ID for the corresponding audio segment. |
`start` | Number | Start time of the audio segment in seconds. |
`end` | Number | End time of the audio segment in seconds. |
`text` | String | The transcription output corresponding to the segment. |
`confidence` | Number | The confidence score for the transcribed segment. |
`insights` | List[Utterance-Insights-Unit] | List of utterance-level insights. |
Utterance-Insights-Unit
Parameter | Type | Description |
---|---|---|
`name` | String Enum | Name of the insight. Possible value: `Emotion`. |
`value` | String | Value corresponding to the insight. For `Emotion`, possible values: Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, Neutral. |
`confidence` | Number | Confidence score. Optional. |
Speaker-Insights-Object
Parameter | Type | Description |
---|---|---|
`speakerCount` | Number | Number of speakers detected. If `speakerCount` isn't set in the request, the number of speakers is estimated algorithmically. |
`insights` | List[Speaker-Insights-Unit] | List of speaker-level insights. Each insight is computed separately for each speaker. |
Speaker-Insights-Unit
Parameter | Type | Description |
---|---|---|
`name` | String Enum | Name of the insight. Possible values: `Energy`, `Pace`, `TalkToListenRatio`, `QuestionsAsked` (the last also appears per speaker in the example response above). |
`values` | List[Speaker-Insights-Value-Unit] | Values corresponding to the insight, one per speaker. |
Speaker-Insights-Value-Unit
Parameter | Type | Description |
---|---|---|
`speakerId` | String | The speaker ID for whom the insight is computed. |
`value` | Number | The computed value of the insight for this speaker (a String ratio such as "32:68" in the case of `TalkToListenRatio`). |
Timed-Segment
Parameter | Type | Description |
---|---|---|
`start` | Number | Start time of the audio segment in seconds. |
`end` | Number | End time of the audio segment in seconds. |
Conversational-Insights-Object
Parameter | Type | Description |
---|---|---|
`name` | String Enum | Name of the insight. Possible values: `AbstractiveSummaryLong`, `AbstractiveSummaryShort`, `ExtractiveSummary`, `KeyPhrases`, `Tasks`, `Titles`, `QuestionsAsked` |
`values` | List[Conversational-Insights-Value-Unit] | Values corresponding to the insight. |
Conversational-Insights-Value-Unit
Parameter | Type | Description |
---|---|---|
`start` | Number | Start time of the audio segment in seconds. |
`end` | Number | End time of the audio segment in seconds. |
`value` | String | The output corresponding to the insight. |
`confidence` | Number | The confidence score for the computed insight. |
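If you want type hints over these objects, the tables above map naturally onto TypedDicts. A partial sketch covering a few of the objects; field optionality follows the tables:

```python
# Partial sketch: some of the object tables above expressed as TypedDicts.
from typing import List, TypedDict

class UtteranceInsightUnit(TypedDict, total=False):
    name: str          # e.g. "Emotion"
    value: str
    confidence: float  # optional per the table

class UtteranceInsight(TypedDict):
    speakerId: str
    start: float       # seconds
    end: float         # seconds
    text: str
    confidence: float
    insights: List[UtteranceInsightUnit]

class TimedSegment(TypedDict):
    start: float       # seconds
    end: float         # seconds
```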