Extract interaction analytics from a media file


Interaction analytics is used to understand a conversation between two or more people in a meeting and to extract more meaningful insights from it at scale. This API is comprehensive in that, in addition to its unique capabilities, it also bundles functionality found in our other APIs. When processing a media file, this API provides insights at multiple levels: per utterance, per speaker, and for the conversation as a whole.

Let's say we want to analyze a twenty-minute meeting between a sales rep and a customer. Here are some of the insights we can extract using this API:

  • Speaker contribution, e.g. the sales rep spoke for twelve minutes and the customer spoke for eight.
  • Speaker pace, e.g. words spoken per minute.
  • Speaker emotions, e.g. the tone or emotional context of every utterance.
  • Auto-generated meeting summaries.

Extracting interaction analytics

For the best results we recommend following these guidelines.

  • The audioType parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to CallCenter when 2-3 speakers are expected and to Meeting when 4-6 speakers are expected.

  • Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We suggest setting it to True in most circumstances.

  • Setting the source parameter helps optimize the diarization process by selecting a specialized acoustic model built for the corresponding audio source.

  • For proper speaker identification, make sure you have previously enrolled all speakers appearing in the media file and include them in the speakerIds parameter. A sample request body applying these guidelines is sketched below.
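
For illustration only, a request body that applies these guidelines might look like the following Python dictionary. The media URL and speaker ids are placeholders, and the speakerCount value is an assumption about this particular recording:

request_body = {
    "contentUri":                   "https://www.example.com/recordings/meeting.wav",  # placeholder URL
    "encoding":                     "Wav",
    "languageCode":                 "en-US",
    "source":                       "RingCentral",
    "audioType":                    "Meeting",                     # 4-6 speakers expected
    "speakerCount":                 4,                             # assumed for this recording
    "speakerIds":                   ["speaker-1", "speaker-2"],    # placeholder enrolled speaker ids
    "enableVoiceActivityDetection": True,
    "insights":                     ["All"]
}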

Request parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| encoding | String | Encoding of the audio file, e.g. MP3, WAV. |
| sampleRate | Number | Sample rate of the audio file. Optional. |
| languageCode | String | Language spoken in the audio file. Default: "en-US". |
| separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Default: False. |
| speakerCount | Number | Number of speakers in the file. Optional. |
| audioType | String | Type of the audio based on the number of speakers. Optional. Permitted values: CallCenter, Meeting, EarningsCalls, Interview, PressConference. |
| speakerIds | List[String] | Set of speakers to be identified from the call. Optional. |
| enableVoiceActivityDetection | Boolean | Apply voice activity detection. Optional. Default: False. |
| contentUri | String | Publicly accessible URL of the media file. |
| source | String | Source of the audio file, e.g. Phone, RingCentral, GoogleMeet, Zoom. Optional. |
| insights | List[String] | List of insights to compute. Send ["All"] to extract all analytics. Permitted values: All, KeyPhrases, Emotion, AbstractiveSummaryLong, AbstractiveSummaryShort, ExtractiveSummary, TalkToListenRatio, Energy, Pace, QuestionsAsked, Title, Tasks. |

Example code

After you have set up a simple web server to process the response, copy and paste the code below into index.js, and edit the variables in ALL CAPS so that the code runs properly.
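
Both samples read their credentials from a .env file via dotenv. Below is a minimal sketch of such a file; the variable names match those referenced in the code, and every value is a placeholder you must replace:

RC_SERVER_URL=https://platform.ringcentral.com
RC_CLIENT_ID=<YOUR CLIENT ID>
RC_CLIENT_SECRET=<YOUR CLIENT SECRET>
RC_JWT=<YOUR JWT CREDENTIAL>
RC_MEDIA_URL=<PUBLICLY ACCESSIBLE MEDIA FILE URL>
# WEBHOOK_ADDRESS is read by the Python sample further below
WEBHOOK_ADDRESS=<YOUR WEBHOOK URL>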

const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL   = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and Platform
const rcsdk = new RC({
    'server':       process.env.RC_SERVER_URL,
    'clientId':     process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});

const platform = rcsdk.platform();

// Login into the Developer Portal using Developer's JWT Credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Interaction Analysis API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
    analyzeInteraction();
})

async function analyzeInteraction() {
    try {
        let resp = await platform.post("/ai/insights/v1/async/analyze-interaction?webhook=" + WEBHOOK_URL,{
            "contentUri":                   MEDIA_URL,
            "encoding":                     "Wav",
            "languageCode":                 "en-US",
            "source":                       "RingCentral",
            "audioType":                    "Meeting",
            "insights":                     [ "All" ],
            "enableVoiceActivityDetection": true,
            "enablePunctuation":            true,
            "enableSpeakerDiarization":     false
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    } 
    catch (e) {
        console.log("An Error Occurred : " + e.message);
    }
}

You are almost done. Now run your script to make the request and receive the response.

$ node index.js
The same request can be made in Python. Save the code below to app.py and edit the environment variables accordingly.

import os
from ringcentral import SDK
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Invoke Interaction Analysis API 
def analyzeInteractions():
    # Endpoint to invoke Interaction analysis API 
    endpoint = os.getenv('RC_SERVER_URL')+"/ai/insights/v1/async/analyze-interaction"

    # Webhook as Query string
    querystring = {"webhook":os.getenv('WEBHOOK_ADDRESS')}

    # Payload
    payload = {
        "contentUri": "https://github.com/suyashjoshi/ringcentral-ai-demo/blob/master/public/audio/sample1.wav?raw=true",
        "encoding": "Wav",
        "languageCode": "en-US",
        "source": "RingCentral",
        "audioType": "Meeting",
        "insights": ["All"],
        "enableVoiceActivityDetection": True,
        "enablePunctuation": True,
        "enableSpeakerDiarization": False
    }

    try:
        # Instantiate Ringcentral SDK 
        rcsdk = SDK(os.getenv('RC_CLIENT_ID'), os.getenv('RC_CLIENT_SECRET'), os.getenv('RC_SERVER_URL'))
        platform = rcsdk.platform()

        # Login Using JWT
        platform.login(jwt=os.getenv('RC_JWT'))

        # Make HTTP POST call to the Interaction analysis endpoint with the query string and payload
        response = platform.post(endpoint, payload, querystring)
        print(response.json())

    except Exception as e:  
        print(e)

try:
    analyzeInteractions()
except Exception as e:
    print(e)

Run Your Code

You are almost done. Now run your script to make the request and receive the response.

$ python3 app.py
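
The analysis results are not returned in the POST response; they are delivered asynchronously to the webhook URL supplied in the request, as a JSON body like the example response shown next. The sketch below is one minimal way to receive that delivery, assuming Flask is installed (pip install flask); the route path and port are arbitrary choices, and it does not handle any endpoint validation your webhook setup may require:

from flask import Flask, request

app = Flask(__name__)

# Receive the asynchronous analysis result posted to the webhook URL
@app.route('/webhook', methods=['POST'])
def webhook():
    result = request.get_json(silent=True)
    print(result)
    # Acknowledge receipt so the delivery is not retried
    return '', 200

if __name__ == '__main__':
    app.run(port=8080)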

Example response

{
    "status": "Success",
    "response": {
        "utteranceInsights": [
            {
                "start": 2.52,
                "end": 6.53,
                "text": "Could produce large hail isolated tornadoes and heavy rain.",
                "confidence": 0.93,
                "speakerId": "1",
                "insights": [
                    {
                        "name": "Emotion",
                        "value": "Neutral",
                        "confidence": 0.7
                    }
                ]
            }
        ],
        "speakerInsights": {
            "speakerCount": 2,
            "insights": [
                {
                    "name": "Energy",
                    "values": [
                        {
                            "speakerId": "0",
                            "value": 86.64
                        },
                        {
                            "speakerId": "1",
                            "value": 62.69
                        }
                    ]
                },
                {
                    "name": "TalkToListenRatio",
                    "values": [
                        {
                            "speakerId": "0",
                            "value": "32:68"
                        },
                        {
                            "speakerId": "1",
                            "value": "68:32"
                        }
                    ]
                },
                {
                    "name": "QuestionsAsked",
                    "values": [
                        {
                            "speakerId": "0",
                            "value": 0,
                            "questions": []
                        },
                        {
                            "speakerId": "1",
                            "value": 0,
                            "questions": []
                        }
                    ]
                }
            ]
        },
        "conversationalInsights": [
            {
                "name": "KeyPhrases",
                "values": []
            },
            {
                "name": "ExtractiveSummary",
                "values": [
                    {
                        "value": "Could produce large hail isolated tornadoes and heavy rain.",
                        "start": 2.52,
                        "end": 6.53,
                        "speakerId": "1",
                        "confidence": 0.51
                    }
                ]
            },
            {
                "name": "Topics",
                "values": []
            },
            {
                "name": "Tasks",
                "values": []
            },
            {
                "name": "AbstractiveSummaryLong",
                "values": []
            },
            {
                "name": "AbstractiveSummaryShort",
                "values": []
            }
        ]
    }
}

NOTES:

  • For ExtractiveSummary, the start and end times refer to the exact time of the extracted segment.
  • For AbstractiveSummaryLong and AbstractiveSummaryShort, the start and end times refer to the time span of the text blob that was abstracted.

Interaction-Analytics-Object

| Parameter | Type | Description |
|-----------|------|-------------|
| utteranceInsights | List[Utterance-Insights-Object] | List of utterances and the insights computed for each utterance. |
| speakerInsights | Speaker-Insights-Object | The set of insights computed for each speaker separately. |
| conversationalInsights | List[Conversational-Insights-Object] | List of insights computed by analyzing the conversation as a whole. |

Utterance-Insights-Object

| Parameter | Type | Description |
|-----------|------|-------------|
| speakerId | String | The speaker id for the corresponding audio segment. |
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |
| text | String | The transcription output corresponding to the segment. |
| confidence | Number | The confidence score for the transcribed segment. |
| insights | List[Utterance-Insights-Unit] | List of utterance-level insights. |

Utterance-Insights-Unit

| Parameter | Type | Description |
|-----------|------|-------------|
| name | String Enum | Name of the insight, e.g. Emotion. |
| value | String | Value corresponding to the insight. For Emotion, possible values: Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, Neutral. |
| confidence | Number | Confidence score. Optional. |

Speaker-Insights-Object

| Parameter | Type | Description |
|-----------|------|-------------|
| speakerCount | Number | Number of speakers detected. If speakerCount isn't set in the request, the number of speakers is estimated algorithmically. |
| insights | List[Speaker-Insights-Unit] | List of speaker-level insights. Each insight is computed separately for each speaker. |

Speaker-Insights-Unit

| Parameter | Type | Description |
|-----------|------|-------------|
| name | String Enum | Name of the insight. Possible values: Energy, Pace, TalkToListenRatio. |
| values | List[Speaker-Insights-Value-Unit] | Values corresponding to the insight. |

Speaker-Insights-Value-Unit

| Parameter | Type | Description |
|-----------|------|-------------|
| speakerId | String | The speaker id for whom insights are computed. |
| value | Number | The computed value of the insight for this speaker. |

Timed-Segment

| Parameter | Type | Description |
|-----------|------|-------------|
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |

Conversational-Insights-Object

| Parameter | Type | Description |
|-----------|------|-------------|
| name | String Enum | Name of the insight. Possible values: AbstractiveSummaryLong, AbstractiveSummaryShort, ExtractiveSummary, KeyPhrases, Tasks, Titles, QuestionsAsked. |
| values | List[Conversational-Insights-Value-Unit] | Values corresponding to the insight. |

Conversational-Insights-Value-Unit

| Parameter | Type | Description |
|-----------|------|-------------|
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |
| value | String | The output corresponding to the insight. |
| confidence | Number | The confidence score for the computed insight. |
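
Putting the object model together, the short sketch below walks a parsed response and prints a few of the insights. It assumes the JSON delivered to your webhook has been saved to a file named response.json (a hypothetical file name) and is shaped like the example response above:

import json

# Load a saved analysis result (hypothetical file name)
with open("response.json") as f:
    result = json.load(f)["response"]

# Utterance-level insights: emotion detected for each utterance
for utterance in result["utteranceInsights"]:
    for insight in utterance["insights"]:
        if insight["name"] == "Emotion":
            print(f'Speaker {utterance["speakerId"]}: "{utterance["text"]}" -> {insight["value"]}')

# Speaker-level insights: energy score per speaker
for insight in result["speakerInsights"]["insights"]:
    if insight["name"] == "Energy":
        for v in insight["values"]:
            print(f'Speaker {v["speakerId"]} energy: {v["value"]}')

# Conversation-level insights: extractive summary sentences with their time spans
for insight in result["conversationalInsights"]:
    if insight["name"] == "ExtractiveSummary":
        for v in insight["values"]:
            print(f'{v["start"]}s-{v["end"]}s: {v["value"]}')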