Text transcription


Speech-to-text, also known as automatic speech recognition or transcription, is the process of converting the audio contents of a media file into a structured breakdown of what was said, by whom, and when. In addition, the API can automatically punctuate the transcribed text for easier reading and consumption. If speaker diarization is enabled, the response will also contain a speaker count and associate each spoken word with its most probable speaker.

The Speech-to-Text API is best used in tandem with speaker enrollment to help identify the speakers in a media file. For speaker identification, the developer passes the API a list of potential speakers so that each can be identified in the transcript.

English is currently the only supported language.

Transcribing speech to text in media files

Request parameters

Parameter | Type | Description
encoding | String | Encoding of the audio file, e.g. MP3, WAV, etc.
languageCode | String | Language spoken in the audio file. Default of "en-US".
contentUri | String | Publicly accessible URL of the media file.
audioType | String | Type of the audio, based on the number of speakers. Optional. Permitted values: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference.
source | String | Source of the audio file, e.g. Phone, RingCentral, Webex, Zoom, GotoMeeting, GoogleMeet, etc. Optional. The value will be used if enableSpeakerDiarization is set to True.
speakerCount | Number | Number of speakers in the file. Set to -1 (default) if the number of speakers is unknown. Optional. The value will be used if enableSpeakerDiarization is set to True.
speakerIds | List[String] | Set of speakers to be identified. Optional. The value will be used if enableSpeakerDiarization is set to True.
enableVoiceActivityDetection | Boolean | Apply voice activity detection. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True.
enablePunctuation | Boolean | Enables RingCentral's Smart Punctuation API. Optional. Default of True.
enableSpeakerDiarization | Boolean | Tags each word with its corresponding speaker. Optional. Default of False.
separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True.
  • The audioType parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to CallCenter when 2-3 speakers are expected and to Meeting when 4-6 speakers are expected.

  • Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We suggest setting it to True in most circumstances.

  • Setting the source parameter helps to optimize the diarization process by allowing the service to apply a specialized acoustic model built for the corresponding audio source. A sample diarization-enabled request body is shown after this list.
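
For reference, a request body with speaker diarization enabled might look like the sketch below. The contentUri and speakerIds values are placeholders; speakerIds assumes you have previously enrolled those speakers for identification:

{
    "contentUri": "https://example.com/audio/meeting.wav",
    "encoding": "Wav",
    "languageCode": "en-US",
    "source": "RingCentral",
    "audioType": "Meeting",
    "enablePunctuation": true,
    "enableSpeakerDiarization": true,
    "enableVoiceActivityDetection": true,
    "speakerCount": 4,
    "speakerIds": ["speaker-1", "speaker-2"]
}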

Example code

After you have set up a simple web server to process the response, copy and paste the code below into index.js and make sure to edit the variables in ALL CAPS to ensure your code runs properly.
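
The samples below read their configuration from a .env file via dotenv. A minimal example follows; every value is a placeholder to replace with your own app's credentials and URLs:

RC_SERVER_URL=https://platform.ringcentral.com
RC_CLIENT_ID=your-client-id
RC_CLIENT_SECRET=your-client-secret
RC_JWT=your-jwt-credential
RC_MEDIA_URL=https://example.com/audio/sample.wav
WEBHOOK_ADDRESS=https://your-server.example.com/webhook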

const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL   = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and Platform
const rcsdk = new RC({
    'server':       process.env.RC_SERVER_URL,
    'clientId':     process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});

const platform = rcsdk.platform();

// Authenticate with the RingCentral Developer Platform using the developer's JWT credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Speech to Text API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
    speechToText();
})

async function speechToText() {
    try {
        console.log("Calling RingCentral Speech To Text API");
        let resp = await platform.post("/ai/audio/v1/async/speech-to-text?webhook=" + encodeURIComponent(WEBHOOK_URL), {
            "contentUri":               MEDIA_URL,
            "encoding":                 "Wav",
            "languageCode":             "en-US",
            "source":                   "RingCentral",
            "audioType":                "Meeting",
            "enablePunctuation":        true,
            "enableSpeakerDiarization": false
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    } 
    catch (e) {
        console.log("An Error Occurred : " + e.message);
    }
}
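
The transcription result is delivered asynchronously via an HTTP POST to the webhook URL you supplied, so a web server must be listening there. A minimal sketch using Express (our assumption; any server that can accept a POST will work) might look like this:

const express = require('express');
const app = express();

// Parse the JSON body that the Speech-to-Text API posts back
app.use(express.json());

// Receive and log the asynchronous transcription result
app.post('/webhook', (req, res) => {
    console.log(JSON.stringify(req.body, null, 2));
    res.sendStatus(200);
});

app.listen(8080, () => console.log('Webhook server listening on port 8080'));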

Run your sample code.

$ node index.js
The same request can be made in Python. Save the following code in app.py; it includes a small HTTP server for receiving the webhook response.

import os
import logging
from ringcentral import SDK
from dotenv import load_dotenv
from http.server import BaseHTTPRequestHandler, HTTPServer

# Load environment variables
load_dotenv()

# Handle Incoming HTTP requests
class S(BaseHTTPRequestHandler):
    def _set_response(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()

    def do_POST(self):
        content_length = int(self.headers['Content-Length']) # <--- Gets the size of data
        post_data = self.rfile.read(content_length) # <--- Gets the data itself
        if self.path == '/webhook':
            print(post_data)
        self._set_response()

# Invoke Speech to Text API
def speechToText():

    # Endpoint to invoke Speech to Text API 
    endpoint = os.getenv('RC_SERVER_URL')+"/ai/audio/v1/async/speech-to-text"

    querystring = {"webhook":os.getenv('WEBHOOK_ADDRESS')}

    # Payload
    payload = {
        "contentUri": "https://github.com/suyashjoshi/ringcentral-ai-demo/blob/master/public/audio/sample1.wav?raw=true",
        "encoding": "Wav",
        "languageCode": "en-US",
        "source": "RingCentral",
        "audioType": "Meeting",
        "enablePunctuation": True,
        "enableSpeakerDiarization": False
    }
    try:
        # Instantiate the RingCentral SDK
        rcsdk = SDK(os.getenv('RC_CLIENT_ID'), os.getenv('RC_CLIENT_SECRET'), os.getenv('RC_SERVER_URL'))
        platform = rcsdk.platform()

        # Log in using the developer's JWT credential
        platform.login(jwt=os.getenv('RC_JWT'))

        # Make an HTTP POST call to the Speech-to-Text endpoint with the query string and payload
        response = platform.post(endpoint, payload, querystring)
        print(response.json())

    except Exception as e:
        print(e)   

# Create HTTP server to listen on the defined port
def run(server_class=HTTPServer, handler_class=S, port=8080):
    logging.basicConfig(level=logging.INFO)
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    logging.info('Starting httpd...\n')
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    httpd.server_close()
    logging.info('Stopping httpd...\n')

try:
    speechToText()
    run()

except Exception as e:
    print(e)

You are almost done. Now run your script to make the request and receive the response.

$ python3 app.py

Example response

{
    "status": "Success",
    "response": {
        "transcript": "Could produce large hail isolated tornadoes and heavy rain.",
        "confidence": 0.87,
        "words": [
            {
                "word": "could",
                "start": 2.4,
                "end": 2.8,
                "confidence": 0.804
            },
            {
                "word": "produce",
                "start": 2.8,
                "end": 3.12,
                "confidence": 0.965
            },
            {
                "word": "large",
                "start": 3.12,
                "end": 3.44,
                "confidence": 0.859
            },
            {
                "word": "hail",
                "start": 3.6,
                "end": 3.92,
                "confidence": 0.812
            },
            {
                "word": "isolated",
                "start": 4.16,
                "end": 4.48,
                "confidence": 0.841
            },
            {
                "word": "tornadoes",
                "start": 4.56,
                "end": 5.2,
                "confidence": 0.897
            },
            {
                "word": "and",
                "start": 5.2,
                "end": 5.36,
                "confidence": 0.979
            },
            {
                "word": "heavy",
                "start": 5.44,
                "end": 5.76,
                "confidence": 0.867
            },
            {
                "word": "rain",
                "start": 5.84,
                "end": 5.92,
                "confidence": 0.904
            }
        ],
        "audio_duration": 7.096599
    }
}

Response parameters

Parameter | Type | Description
speakerCount | Number | The number of speakers detected. Optional. Set only when enableSpeakerDiarization is true.
words | List | List of word segments (see below).
transcript | String | The entire transcript, with or without punctuation depending on the enablePunctuation setting.
confidence | Number | Overall transcription confidence.

Word Segment

Parameter | Type | Description
speakerId | String | The speaker ID for the corresponding audio segment. Optional. Set only when enableSpeakerDiarization is true.
start | Number | Start time of the audio segment in seconds.
end | Number | End time of the audio segment in seconds.
word | String | The word corresponding to the audio segment.
confidence | Number | Confidence score for the word.
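
Putting the schema to work, a webhook handler might walk the word segments to print a timestamped, per-speaker transcript. The sketch below is illustrative; it assumes the payload shape documented above, with speakerId present only when diarization was enabled:

// Print each word segment with its timestamps and (optional) speaker ID
function printWordSegments(payload) {
    const { transcript, confidence, words } = payload.response;
    console.log(`Transcript (confidence ${confidence}): ${transcript}`);
    for (const w of words) {
        const speaker = w.speakerId ? ` [${w.speakerId}]` : '';
        console.log(`${w.start}s-${w.end}s${speaker} ${w.word} (${w.confidence})`);
    }
}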