Text transcription
Speech-to-text, also known as automatic speech recognition or transcription, is the process of converting the audio content of a media file into a structured breakdown of what was said, by whom, and when. In addition, the API can automatically punctuate the transcribed text for easier reading and consumption. If speaker diarization is enabled, the response will also contain a speaker count and associate each spoken word with its most probable speaker.
The Speech-to-Text API is best used in tandem with speaker enrollment to help identify the speakers in a media file. For speaker identification, the developer passes the API a list of potential speakers so that they can be matched and identified.
English is currently the only supported language.
Transcribing speech to text in media files
Request parameters
Parameter | Type | Description |
---|---|---|
`encoding` | String | Encoding of the audio file, e.g. `MP3`, `WAV`, etc. |
`languageCode` | String | Language spoken in the audio file. Defaults to `"en-US"`. |
`contentUri` | String | Publicly accessible URL of the media file. |
`audioType` | String | Type of the audio based on the number of speakers. Optional. Permitted values: `CallCenter` (default), `Meeting`, `EarningsCalls`, `Interview`, `PressConference`. |
`source` | String | Source of the audio file, e.g. `Phone`, `RingCentral`, `GoogleMeet`, `Zoom`, `Webex`, `GotoMeeting`. Optional. The value is used only when `enableSpeakerDiarization` is set to `True`. |
`speakerCount` | Number | Number of speakers in the file. Set to `-1` (default) if the number of speakers is unknown. Optional. The value is used only when `enableSpeakerDiarization` is set to `True`. |
`speakerIds` | List[String] | Optional list of enrolled speakers to be identified. The value is used only when `enableSpeakerDiarization` is set to `True`. |
`enableVoiceActivityDetection` | Boolean | Apply voice activity detection. Optional. Defaults to `False`. The value is used only when `enableSpeakerDiarization` is set to `True`. |
`enablePunctuation` | Boolean | Enables RingCentral's Smart Punctuation API. Optional. Defaults to `True`. |
`enableSpeakerDiarization` | Boolean | Tags each word with its corresponding speaker. Optional. Defaults to `False`. |
`separateSpeakerPerChannel` | Boolean | Set to `True` if the input audio is multi-channel and each channel has a separate speaker. Optional. Defaults to `False`. The value is used only when `enableSpeakerDiarization` is set to `True`. |
- The `audioType` parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to `CallCenter` when 2-3 speakers are expected to be identified and to `Meeting` when 4-6 speakers are expected.
- Set the `enableVoiceActivityDetection` parameter to `True` if you want silence and noise segments removed from the diarization output. We suggest setting it to `True` in most circumstances.
- Setting the `source` parameter helps optimize the diarization process by selecting a specialized acoustic model built for the corresponding audio source.
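Putting the recommendations above together, a diarization-ready request payload might look like the following sketch. The `contentUri` is a hypothetical placeholder, and the parameter choices simply illustrate the guidance above, not required values:

```python
# Hypothetical request payload for a recorded meeting with an unknown
# number of speakers. Substitute your own contentUri.
payload = {
    "contentUri": "https://example.com/recordings/team-sync.wav",  # placeholder URL
    "encoding": "Wav",
    "languageCode": "en-US",
    "audioType": "Meeting",                # 4-6 speakers expected
    "source": "RingCentral",               # lets the API pick a matching acoustic model
    "speakerCount": -1,                    # number of speakers is unknown
    "enableSpeakerDiarization": True,      # required for source/speakerCount to take effect
    "enableVoiceActivityDetection": True,  # drop silence/noise from diarization output
    "enablePunctuation": True,
}
print(payload["audioType"])
```

Note that `source`, `speakerCount`, and `enableVoiceActivityDetection` only take effect because `enableSpeakerDiarization` is `True`.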
Example code
After you have set up a simple web server to process the response, copy and paste the code below into `index.js`, and be sure to edit the variables in ALL CAPS so your code runs properly.
```javascript
const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and platform
const rcsdk = new RC({
  'server': process.env.RC_SERVER_URL,
  'clientId': process.env.RC_CLIENT_ID,
  'clientSecret': process.env.RC_CLIENT_SECRET
});
const platform = rcsdk.platform();

// Authenticate with the RingCentral Developer Platform using the developer's JWT credential
platform.login({
  'jwt': process.env.RC_JWT
});

// Call the Speech to Text API asynchronously right after login
platform.on(platform.events.loginSuccess, () => {
  speechToText();
});

async function speechToText() {
  try {
    console.log("Calling RingCentral Speech To Text API");
    let resp = await platform.post("/ai/audio/v1/async/speech-to-text?webhook=" + WEBHOOK_URL, {
      "contentUri": MEDIA_URL,
      "encoding": "Wav",
      "languageCode": "en-US",
      "source": "RingCentral",
      "audioType": "Meeting",
      "enablePunctuation": true,
      "enableSpeakerDiarization": false
    });
    console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
  }
  catch (e) {
    console.log("An error occurred: " + e.message);
  }
}
```
Run your sample code.
```shell
$ node index.js
```
```python
import os, sys
import logging
import requests
from ringcentral import SDK
from dotenv import load_dotenv
from http.server import BaseHTTPRequestHandler, HTTPServer

# Load environment variables
load_dotenv()

# Handle incoming HTTP requests
class S(BaseHTTPRequestHandler):
    def _set_response(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()

    def do_POST(self):
        content_length = int(self.headers['Content-Length'])  # <--- Gets the size of data
        post_data = self.rfile.read(content_length)           # <--- Gets the data itself
        if self.path == '/webhook':
            print(post_data)
        self._set_response()

# Invoke the Speech to Text API
def speechToText():
    # Endpoint to invoke the Speech to Text API
    endpoint = os.getenv('RC_SERVER_URL') + "/ai/audio/v1/async/speech-to-text"
    querystring = {"webhook": os.getenv('WEBHOOK_ADDRESS')}
    # Payload
    payload = {
        "contentUri": "https://github.com/suyashjoshi/ringcentral-ai-demo/blob/master/public/audio/sample1.wav?raw=true",
        "encoding": "Wav",
        "languageCode": "en-US",
        "source": "RingCentral",
        "audioType": "Meeting",
        "enablePunctuation": True,
        "enableSpeakerDiarization": False
    }
    try:
        # Instantiate the RingCentral SDK
        rcsdk = SDK(os.getenv('RC_CLIENT_ID'), os.getenv('RC_CLIENT_SECRET'), os.getenv('RC_SERVER_URL'))
        platform = rcsdk.platform()
        # Log in using JWT
        platform.login(jwt=os.getenv('RC_JWT'))
        # Make an HTTP POST call to the Speech to Text endpoint with the query string and payload
        response = platform.post(endpoint, payload, querystring)
        print(response.json())
    except Exception as e:
        print(e)

# Create an HTTP server to listen on the defined port
def run(server_class=HTTPServer, handler_class=S, port=8080):
    logging.basicConfig(level=logging.INFO)
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    logging.info('Starting httpd...\n')
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    httpd.server_close()
    logging.info('Stopping httpd...\n')

try:
    speechToText()
    run()
except Exception as e:
    print(e)
```
You are almost done. Now run your script to make the request and receive the response.
```shell
$ python3 app.py
```
Example response
```json
{
  "status": "Success",
  "response": {
    "transcript": "Could produce large hail isolated tornadoes and heavy rain.",
    "confidence": 0.87,
    "words": [
      { "word": "could", "start": 2.4, "end": 2.8, "confidence": 0.804 },
      { "word": "produce", "start": 2.8, "end": 3.12, "confidence": 0.965 },
      { "word": "large", "start": 3.12, "end": 3.44, "confidence": 0.859 },
      { "word": "hail", "start": 3.6, "end": 3.92, "confidence": 0.812 },
      { "word": "isolated", "start": 4.16, "end": 4.48, "confidence": 0.841 },
      { "word": "tornadoes", "start": 4.56, "end": 5.2, "confidence": 0.897 },
      { "word": "and", "start": 5.2, "end": 5.36, "confidence": 0.979 },
      { "word": "heavy", "start": 5.44, "end": 5.76, "confidence": 0.867 },
      { "word": "rain", "start": 5.84, "end": 5.92, "confidence": 0.904 }
    ],
    "audio_duration": 7.096599
  }
}
```
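A quick sketch of working with a response of this shape: pull out the transcript and flag words whose confidence falls below a chosen threshold for manual review. The `response` dict below mirrors the example JSON above, truncated to three words for brevity, and the 0.85 threshold is an arbitrary choice for illustration:

```python
# Sample response data, truncated from the example above
response = {
    "status": "Success",
    "response": {
        "transcript": "Could produce large hail isolated tornadoes and heavy rain.",
        "confidence": 0.87,
        "words": [
            {"word": "could", "start": 2.4, "end": 2.8, "confidence": 0.804},
            {"word": "produce", "start": 2.8, "end": 3.12, "confidence": 0.965},
            {"word": "large", "start": 3.12, "end": 3.44, "confidence": 0.859},
        ],
        "audio_duration": 7.096599,
    },
}

result = response["response"]
# Words whose confidence falls below the (arbitrary) 0.85 threshold
low_confidence = [w["word"] for w in result["words"] if w["confidence"] < 0.85]
print(low_confidence)  # -> ['could']
```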
Parameter | Type | Description |
---|---|---|
`speakerCount` | Number | The number of speakers detected. Optional; set only when `enableSpeakerDiarization` is `true`. |
`words` | List | List of word segments (see below). |
`transcript` | String | The full transcript, with or without punctuation depending on the `enablePunctuation` input. |
`confidence` | Number | Overall transcription confidence. |
Word Segment
Parameter | Type | Description |
---|---|---|
`speakerId` | String | The speaker ID for the corresponding audio segment. Optional; set only when `enableSpeakerDiarization` is `true`. |
`start` | Number | Start time of the audio segment in seconds. |
`end` | Number | End time of the audio segment in seconds. |
`word` | String | The word corresponding to the audio segment. |
`confidence` | Number | Confidence score for the word. |
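When diarization is enabled, each word segment carries a `speakerId`, which makes it straightforward to stitch words back into readable speaker turns. The sketch below groups consecutive same-speaker segments; the word segments themselves are invented sample data, not API output:

```python
# Invented word segments of the shape described in the table above
words = [
    {"speakerId": "1", "start": 0.0, "end": 0.4, "word": "hello", "confidence": 0.90},
    {"speakerId": "1", "start": 0.5, "end": 0.8, "word": "there", "confidence": 0.92},
    {"speakerId": "2", "start": 1.0, "end": 1.3, "word": "hi", "confidence": 0.88},
]

# Merge consecutive words from the same speaker into turns
turns = []
for seg in words:
    if turns and turns[-1]["speakerId"] == seg["speakerId"]:
        turns[-1]["text"] += " " + seg["word"]
        turns[-1]["end"] = seg["end"]
    else:
        turns.append({"speakerId": seg["speakerId"], "start": seg["start"],
                      "end": seg["end"], "text": seg["word"]})

for t in turns:
    print(f'Speaker {t["speakerId"]} [{t["start"]}-{t["end"]}]: {t["text"]}')
```

This yields one line per speaker turn, e.g. `Speaker 1 [0.0-0.8]: hello there`, which is usually a more useful rendering of a conversation than a flat word list.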