Speech to text transcription
Speech-to-text is the process of converting speech content into text. RingCentral uses advanced machine learning algorithms to transcribe speech to text, then further processes the text to provide a rich transcription with punctuation, the number of speakers, and conversational utterances. Each utterance carries useful properties such as the speaker id and timestamps for the utterance and for every spoken word.
The Speech-to-text API also supports speaker recognition if you have trained the voice signatures of the speakers using the Speaker id enrollment API. Speaker recognition relies on the speakerIds input parameter, a list of pre-enrolled speaker ids of the potential speakers in the conversation.
English is currently the only supported language.
Transcribing speech to text in media files
Request parameters
Parameter | Type | Description |
---|---|---|
encoding | String | Encoding of the audio file, such as MP3 or WAV. |
languageCode | String | Language spoken in the audio file. Default of "en-US". |
contentUri | String | Publicly accessible URL of the media content. |
audioType | String | Type of the audio based on the number of speakers. Optional. Permitted values: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference. |
source | String | The source of the audio file, e.g. Phone, RingCentral, GoogleMeet, Zoom, Webex, GotoMeeting. Optional. The value will be used if enableSpeakerDiarization is set to True. |
speakerCount | Number | Number of speakers in the file. Set to -1 (default) if the number of speakers is unknown. Optional. The value will be used if enableSpeakerDiarization is set to True. |
speakerIds | List[String] | A list of speakers to be identified. See the speaker enrollment section for more details. Optional. The value will be used if enableSpeakerDiarization is set to True. |
enableVoiceActivityDetection | Boolean | Apply voice activity detection. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True. |
enablePunctuation | Boolean | Enables RingCentral's Smart Punctuation API. Optional. Default of True. |
enableSpeakerDiarization | Boolean | Tags each word with its corresponding speaker. Optional. Default of False. |
separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True. |
- The audioType parameter gives the system a hint about the nature of the audio conversation, which helps improve accuracy. We recommend setting this parameter to CallCenter when 2-3 speakers are expected and to Meeting when 4-6 speakers are expected.
- Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We suggest setting it to True in most circumstances.
- Setting the source parameter helps optimize the diarization process by allowing the use of a specialized acoustic model built specifically for the corresponding audio source.
- If you specify the speakerIds parameter, make sure that all the speaker ids in the array exist. Otherwise, the API call will fail. As a good practice, read the speaker ids from your account and pass only the ids of the speakers who you think might speak in the audio file.
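Because a single unknown id makes the call fail, it can help to check the speakerIds list before sending the request. The following is a minimal sketch of such a check. The helper name and the sample ids are hypothetical, and the list of enrolled ids is assumed to have been read from your account beforehand.

```python
def missing_speaker_ids(requested_ids, enrolled_ids):
    """Return the requested ids that are not enrolled, in request order."""
    enrolled = set(enrolled_ids)
    return [sid for sid in requested_ids if sid not in enrolled]


# Hypothetical ids; in practice enrolled_ids would be read from your account.
enrolled_ids = ["speaker-alice", "speaker-bob", "speaker-carol"]
requested_ids = ["speaker-alice", "speaker-dave"]

missing = missing_speaker_ids(requested_ids, enrolled_ids)
if missing:
    print("Not enrolled, remove before calling the API:", missing)
```

An empty result means every requested id is enrolled and the list can be passed as speakerIds.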
Example code
Try out the AI Quick Start Guide
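As a rough illustration of how the request parameters fit together, the sketch below assembles a request body in Python. It only builds and prints the JSON and does not call the API. The helper name, the placeholder URLs, and the "MP3" spelling of the encoding value (taken from the examples in the parameter table) are assumptions, so check the API reference for the exact values your account accepts.

```python
import json

# Permitted audioType values, as listed in the request parameter table.
AUDIO_TYPES = {"CallCenter", "Meeting", "EarningsCalls", "Interview", "PressConference"}


def build_request_body(content_uri, encoding, *, language_code="en-US",
                       audio_type="CallCenter", diarization=False,
                       speaker_count=-1, speaker_ids=None, source=None,
                       voice_activity_detection=False, punctuation=True):
    """Assemble a speech-to-text request body (hypothetical helper)."""
    if audio_type not in AUDIO_TYPES:
        raise ValueError(f"unsupported audioType: {audio_type!r}")
    body = {
        "contentUri": content_uri,
        "encoding": encoding,
        "languageCode": language_code,
        "audioType": audio_type,
        "enablePunctuation": punctuation,
        "enableSpeakerDiarization": diarization,
    }
    if diarization:
        # Per the table, these values are only used when diarization is enabled.
        body["speakerCount"] = speaker_count
        body["enableVoiceActivityDetection"] = voice_activity_detection
        if speaker_ids:
            body["speakerIds"] = list(speaker_ids)
        if source:
            body["source"] = source
    return body


# Placeholder URLs; contentUri must point to publicly accessible media.
voicemail_body = build_request_body("https://example.com/media/voicemail.mp3", "MP3")
meeting_body = build_request_body(
    "https://example.com/media/meeting.mp3", "MP3",
    audio_type="Meeting", diarization=True,
    voice_activity_detection=True, source="Zoom",
)
print(json.dumps(meeting_body, indent=2))
```

The voicemail body leaves diarization off, so the diarization-only parameters are not sent at all. The meeting body turns diarization on and keeps speakerCount at -1 because the number of speakers is unknown.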
Sample response
The response data differs based on the API input parameters. For instance, if the enableSpeakerDiarization flag is set to false, the response will not include the speaker id info and the utterances segment will be omitted. This also shortens the transcription processing time. Therefore, if you need to transcribe a voicemail recording, which typically has a single speaker, you should set enableSpeakerDiarization to false.
{
  "jobId": "c8b1bd02-af17-11ee-93fb-0050568c76a9",
  "api": "/ai/audio/v1/async/speech-to-text",
  "creationTime": "2024-01-09T17:51:58.422Z",
  "completionTime": "2024-01-09T17:56:26.126Z",
  "expirationTime": "2024-01-16T17:51:58.422Z",
  "status": "Success",
  "response": {
    "confidence": 0.9,
    "transcript": "This call is now being recorded. Parker Scarves, how may I help you? I bought a scarf on line for my whites. And it turns out they shipped the wrong color. Oh, I am so sorry, sir. I get it for birthday, which is tonight. And now I am not a 100 % sure what I need to do. Okay, let me see if I can help you. Do you have the item number of the Parker scars? I do not I do not think so. It is called a New Yorker, I think. Excellent, okay. What color did you want The New Yorker in blue, the 1 they shipped was light blue. I wanted the darker 1. Did you want Navy Blue or Royal Blue? What is the difference there? The royal blue is a bit brighter. That is the 1 I want, okay? What zip code are you located in? 1946. It appears that we do not I am sorry that we do have that item in stock at Karen's boutique at the Hunter Mall. Is that close by? It is it is primary office. Okay, what is your name, sir? Charlie Johnson, Charlie Johnson, is that J O H N S O N? Yes, Ma'am and Mr Johnson, do you have the Parker scarf in light blue with you now? I do, they shipped it to my office. It just came in not that long ago, okay? What I will do is make arrangements with Karen's to take for you to exchange the Parker scarf at no additional cost. And in addition, I was able to look up your order in our system. And I am going to send out a special gift to you to make up for the inconvenience. Excellent, thank you so much, you are welcome and thank you for calling Parker scarf, and I hope your wife enjoys your birthday gift. Thank you. Thank you very much. You are very welcome. Goodbye, bye bye.",
    "utterances": [
      {
        "confidence": 0.87,
        "end": 4.800000000000001,
        "speakerId": "0",
        "start": 0.16,
        "text": "This call is now being recorded. Parker Scarves. How may I help you?",
        "wordTimings": [
          {
            "confidence": 0.87,
            "end": 0.24,
            "speakerId": "0",
            "start": 0.16,
            "word": "this"
          },
          {
            "confidence": 0.87,
            "end": 0.48,
            "speakerId": "0",
            "start": 0.4,
            "word": "call"
          },
          {
            "confidence": 0.87,
            "end": 0.72,
            "speakerId": "0",
            "start": 0.64,
            "word": "is"
          },
          ...
        ]
      },
      {
        "confidence": 0.87,
        "end": 9.78,
        "speakerId": "1",
        "start": 4.800000000000001,
        "text": "I bought a scarf on line for my whites, and it turns out they shipped the wrong color.",
        "wordTimings": [
          {
            "confidence": 0.87,
            "end": 5.36,
            "speakerId": "1",
            "start": 4.800000000000001,
            "word": "i"
          },
          ...
        ]
      },
      ...
    ],
    "words": [
      {
        "confidence": 0.87,
        "end": 0.24,
        "start": 0,
        "word": "this"
      },
      {
        "confidence": 0.87,
        "end": 0.48,
        "start": 0.4,
        "word": "call"
      },
      {
        "confidence": 0.87,
        "end": 0.72,
        "start": 0.64,
        "word": "is"
      },
      ...
    ]
  }
}
Parameter | Type | Description |
---|---|---|
speakerCount | Number | The number of speakers detected. Optional. Field is set only when enableSpeakerDiarization is true. |
words | List | List of word segments (see below). |
transcript | String | The entire transcript, with or without punctuation depending on the input. |
confidence | Number | Overall transcription confidence. |
utterances | List | List of utterance segments (see below). |
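To show how the top-level fields might be consumed, here is a minimal sketch that summarizes a finished job. It assumes the job result has already been parsed into a Python dict shaped like the sample response. The helper name is hypothetical, and the sample dict is a shortened stand-in rather than real output. Optional fields are read with defaults because speakerCount and utterances are absent when diarization is disabled.

```python
def summarize_job(job):
    """Return a small summary of a completed job (hypothetical helper)."""
    if job.get("status") != "Success":
        raise RuntimeError(
            f"job {job.get('jobId')} did not succeed: {job.get('status')}"
        )
    response = job["response"]
    return {
        "transcript": response["transcript"],
        "confidence": response["confidence"],
        # Optional fields: only present when diarization was enabled.
        "speaker_count": response.get("speakerCount"),
        "utterance_count": len(response.get("utterances", [])),
        "word_count": len(response.get("words", [])),
    }


# Shortened stand-in for a job result without diarization.
job = {
    "jobId": "c8b1bd02-af17-11ee-93fb-0050568c76a9",
    "status": "Success",
    "response": {
        "confidence": 0.9,
        "transcript": "This call is now being recorded.",
        "words": [
            {"confidence": 0.87, "start": 0, "end": 0.24, "word": "this"},
            {"confidence": 0.87, "start": 0.4, "end": 0.48, "word": "call"},
        ],
    },
}

summary = summarize_job(job)
print(summary["transcript"], summary["confidence"])
```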
Word Segment
Parameter | Type | Description |
---|---|---|
speakerId | String | The speaker id for the corresponding audio segment. Optional. Field is set only when enableSpeakerDiarization is true. |
start | Number | Start time of the audio segment in seconds. |
end | Number | End time of the audio segment in seconds. |
word | String | The word corresponding to the audio segment. |
confidence | Number | Confidence score for the word. |
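The start and end values make it possible to look up what was said in a given time window. The sketch below is one way to do that, assuming the words list has been parsed into Python dicts. The helper name is hypothetical, and the sample data reuses the first three entries of the sample response.

```python
def words_between(words, start_time, end_time):
    """Join the words that lie entirely inside [start_time, end_time] seconds."""
    selected = [
        w["word"] for w in words
        if w["start"] >= start_time and w["end"] <= end_time
    ]
    return " ".join(selected)


words = [
    {"confidence": 0.87, "start": 0, "end": 0.24, "word": "this"},
    {"confidence": 0.87, "start": 0.4, "end": 0.48, "word": "call"},
    {"confidence": 0.87, "start": 0.64, "end": 0.72, "word": "is"},
]

print(words_between(words, 0.3, 0.8))  # call is
```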
Utterances Segment
Parameter | Type | Description |
---|---|---|
speakerId | String | The speaker id for the corresponding audio segment. Optional. Field is set only when enableSpeakerDiarization is true. |
start | Number | Start time of the audio segment in seconds. |
end | Number | End time of the audio segment in seconds. |
text | String | The utterance. |
confidence | Number | Confidence score for the utterance. |
wordTimings | List | List of spoken words within this utterance (see below). |
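Utterance segments lend themselves to a speaker-labelled transcript and to simple per-speaker statistics. The following sketch shows both, assuming the utterances list has been parsed into Python dicts. The helper names and the output format are illustrative choices, and the sample data is taken from the first two utterances of the sample response with wordTimings left out.

```python
from collections import defaultdict


def format_utterances(utterances):
    """Render each utterance as '[start-end] speaker N: text'."""
    return [
        f"[{u['start']:.2f}-{u['end']:.2f}] "
        f"speaker {u.get('speakerId', '?')}: {u['text']}"
        for u in utterances
    ]


def talk_time(utterances):
    """Total speaking time in seconds per speaker id."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u.get("speakerId", "unknown")] += u["end"] - u["start"]
    return dict(totals)


utterances = [
    {"speakerId": "0", "start": 0.16, "end": 4.800000000000001, "confidence": 0.87,
     "text": "This call is now being recorded. Parker Scarves. How may I help you?"},
    {"speakerId": "1", "start": 4.800000000000001, "end": 9.78, "confidence": 0.87,
     "text": "I bought a scarf on line for my whites, and it turns out they shipped the wrong color."},
]

for line in format_utterances(utterances):
    print(line)
totals = talk_time(utterances)
```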
WordTimings Segment
Parameter | Type | Description |
---|---|---|
speakerId | String | The speaker id for the corresponding audio segment. Optional. Field is set only when enableSpeakerDiarization is true. |
confidence | Number | Confidence score for the word. |
start | Number | Start time of the audio segment in seconds. |
end | Number | End time of the audio segment in seconds. |
word | String | The spoken word. |
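One use of the per-utterance wordTimings is estimating how fast a speaker talks. The sketch below computes words per minute for a single utterance, assuming it has been parsed into a Python dict. The helper name is hypothetical, and the sample utterance uses made-up timings chosen for easy arithmetic, not values from a real response.

```python
def words_per_minute(utterance):
    """Speaking rate of one utterance, from its wordTimings and time span."""
    duration = utterance["end"] - utterance["start"]
    if duration <= 0:
        return 0.0  # avoid dividing by zero on empty or malformed segments
    return len(utterance["wordTimings"]) / duration * 60.0


# Made-up timings for illustration: 3 words spoken in 1.5 seconds.
utterance = {
    "speakerId": "0", "start": 0.0, "end": 1.5, "confidence": 0.87,
    "text": "This call is",
    "wordTimings": [
        {"word": "this", "start": 0.0, "end": 0.4, "speakerId": "0", "confidence": 0.87},
        {"word": "call", "start": 0.5, "end": 0.9, "speakerId": "0", "confidence": 0.87},
        {"word": "is", "start": 1.0, "end": 1.5, "speakerId": "0", "confidence": 0.87},
    ],
}

rate = words_per_minute(utterance)
print(f"{rate:.0f} words per minute")
```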