Text transcription


Speech-to-text, also known as automatic speech recognition or transcription, is the process of converting the audio contents of a media file into a structured breakdown of what was said, by whom, and when. In addition, the API can automatically punctuate the transcribed text for easier reading and consumption. If speaker diarization is enabled, the response will also contain a speaker count and associate each spoken word with its most probable speaker.

The Speech-to-Text API is best used in tandem with speaker enrollment to help identify the speakers in a media file. For speaker identification, the developer passes the API a list of potential speakers so that each can be identified in the transcript.

English is currently the only supported language.

Transcribing speech to text in media files

Request parameters

Parameter | Type | Description
encoding | String | Encoding of the audio file, e.g. MP3, WAV, etc.
languageCode | String | Language spoken in the audio file. Default of "en-US".
contentUri | String | Publicly accessible URL of the media file.
audioType | String | Type of the audio, based on the number of speakers. Optional. Permitted values: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference.
source | String | Source of the audio file, e.g. Phone, RingCentral, Webex, Zoom, GotoMeeting, GoogleMeet, etc. Optional. The value will be used if enableSpeakerDiarization is set to True.
speakerCount | Number | Number of speakers in the file. Set to -1 (default) if the number of speakers is unknown. Optional. The value will be used if enableSpeakerDiarization is set to True.
speakerIds | List[String] | Set of speakers to be identified. Optional. The value will be used if enableSpeakerDiarization is set to True.
enableVoiceActivityDetection | Boolean | Apply voice activity detection. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True.
enablePunctuation | Boolean | Enables RingCentral's Smart Punctuation API. Optional. Default of True.
enableSpeakerDiarization | Boolean | Tags each word with its corresponding speaker. Optional. Default of False.
separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Default of False. The value will be used if enableSpeakerDiarization is set to True.
  • The audioType parameter provides the system with a hint about the nature of the meeting, which helps improve accuracy. We recommend setting this parameter to CallCenter when 2-3 speakers are expected and to Meeting when 4-6 speakers are expected.

  • Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We suggest setting it to True in most circumstances.

  • Setting the source parameter helps to optimize the diarization process by allowing the service to apply a specialized acoustic model built for the corresponding audio source. A sample diarization-enabled request body is shown after this list.
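
For reference, a request body with speaker diarization enabled might look like the sketch below. The contentUri and speakerIds values are placeholders; speakerIds assumes you have previously enrolled those speakers for identification:

{
    "contentUri": "https://example.com/audio/meeting.wav",
    "encoding": "Wav",
    "languageCode": "en-US",
    "source": "RingCentral",
    "audioType": "Meeting",
    "enablePunctuation": true,
    "enableSpeakerDiarization": true,
    "enableVoiceActivityDetection": true,
    "speakerCount": 4,
    "speakerIds": ["speaker-1", "speaker-2"]
}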

Example code

After you have set up a simple web server to process the response, copy and paste the code below into index.js and make sure to edit the variables in ALL CAPS to ensure your code runs properly.
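
The samples below read their configuration from a .env file via dotenv. A minimal example follows; every value is a placeholder to replace with your own app's credentials and URLs:

RC_SERVER_URL=https://platform.ringcentral.com
RC_CLIENT_ID=your-client-id
RC_CLIENT_SECRET=your-client-secret
RC_JWT=your-jwt-credential
RC_MEDIA_URL=https://example.com/audio/sample.wav
WEBHOOK_ADDRESS=https://your-server.example.com/webhook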

const RC = require('@ringcentral/sdk').SDK;
require('dotenv').config();

const MEDIA_URL   = process.env.RC_MEDIA_URL;
const WEBHOOK_URL = '<INSERT YOUR WEBHOOK URL>';

// Initialize the RingCentral SDK and Platform
const rcsdk = new RC({
    'server':       process.env.RC_SERVER_URL,
    'clientId':     process.env.RC_CLIENT_ID,
    'clientSecret': process.env.RC_CLIENT_SECRET
});

const platform = rcsdk.platform();

// Authenticate with the RingCentral Developer Platform using the developer's JWT credential
platform.login({
    'jwt': process.env.RC_JWT
});

// Call the Speech to Text API right after login asynchronously
platform.on(platform.events.loginSuccess, () => {
    speechToText();
})

async function speechToText() {
    try {
        console.log("Calling RingCentral Speech To Text API");
        let resp = await platform.post("/ai/audio/v1/async/speech-to-text?webhook=" + encodeURIComponent(WEBHOOK_URL), {
            "contentUri":               MEDIA_URL,
            "encoding":                 "Wav",
            "languageCode":             "en-US",
            "source":                   "RingCentral",
            "audioType":                "Meeting",
            "enablePunctuation":        true,
            "enableSpeakerDiarization": false
        });
        console.log("Job is " + resp.statusText + " with HTTP status code " + resp.status);
    } 
    catch (e) {
        console.log("An Error Occurred : " + e.message);
    }
}
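
The transcription result is delivered asynchronously via an HTTP POST to the webhook URL you supplied, so a web server must be listening there. A minimal sketch using Express (our assumption; any server that can accept a POST will work) might look like this:

const express = require('express');
const app = express();

// Parse the JSON body that the Speech-to-Text API posts back
app.use(express.json());

// Receive and log the asynchronous transcription result
app.post('/webhook', (req, res) => {
    console.log(JSON.stringify(req.body, null, 2));
    res.sendStatus(200);
});

app.listen(8080, () => console.log('Webhook server listening on port 8080'));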

Run your sample code.

$ node index.js
The same request can be made in Python. Save the following code in app.py; it includes a small HTTP server for receiving the webhook response.

import os
import logging
from ringcentral import SDK
from dotenv import load_dotenv
from http.server import BaseHTTPRequestHandler, HTTPServer

# Load environment variables
load_dotenv()

# Handle Incoming HTTP requests
class S(BaseHTTPRequestHandler):
    def _set_response(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/html')
        self.end_headers()

    def do_POST(self):
        content_length = int(self.headers['Content-Length']) # <--- Gets the size of data
        post_data = self.rfile.read(content_length) # <--- Gets the data itself
        if self.path == '/webhook':
            print(post_data)
        self._set_response()

# Invoke Speech to Text API
def speechToText():

    # Endpoint to invoke Speech to Text API 
    endpoint = os.getenv('RC_SERVER_URL')+"/ai/audio/v1/async/speech-to-text"

    querystring = {"webhook":os.getenv('WEBHOOK_ADDRESS')}

    # Payload
    payload = {
        "contentUri": "https://github.com/suyashjoshi/ringcentral-ai-demo/blob/master/public/audio/sample1.wav?raw=true",
        "encoding": "Wav",
        "languageCode": "en-US",
        "source": "RingCentral",
        "audioType": "Meeting",
        "enablePunctuation": True,
        "enableSpeakerDiarization": False
    }
    try:
        # Instantiate the RingCentral SDK
        rcsdk = SDK(os.getenv('RC_CLIENT_ID'), os.getenv('RC_CLIENT_SECRET'), os.getenv('RC_SERVER_URL'))
        platform = rcsdk.platform()

        # Log in using the developer's JWT credential
        platform.login(jwt=os.getenv('RC_JWT'))

        # Make an HTTP POST call to the Speech-to-Text endpoint with the query string and payload
        response = platform.post(endpoint, payload, querystring)
        print(response.json())

    except Exception as e:
        print(e)   

# Create HTTP server to listen on the defined port
def run(server_class=HTTPServer, handler_class=S, port=8080):
    logging.basicConfig(level=logging.INFO)
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    logging.info('Starting httpd...\n')
    try:
        httpd.serve_forever()
    except KeyboardInterrupt:
        pass
    httpd.server_close()
    logging.info('Stopping httpd...\n')

try:
    speechToText()
    run()

except Exception as e:
    print(e)

You are almost done. Now run your script to make the request and receive the response.

$ python3 app.py

Example response

{
    "status": "Success",
    "response": {
        "transcript": "Could produce large hail isolated tornadoes and heavy rain.",
        "confidence": 0.87,
        "words": [
            {
                "word": "could",
                "start": 2.4,
                "end": 2.8,
                "confidence": 0.804
            },
            {
                "word": "produce",
                "start": 2.8,
                "end": 3.12,
                "confidence": 0.965
            },
            {
                "word": "large",
                "start": 3.12,
                "end": 3.44,
                "confidence": 0.859
            },
            {
                "word": "hail",
                "start": 3.6,
                "end": 3.92,
                "confidence": 0.812
            },
            {
                "word": "isolated",
                "start": 4.16,
                "end": 4.48,
                "confidence": 0.841
            },
            {
                "word": "tornadoes",
                "start": 4.56,
                "end": 5.2,
                "confidence": 0.897
            },
            {
                "word": "and",
                "start": 5.2,
                "end": 5.36,
                "confidence": 0.979
            },
            {
                "word": "heavy",
                "start": 5.44,
                "end": 5.76,
                "confidence": 0.867
            },
            {
                "word": "rain",
                "start": 5.84,
                "end": 5.92,
                "confidence": 0.904
            }
        ],
        "audio_duration": 7.096599
    }
}

Response parameters

Parameter | Type | Description
speakerCount | Number | The number of speakers detected. Optional. Set only when enableSpeakerDiarization is true.
words | List | List of word segments (see below).
transcript | String | The entire transcript, with or without punctuation depending on the enablePunctuation setting.
confidence | Number | Overall transcription confidence.

Word Segment

Parameter | Type | Description
speakerId | String | The speaker ID for the corresponding audio segment. Optional. Set only when enableSpeakerDiarization is true.
start | Number | Start time of the audio segment in seconds.
end | Number | End time of the audio segment in seconds.
word | String | The word corresponding to the audio segment.
confidence | Number | Confidence score for the word.
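
Putting the schema to work, a webhook handler might walk the word segments to print a timestamped, per-speaker transcript. The sketch below is illustrative; it assumes the payload shape documented above, with speakerId present only when diarization was enabled:

// Print each word segment with its timestamps and (optional) speaker ID
function printWordSegments(payload) {
    const { transcript, confidence, words } = payload.response;
    console.log(`Transcript (confidence ${confidence}): ${transcript}`);
    for (const w of words) {
        const speaker = w.speakerId ? ` [${w.speakerId}]` : '';
        console.log(`${w.start}s-${w.end}s${speaker} ${w.word} (${w.confidence})`);
    }
}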