Speech to text transcription

Last updated: 2024-02-08

Speech-to-text is the process of converting spoken audio into text. RingCentral uses machine learning algorithms to transcribe speech and then processes the text further to produce a rich transcript: punctuation, the number of speakers, and conversational utterances annotated with useful properties such as the speaker id and timestamps for every utterance and every spoken word.

The Speech-to-text API also supports speaker recognition, provided you have enrolled the voice signatures of the speakers using the Speaker id enrollment API. Speaker recognition relies on the speakerIds input parameter, a list of pre-enrolled speaker ids for the people who may speak in the conversation.

English is currently the only supported language.

Transcribing speech to text in media files

Request parameters

| Parameter | Type | Description |
|---|---|---|
| encoding | String | Encoding of the audio file, such as MP3 or WAV. |
| languageCode | String | Language spoken in the audio file. Defaults to "en-US". |
| contentUri | String | Publicly accessible URL of the media content. |
| audioType | String | Type of audio, based on the number of speakers. Optional. Permitted values: CallCenter (default), Meeting, EarningsCalls, Interview, PressConference. |
| source | String | Source of the audio file, for example Phone, RingCentral, Webex, Zoom, GotoMeeting, or GoogleMeet. Optional. Used only if enableSpeakerDiarization is set to True. |
| speakerCount | Number | Number of speakers in the file. Set to -1 (default) if the number of speakers is unknown. Optional. Used only if enableSpeakerDiarization is set to True. |
| speakerIds | List[String] | A list of speakers to be identified. See the speaker enrollment section for more details. Optional. Used only if enableSpeakerDiarization is set to True. |
| enableVoiceActivityDetection | Boolean | Applies voice activity detection. Optional. Defaults to False. Used only if enableSpeakerDiarization is set to True. |
| enablePunctuation | Boolean | Enables RingCentral's Smart Punctuation API. Optional. Defaults to True. |
| enableSpeakerDiarization | Boolean | Tags each word with its corresponding speaker. Optional. Defaults to False. |
| separateSpeakerPerChannel | Boolean | Set to True if the input audio is multi-channel and each channel has a separate speaker. Optional. Defaults to False. Used only if enableSpeakerDiarization is set to True. |

  • The audioType parameter gives the system a hint about the nature of the conversation, which helps improve accuracy. We recommend setting it to CallCenter when 2-3 speakers are expected and to Meeting when 4-6 speakers are expected.

  • Set the enableVoiceActivityDetection parameter to True if you want silence and noise segments removed from the diarization output. We recommend setting it to True in most circumstances.

  • Setting the source parameter helps optimize diarization by letting the system use an acoustic model built specifically for the corresponding audio source.

  • If you specify the speakerIds parameter, make sure that every speaker id in the array exists. Otherwise, the API call will fail. As a good practice, read the enrolled speaker ids from your account first and pass only the ids of the speakers you expect to speak in the audio file.

Example code

Try out the AI Quick Start Guide
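
As a minimal sketch of how the request parameters fit together, the Python snippet below assembles a JSON request body and drops the options that the parameter table says are only used when enableSpeakerDiarization is true. The helper name `build_stt_request`, the example URL, and the "MP3" spelling of the encoding value are assumptions made for illustration; authentication and the HTTP call itself are deliberately left out, so consult the Quick Start Guide for the actual request flow.

```python
import json
from typing import Any, Dict

# Parameters that, per the table above, are only used when
# enableSpeakerDiarization is true.
DIARIZATION_ONLY = (
    "source",
    "speakerCount",
    "speakerIds",
    "enableVoiceActivityDetection",
    "separateSpeakerPerChannel",
)


def build_stt_request(content_uri: str, encoding: str, *,
                      language_code: str = "en-US",
                      audio_type: str = "CallCenter",
                      enable_punctuation: bool = True,
                      enable_speaker_diarization: bool = False,
                      **diarization_options: Any) -> Dict[str, Any]:
    """Assemble a request body (hypothetical helper, not part of any SDK)."""
    unknown = set(diarization_options) - set(DIARIZATION_ONLY)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    body: Dict[str, Any] = {
        "contentUri": content_uri,
        "encoding": encoding,
        "languageCode": language_code,
        "audioType": audio_type,
        "enablePunctuation": enable_punctuation,
        "enableSpeakerDiarization": enable_speaker_diarization,
    }
    # Diarization-only options are ignored by the API otherwise, so skip them.
    if enable_speaker_diarization:
        body.update(diarization_options)
    return body


# Placeholder URL and encoding value, for illustration only.
body = build_stt_request(
    "https://example.com/recordings/call.mp3",
    "MP3",
    enable_speaker_diarization=True,
    speakerCount=2,
    enableVoiceActivityDetection=True,
)
print(json.dumps(body, indent=2))
```

Filtering the diarization-only options on the client keeps the request body small and makes it obvious which settings actually take effect.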

Sample response

The response data depends on the API input parameters. For instance, if enableSpeakerDiarization is set to false, the response does not include speaker ids and the utterances segment is omitted. Disabling diarization also shortens the transcription processing time, so if you only need to transcribe a voicemail recording, set enableSpeakerDiarization to false.

{
    "jobId": "c8b1bd02-af17-11ee-93fb-0050568c76a9",
    "api": "/ai/audio/v1/async/speech-to-text",
    "creationTime": "2024-01-09T17:51:58.422Z",
    "completionTime": "2024-01-09T17:56:26.126Z",
    "expirationTime": "2024-01-16T17:51:58.422Z",
    "status": "Success",
    "response": {
        "confidence": 0.9,
        "transcript": "This call is now being recorded. Parker Scarves, how may I help you? I bought a scarf on line for my whites. And it turns out they shipped the wrong color. Oh, I am so sorry, sir. I get it for birthday, which is tonight. And now I am not a 100 % sure what I need to do. Okay, let me see if I can help you. Do you have the item number of the Parker scars? I do not I do not think so. It is called a New Yorker, I think. Excellent, okay. What color did you want The New Yorker in blue, the 1 they shipped was light blue. I wanted the darker 1. Did you want Navy Blue or Royal Blue? What is the difference there? The royal blue is a bit brighter. That is the 1 I want, okay? What zip code are you located in? 1946. It appears that we do not I am sorry that we do have that item in stock at Karen's boutique at the Hunter Mall. Is that close by? It is it is primary office. Okay, what is your name, sir? Charlie Johnson, Charlie Johnson, is that J O H N S O N? Yes, Ma'am and Mr Johnson, do you have the Parker scarf in light blue with you now? I do, they shipped it to my office. It just came in not that long ago, okay? What I will do is make arrangements with Karen's to take for you to exchange the Parker scarf at no additional cost. And in addition, I was able to look up your order in our system. And I am going to send out a special gift to you to make up for the inconvenience. Excellent, thank you so much, you are welcome and thank you for calling Parker scarf, and I hope your wife enjoys your birthday gift. Thank you. Thank you very much. You are very welcome. Goodbye, bye bye.",
        "utterances": [
            {
                "confidence": 0.87,
                "end": 4.800000000000001,
                "speakerId": "0",
                "start": 0.16,
                "text": "This call is now being recorded. Parker Scarves. How may I help you?",
                "wordTimings": [
                    {
                        "confidence": 0.87,
                        "end": 0.24,
                        "speakerId": "0",
                        "start": 0.16,
                        "word": "this"
                    },
                    {
                        "confidence": 0.87,
                        "end": 0.48,
                        "speakerId": "0",
                        "start": 0.4,
                        "word": "call"
                    },
                    {
                        "confidence": 0.87,
                        "end": 0.72,
                        "speakerId": "0",
                        "start": 0.64,
                        "word": "is"
                    },
                    ...
                ]
            },
            {
                "confidence": 0.87,
                "end": 9.78,
                "speakerId": "1",
                "start": 4.800000000000001,
                "text": "I bought a scarf on line for my whites, and it turns out they shipped the wrong color.",
                "wordTimings": [
                    {
                        "confidence": 0.87,
                        "end": 5.36,
                        "speakerId": "1",
                        "start": 4.800000000000001,
                        "word": "i"
                    },
                    ...
                ]
            },
            ...
        ],
        "words": [
            {
                "confidence": 0.87,
                "end": 0.24,
                "start": 0,
                "word": "this"
            },
            {
                "confidence": 0.87,
                "end": 0.48,
                "start": 0.4,
                "word": "call"
            },
            {
                "confidence": 0.87,
                "end": 0.72,
                "start": 0.64,
                "word": "is"
            },
            ...
        ]
    }
}

| Parameter | Type | Description |
|---|---|---|
| speakerCount | Number | The number of speakers detected. Optional. Set only when enableSpeakerDiarization is true. |
| words | List | List of word segments (see Word Segment below). |
| transcript | String | The entire transcript, with or without punctuation depending on the enablePunctuation input. |
| confidence | Number | Overall transcription confidence. |
| utterances | List | List of utterance segments (see Utterances Segment below). |
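
The fields above could be read from a completed job payload roughly as follows. This is a sketch under stated assumptions: the helper name `extract_transcript` is hypothetical, and treating every status other than "Success" as a failure is a simplification, since this page only shows the "Success" value. The shortened `sample_job` omits utterances to mimic a request made with enableSpeakerDiarization set to false.

```python
from typing import Any, Dict


def extract_transcript(job: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the main response fields out of a completed job payload."""
    if job.get("status") != "Success":
        raise RuntimeError(
            f"job {job.get('jobId')} not successful: {job.get('status')}"
        )
    response = job["response"]
    return {
        "transcript": response["transcript"],
        "confidence": response["confidence"],
        "word_count": len(response.get("words", [])),
        # Both fields below are absent when diarization was disabled.
        "utterances": response.get("utterances", []),
        "speaker_count": response.get("speakerCount"),
    }


# Shortened version of the sample response, without diarization output.
sample_job = {
    "jobId": "c8b1bd02-af17-11ee-93fb-0050568c76a9",
    "status": "Success",
    "response": {
        "confidence": 0.9,
        "transcript": "This call is now being recorded.",
        "words": [
            {"confidence": 0.87, "end": 0.24, "start": 0, "word": "this"},
        ],
    },
}

result = extract_transcript(sample_job)
print(result["transcript"], result["speaker_count"])
```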

Word Segment

| Parameter | Type | Description |
|---|---|---|
| speakerId | String | The speaker id for the corresponding audio segment. Optional. Set only when enableSpeakerDiarization is true. |
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |
| word | String | The word corresponding to the audio segment. |
| confidence | Number | Confidence score for the word. |
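
One possible use of the per-word confidence and timestamps is to flag words that may need manual review. The sketch below assumes a review threshold of 0.8, which is an arbitrary illustrative value rather than a documented recommendation, and the helper name `low_confidence_words` is hypothetical.

```python
from typing import Any, Dict, List


def low_confidence_words(words: List[Dict[str, Any]],
                         threshold: float = 0.8) -> List[str]:
    """Return 'word @ start-end' labels for words scored below the threshold."""
    flagged = []
    for w in words:
        if w["confidence"] < threshold:
            flagged.append(f"{w['word']} @ {w['start']:.2f}-{w['end']:.2f}s")
    return flagged


# Illustrative input: the first entry mirrors the sample response; the second
# uses a made-up low score to show the filter in action.
words = [
    {"confidence": 0.87, "start": 0.4, "end": 0.48, "word": "call"},
    {"confidence": 0.42, "start": 0.64, "end": 0.72, "word": "is"},
]
print(low_confidence_words(words))
```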

Utterances Segment

| Parameter | Type | Description |
|---|---|---|
| speakerId | String | The speaker id for the corresponding audio segment. Optional. Set only when enableSpeakerDiarization is true. |
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |
| text | String | The utterance. |
| confidence | Number | Confidence score for the utterance. |
| wordTimings | List | List of spoken words within this utterance (see WordTimings Segment below). |
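
Utterance segments lend themselves to a speaker-labeled, time-stamped dialogue view. The following sketch uses the two utterances from the sample response; the helper name `format_dialogue` and the output layout are illustrative choices, not something the API produces.

```python
from typing import Any, Dict, List


def format_dialogue(utterances: List[Dict[str, Any]]) -> List[str]:
    """Render utterances as '[start-end] Speaker N: text', ordered by start time."""
    lines = []
    for u in sorted(utterances, key=lambda item: item["start"]):
        # speakerId is only present when enableSpeakerDiarization is true.
        speaker = u.get("speakerId", "?")
        lines.append(
            f"[{u['start']:.2f}-{u['end']:.2f}] Speaker {speaker}: {u['text']}"
        )
    return lines


# The two utterances shown in the sample response (wordTimings omitted).
utterances = [
    {
        "confidence": 0.87,
        "start": 0.16,
        "end": 4.800000000000001,
        "speakerId": "0",
        "text": "This call is now being recorded. Parker Scarves. How may I help you?",
    },
    {
        "confidence": 0.87,
        "start": 4.800000000000001,
        "end": 9.78,
        "speakerId": "1",
        "text": "I bought a scarf on line for my whites, and it turns out they shipped the wrong color.",
    },
]

for line in format_dialogue(utterances):
    print(line)
```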

WordTimings Segment

| Parameter | Type | Description |
|---|---|---|
| speakerId | String | The speaker id for the corresponding audio segment. Optional. Set only when enableSpeakerDiarization is true. |
| confidence | Number | Confidence score for the word. |
| start | Number | Start time of the audio segment in seconds. |
| end | Number | End time of the audio segment in seconds. |
| word | String | The spoken word. |
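
Because each wordTimings entry carries a speakerId together with start and end times, word-level talk time per speaker can be estimated by summing word durations. The sketch below is one way to do that; the helper name `talk_time_by_speaker` is hypothetical, and the input reuses the word timings shown in the sample response.

```python
from collections import defaultdict
from typing import Any, Dict, List


def talk_time_by_speaker(utterances: List[Dict[str, Any]]) -> Dict[str, float]:
    """Sum the duration of every spoken word, grouped by speakerId."""
    totals: Dict[str, float] = defaultdict(float)
    for utterance in utterances:
        for timing in utterance.get("wordTimings", []):
            # speakerId is only set when enableSpeakerDiarization is true.
            speaker = timing.get("speakerId", "unknown")
            totals[speaker] += timing["end"] - timing["start"]
    return dict(totals)


# Word timings copied from the sample response (elided entries left out).
sample_utterances = [
    {
        "speakerId": "0",
        "wordTimings": [
            {"speakerId": "0", "start": 0.16, "end": 0.24, "word": "this"},
            {"speakerId": "0", "start": 0.4, "end": 0.48, "word": "call"},
            {"speakerId": "0", "start": 0.64, "end": 0.72, "word": "is"},
        ],
    },
    {
        "speakerId": "1",
        "wordTimings": [
            {"speakerId": "1", "start": 4.800000000000001, "end": 5.36, "word": "i"},
        ],
    },
]

for speaker, seconds in talk_time_by_speaker(sample_utterances).items():
    print(f"Speaker {speaker}: {seconds:.2f}s")
```

Summing word durations excludes the pauses between words; summing utterance start and end times instead would include them.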