Features
The following features are available in the Asynchronous Speech-to-Text and Streaming Speech-to-Text APIs.
Custom vocabularies
To improve recognition accuracy for words or terms that are not in a standard English dictionary, submit them as a custom vocabulary.
Custom vocabularies are submitted as a list of phrases. A phrase can be one word or multiple words, usually describing a single object or concept.
Here is an example of submitting a custom vocabulary containing the made-up word sparkletini to the API:
curl -X POST "https://api.rev.ai/speechtotext/v1/vocabularies" \
-H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "custom_vocabularies": [{ "phrases": ["sparkletini"] }] }'
attention
Learn more in the Custom Vocabulary API Reference.
Punctuation and inverse text normalization
Rev AI automatically adds punctuation and performs inverse text normalization on all processed audio. Inverse text normalization (ITN) is the process of converting spoken-form text into written-form text, including dates, times, and phone numbers.
Examples:
- Dates: "June twentieth twenty twenty" becomes "June 20th, 2020"
- Phone numbers: "one two three one two three one two three four" becomes "(123) 123-1234"
ITN is performed on all audio submitted to the Asynchronous Speech-to-Text API. For audio submitted to the Streaming Speech-to-Text API, ITN is only performed on Final Hypotheses.
Here is an example of a transcript containing punctuation:
{
  "monologues": [
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Hello",
          "ts": 0.5,
          "end_ts": 1.5,
          "confidence": 1
        },
        {
          "type": "punct",
          "value": " "
        },
        {
          "type": "text",
          "value": "World",
          "ts": 1.75,
          "end_ts": 2.85,
          "confidence": 0.8
        },
        {
          "type": "punct",
          "value": "."
        }
      ]
    },
    {
      ...
    }
  ]
}
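Punctuation is added by default. If you need raw, unpunctuated output, it can be turned off per job. The request below is a sketch that assumes a skip_punctuation option on the asynchronous job endpoint; confirm the exact option name in the Asynchronous Speech-to-Text API Reference.
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
-H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "media_url": "https://www.example.com/audio.mp3", "skip_punctuation": true }'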
attention
Learn more about punctuation control in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.
Disfluency or filler word removal
Disfluencies can be distracting because they break the flow of speech, especially in written text. The APIs currently filter only for "ums" and "uhs"; when this setting is enabled, these disfluencies do not appear in the transcription output.
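The request below is a sketch that assumes a remove_disfluencies option on the asynchronous job endpoint; the equivalent streaming parameter is listed in the Streaming Speech-to-Text API Reference.
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
-H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "media_url": "https://www.example.com/audio.mp3", "remove_disfluencies": true }'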
attention
Learn more about disfluency removal in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.
Profanity filtering
The current profanity dictionary contains approximately 600 profane words and phrases. When this feature is enabled, any transcribed word on this list is masked with asterisks, except for its first and last characters.
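The request below is a sketch that assumes a filter_profanity option on the asynchronous job endpoint; confirm the exact option name in the API References. With the filter enabled, a matched word such as "damn" would appear in the transcript as "d**n".
curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
-H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
-H "Content-Type: application/json" \
-d '{ "media_url": "https://www.example.com/audio.mp3", "filter_profanity": true }'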
attention
Learn more about profanity filtering in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.
Timestamps
The JSON transcription output includes timestamps for every transcribed word. Timestamps correspond to when the words are spoken within the audio and can be used for alignment, analytics, live captions, etc.
Here is an example of a transcript with timestamps:
{
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hi",
          "ts": 0.27,
          "end_ts": 0.48,
          "confidence": 1
        },
        {
          "type": "text",
          "value": "my",
          "ts": 0.51,
          "end_ts": 0.66,
          "confidence": 1
        },
        {
          "type": "text",
          "value": "name's",
          "ts": 0.66,
          "end_ts": 0.84,
          "confidence": 0.84
        },
        {
          "type": "text",
          "value": "Jack",
          "ts": 0.84,
          "end_ts": 1.05,
          "confidence": 0.99
        },
        {
          ...
        }
      ]
    }
  ]
}
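As a sketch of how these timestamps can be consumed, assuming a transcript response like the one above is saved locally as transcript.json (a hypothetical filename), a jq filter can pull out each word together with its start and end times for alignment or caption generation:
jq '[.monologues[].elements[] | select(.type == "text") | {value, ts, end_ts}]' transcript.json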
attention
Learn more about working with timestamps in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.
Speaker separation or diarization
Speaker diarization is the process of separating audio segments according to speaker identification. Diarization is performed by default on all audio processed through the Asynchronous Speech-to-Text API. When multiple speakers are detected, transcription output will be separated by speaker.
Here is an example of a transcript with multiple speakers:
{
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hi",
          "ts": 0.27,
          "end_ts": 0.48,
          "confidence": 1
        },
        ...
      ]
    },
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Although",
          "ts": 3.14,
          "end_ts": 3.56,
          "confidence": 1
        },
        ...
      ]
    }
  ]
}
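As a sketch, assuming the diarized transcript above is saved locally as transcript.json (a hypothetical filename), a jq one-liner can rebuild each speaker's turn as a line of text by concatenating the element values, since spacing and punctuation are included as elements:
jq -r '.monologues[] | "Speaker \(.speaker): " + ([.elements[].value] | join(""))' transcript.json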
attention
Learn more about speaker diarization in the Asynchronous Speech-to-Text API Reference and Best Practices Guide.