Features

The following features are available in the Asynchronous Speech-to-Text and Streaming Speech-to-Text APIs.

Custom vocabularies

To improve recognition accuracy for words and terms that do not appear in a standard English dictionary, submit them as a custom vocabulary.

Custom vocabularies are submitted as a list of phrases. A phrase can be one word or multiple words, usually describing a single object or concept.

Here is an example of submitting a custom vocabulary containing the made-up word sparkletini to the API:

curl -X POST "https://api.rev.ai/speechtotext/v1/vocabularies" \
    -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{ "custom_vocabularies": [{ "phrases": ["sparkletini"] }] }'

Punctuation and inverse text normalization

Rev AI automatically adds punctuation and performs inverse text normalization on all processed audio. Inverse text normalization (ITN) is the process of converting spoken-form text to written-form text, including dates, times, and phone numbers.

Examples:

  • Dates: "June twentieth twenty twenty" becomes "June 20th, 2020"
  • Phone numbers: "one two three one two three one two three four" becomes "(123) 123-1234"

ITN is performed on all audio submitted to the Asynchronous Speech-to-Text API. For audio submitted to the Streaming Speech-to-Text API, ITN is only performed on Final Hypotheses.

Here is an example of a transcript containing punctuation:

{
  "monologues": [
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Hello",
          "ts": 0.5,
          "end_ts": 1.5,
          "confidence": 1
        },
        {
          "type": "punct",
          "value": " "
        },
        {
          "type": "text",
          "value": "World",
          "ts": 1.75,
          "end_ts": 2.85,
          "confidence": 0.8
        },
        {
          "type": "punct",
          "value": "."
        }
      ]
    },
    {
      ...
    }
  ]
}

Learn more about punctuation control in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.
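
For example, automatic punctuation can be disabled on a per-job basis. This sketch assumes the skip_punctuation job option described in the Asynchronous Speech-to-Text API Reference; the media_url value is a placeholder:

curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
    -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{ "media_url": "https://example.com/audio.mp3", "skip_punctuation": true }'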

Disfluency or filler word removal

Disfluencies can be distracting because they break the flow of speech, especially in written text. The APIs currently filter only "ums" and "uhs"; when this setting is enabled, those disfluencies do not appear in the transcription output.
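
Here is a sketch of enabling this behavior for an asynchronous job, assuming the remove_disfluencies job option described in the API Reference; the media_url value is a placeholder:

curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
    -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{ "media_url": "https://example.com/audio.mp3", "remove_disfluencies": true }'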

Learn more about disfluency removal in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.

Profanity filtering

The current profanity dictionary contains approximately 600 profane words and phrases. When this feature is enabled, every transcribed word on the list is masked with asterisks, leaving only its first and last characters visible (for example, "damn" would appear as "d**n").
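
Here is a sketch of enabling the filter for an asynchronous job, assuming the filter_profanity job option described in the API Reference; the media_url value is a placeholder:

curl -X POST "https://api.rev.ai/speechtotext/v1/jobs" \
    -H "Authorization: Bearer <REVAI_ACCESS_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{ "media_url": "https://example.com/audio.mp3", "filter_profanity": true }'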

Learn more about profanity filtering in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.

Timestamps

The JSON transcription output includes timestamps for every transcribed word. Timestamps correspond to when the words are spoken within the audio and can be used for alignment, analytics, live captions, etc.

Here is an example of a transcript with timestamps:

{
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hi",
          "ts": 0.27,
          "end_ts": 0.48,
          "confidence": 1
        },
        {
          "type": "text",
          "value": "my",
          "ts": 0.51,
          "end_ts": 0.66,
          "confidence": 1
        },
        {
          "type": "text",
          "value": "name's",
          "ts": 0.66,
          "end_ts": 0.84,
          "confidence": 0.84
        },
        {
          "type": "text",
          "value": "Jack",
          "ts": 0.84,
          "end_ts": 1.05,
          "confidence": 0.99
        },
        {
          ...
        }
      ]
    }
  ]
}
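
Because timestamps are ordinary JSON fields, they are easy to extract with standard tools. Here is a minimal sketch using jq, assuming a complete transcript is saved as transcript.json:

jq -r '.monologues[].elements[] | select(.type == "text") | "\(.ts)\t\(.end_ts)\t\(.value)"' transcript.json

This prints one line per word with its start time, end time, and text.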

Learn more about working with timestamps in the Asynchronous Speech-to-Text API Reference and the Streaming Speech-to-Text API Reference.

Speaker separation or diarization

Speaker diarization is the process of separating audio into segments according to speaker identity. Diarization is performed by default on all audio processed through the Asynchronous Speech-to-Text API. When multiple speakers are detected, transcription output is separated by speaker.

Here is an example of a transcript with multiple speakers:

{
  "monologues": [
    {
      "speaker": 0,
      "elements": [
        {
          "type": "text",
          "value": "Hi",
          "ts": 0.27,
          "end_ts": 0.48,
          "confidence": 1
        },
        ...
      ]
    },
    {
      "speaker": 1,
      "elements": [
        {
          "type": "text",
          "value": "Although",
          "ts": 3.14,
          "end_ts": 3.56,
          "confidence": 1
        },
        ...
      ]
    }
  ]
}
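
To turn the diarized output back into a readable per-speaker script, the monologues can be flattened. Here is a minimal sketch using jq, assuming a complete transcript is saved as transcript.json:

jq -r '.monologues[] | "Speaker \(.speaker): \([.elements[].value] | join(""))"' transcript.json

Concatenating each monologue's text and punct element values reconstructs the sentence, prefixed with its speaker label.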

Learn more about speaker diarization in the Asynchronous Speech-to-Text API Reference and Best Practices Guide.