Requests

There are two ways to interact with the Streaming Speech-to-Text API:

  • WebSocket protocol
  • RTMP streams

All connections to this API start as either a WebSocket handshake HTTP request (for WebSocket streams) or an HTTP POST request (for RTMP streams).

On successful authorization, the client can start sending binary WebSocket messages containing audio data or an RTMP audio stream in one of our supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.

Request parameters

A WebSocket request to the Streaming Speech-to-Text API consists of the following parts:

| Parameter | Query parameter | Required | Default |
|-----------|-----------------|----------|---------|
| Base URL (WebSocket) or read URL (RTMP) | n/a | Yes | n/a |
| Access Token | access_token | Yes | None |
| Content Type | content_type | Yes | None |
| Language | language | No | en |
| Metadata | metadata | No | None |
| Custom Vocabulary | custom_vocabulary_id | No | None |
| Profanity Filter | filter_profanity | No | false |
| Disfluencies | remove_disfluencies | No | false |
| Delete After Seconds | delete_after_seconds | No | None |
| Detailed Partials | detailed_partials | No | false |
| Start Timestamp | start_ts | No | None |
| Maximum Segment Duration Seconds | max_segment_duration_seconds | No | None |
| Transcriber | transcriber | No | See transcriber section |
| Speaker Switch Detection | enable_speaker_switch | No | false |
| Skip Post-processing | skip_postprocessing | No | false |
| Priority | priority | No | speed |
| Maximum Wait Time for Connection | max_connection_wait_seconds | No | 60 |

Access token

Clients must authenticate by including their Rev AI access token as a query parameter in their requests. If access_token is invalid or the query parameter is not present, the WebSocket connection will be closed with code 4001.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1

Content type

All requests must also contain a content_type query parameter. The content type describes the format of audio data being sent. If you are submitting raw audio, Rev AI requires extra parameters as shown below. If the content type is invalid or not set, the WebSocket connection is closed with a 4002 close code.

Rev AI officially supports these content types:

  • audio/x-raw
  • audio/x-flac

RAW file content type

You are required to provide additional information in content_type when content_type is audio/x-raw.

| Parameter (type) | Description | Allowed values | Required |
|------------------|-------------|----------------|----------|
| layout (string) | The layout of channels within a buffer: "interleaved" (LRLRLRLR) or "non-interleaved" (LLLLRRRR). Not case-sensitive | interleaved, non-interleaved | audio/x-raw only |
| rate (int) | Sample rate of the audio bytes | Inclusive range from 8000 to 48000 Hz | audio/x-raw only |
| format (string) | Format of the audio samples. Case-sensitive | List of valid formats | audio/x-raw only |
| channels (int) | Number of audio channels that the audio samples contain | Inclusive range from 1 to 10 channels | audio/x-raw only |

These parameters follow the content_type, delimited by semi-colons (;). Each parameter should be specified in the format parameter_name=parameter_value.
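As an illustrative sketch (not an official SDK call), the semicolon-delimited content_type value and the full request URL can be assembled programmatically; the default values below match the examples in this document:

```python
def build_raw_content_type(layout="interleaved", rate=16000,
                           fmt="S16LE", channels=1):
    """Assemble the content_type for audio/x-raw: each extra parameter
    is appended as name=value, delimited by semicolons."""
    return (f"audio/x-raw;layout={layout};rate={rate};"
            f"format={fmt};channels={channels}")

# The access token placeholder must be replaced with a real token.
content_type = build_raw_content_type()
url = ("wss://api.rev.ai/speechtotext/v1/stream"
       f"?access_token=<REVAI_ACCESS_TOKEN>&content_type={content_type}")
```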

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1

Language

attention

Custom prices (other than the default) are set independently per language. Please refer to your contract for pricing information. If you are not a contract customer, see Rev AI's published pricing.

Specify the transcription language with the language query parameter. When the language is not provided, transcription will default to English. The language query parameter cannot be used along with the following options: filter_profanity, remove_disfluencies, and custom_vocabulary_id.

| Language | Language code |
|----------|---------------|
| English | en |
| French | fr |
| German | de |
| Italian | it |
| Japanese | ja |
| Korean | ko |
| Mandarin | cmn |
| Portuguese | pt |
| Spanish | es |

Additional requirements for content type:

  • content_type must be audio/x-raw or audio/x-flac
  • when providing raw audio, it must be formatted as S16LE
  • rate must be included, regardless of content_type

Examples

# WebSocket protocol with audio/x-raw
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&language=<LANGUAGE_CODE>

# WebSocket protocol with audio/x-flac
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-flac;rate=16000&language=<LANGUAGE_CODE>

Metadata

Metadata to be associated with the job may be provided via the metadata query parameter.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>

Custom vocabulary

If your streaming session contains domain-specific words or phrases that may not be in our dictionary, you can improve the accuracy of your transcript by creating a custom vocabulary to include with your streaming session.

The custom vocabulary ID can be included as the custom_vocabulary_id query parameter in the request to the Streaming Speech-to-Text API. The API will then be able to recognize your custom vocabulary words whenever they are spoken.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-wav&custom_vocabulary_id=cv5ZqltdU0lFA4
attention

Learn more about using a custom vocabulary.

Profanity filter

Optionally, you can filter profanity by adding the filter_profanity=true query parameter to your streaming request. This setting is disabled by default. We filter approximately 600 profanities, which covers most use cases. If a transcribed word matches a word on this list, all characters of the word are replaced by asterisks except for the first and last character.
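The masking rule described above (every character except the first and last replaced by an asterisk) can be sketched as a small helper; this is illustrative only, not Rev AI's implementation:

```python
def mask_profanity(word):
    """Replace all characters except the first and last with asterisks."""
    if len(word) <= 2:
        return word
    return word[0] + "*" * (len(word) - 2) + word[-1]
```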

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&filter_profanity=true

Here is an example partial hypothesis response with filtered profanity:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 1.43,
  "elements": [
    {
      "type": "text",
      "value": "a*****e"
    }
  ]
}

Disfluencies

You can choose to remove disfluencies from showing up in the resulting transcript by adding the remove_disfluencies=true query parameter to your streaming request. This setting is optional and defaults to false.

attention

Currently, disfluencies are defined as 'ums' and 'uhs'. When remove_disfluencies is set to true, disfluencies will not appear in either partial or final hypotheses.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&remove_disfluencies=true

Here is an example partial hypothesis response with disfluencies:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "uh"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}

Here is the partial hypothesis response for the same audio without disfluencies:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}

Delete after seconds

You can specify how many seconds after completion the job should be deleted by adding the delete_after_seconds query parameter to your streaming request. The number of seconds provided must range from 0 to 2592000 seconds (30 days).

It may take up to 2 minutes after the scheduled time for the job to be deleted unless delete_after_seconds = 0. When delete_after_seconds = 0, the audio and transcript are immediately deleted and are never stored.

attention

The delete_after_seconds parameter is optional but, when provided, it overrides the auto delete preference set on your Rev AI account.
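Since the documented range is 0 to 2592000 seconds (30 days), a client-side check before building the request URL might look like the following sketch (the function name is an illustrative choice, not part of any SDK):

```python
MAX_DELETE_AFTER_SECONDS = 2_592_000  # 30 days, the documented upper bound

def validate_delete_after_seconds(seconds):
    """Raise if the value falls outside the documented 0..2592000 range."""
    if not 0 <= seconds <= MAX_DELETE_AFTER_SECONDS:
        raise ValueError(
            f"delete_after_seconds must be between 0 and {MAX_DELETE_AFTER_SECONDS}")
    return seconds
```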

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&delete_after_seconds=SECONDS

Detailed partials

You can choose to receive timestamps and confidence scores in the partial hypotheses by adding the query parameter detailed_partials=true. This setting is disabled by default.

The detailed_partials parameter enables usage of values that are otherwise only available in the final hypotheses. For example:

  • Timestamps in partials can be used to display transcribed words earlier.
  • Confidence scores offer the option to write custom logic for displaying the transcribed words.
warning

When using detailed_partials, you can expect a slight degradation of 1% in word error rate for the final hypothesis.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&detailed_partials=true

Here is an example partial hypothesis response without additional details:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "begins"
    }
  ]
}

Here is an example partial hypothesis response for the same audio with additional details:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life",
      "ts": 0.0,
      "end_ts": 1.5,
      "confidence": 0.83
    },
    {
      "type": "text",
      "value": "begins",
      "ts": 1.5,
      "end_ts": 1.98,
      "confidence": 0.7
    }
  ]
}
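Because detailed partials carry per-word confidence scores, a client can, for example, display only words whose confidence meets a threshold. A minimal sketch over the response shape above (the 0.8 threshold is an arbitrary illustrative choice):

```python
def confident_words(partial, threshold=0.8):
    """Return the text values whose confidence meets the threshold."""
    return [el["value"] for el in partial["elements"]
            if el["type"] == "text" and el.get("confidence", 0) >= threshold]

partial = {
    "type": "partial", "ts": 0.0, "end_ts": 2.18,
    "elements": [
        {"type": "text", "value": "life", "ts": 0.0, "end_ts": 1.5, "confidence": 0.83},
        {"type": "text", "value": "begins", "ts": 1.5, "end_ts": 1.98, "confidence": 0.7},
    ],
}
```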

Start timestamp

You can provide a starting timestamp to offset all hypotheses timings by adding start_ts as a query parameter to the request. If provided, all output hypotheses will have their ts and end_ts offset by the amount of seconds provided in the start_ts parameter.

The start_ts parameter must be a positive double value.
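The effect of start_ts is a uniform shift: every ts and end_ts in the output equals the unshifted value plus the offset. A sketch of the equivalent client-side computation (illustrative only, not how the API implements it):

```python
def apply_offset(hypothesis, start_ts):
    """Shift all timestamps in a hypothesis dict by start_ts seconds."""
    shifted = dict(hypothesis)
    shifted["ts"] = round(hypothesis["ts"] + start_ts, 2)
    shifted["end_ts"] = round(hypothesis["end_ts"] + start_ts, 2)
    shifted["elements"] = [
        # punct elements carry no timestamps, so they pass through unchanged
        {**el, **({"ts": round(el["ts"] + start_ts, 2),
                   "end_ts": round(el["end_ts"] + start_ts, 2)}
                  if "ts" in el else {})}
        for el in hypothesis["elements"]
    ]
    return shifted
```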

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&start_ts=60.5

Here is an example response without a specified start timestamp:

{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}

Here is an example response for the same audio with the start timestamp set to 60.5:

{
  "type": "final",
  "ts": 61.51,
  "end_ts": 63.7,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 61.54,
      "end_ts": 62.05,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 62.34,
      "end_ts": 62.65,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}

Maximum segment duration seconds

You can provide the maximum time you are willing to wait between final hypotheses by adding the max_segment_duration_seconds query parameter to your streaming request. The number of seconds provided must range from 5 to 30 seconds.

attention

This parameter potentially changes the amount of context our engine has when creating final hypotheses and therefore has a minor effect on word error rate. Higher values correlate with fewer errors in transcription.

warning

The max_segment_duration_seconds value is not exact. Because words may fall on the boundary of final hypotheses, final hypotheses may be up to 0.5 seconds longer than your specified value. For example, if you specify max_segment_duration_seconds=5, final hypotheses may be up to 5.5 seconds in length.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_segment_duration_seconds=15

Output format doesn't change as a result of this parameter.

Transcriber

You can define a specific transcription model to use for your stream with the transcriber option.

The output format does not change, regardless of which transcriber is selected.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2

Speaker switch detection

warning

This option is only available for streams using the transcriber=machine_v2 option.

You can choose to enable speaker switch detection by adding enable_speaker_switch=true to your request. Doing so will add a new field named speaker_id to final hypotheses. Whenever the system detects that the active speaker in the audio has changed, the speaker_id field will be incremented to indicate this change.
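As an illustration of consuming speaker_id, a client might detect a speaker change by comparing the field across consecutive final hypotheses (a hypothetical helper, not part of the API):

```python
def speaker_changed(previous_final, current_final):
    """Return True when the active speaker differs between two
    consecutive final hypotheses (speaker_id increments on a change)."""
    return previous_final.get("speaker_id") != current_final.get("speaker_id")
```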

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&enable_speaker_switch=true

Here is an example response with speaker switch detection enabled:

{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "speaker_id": 1000,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}

Skip Post-processing

attention

Only available for the English and Spanish languages.

You can choose to skip post-processing operations, such as inverse text normalization (ITN), casing, and punctuation, by adding skip_postprocessing=true to your request. Doing so results in a small decrease in latency; however, your final hypotheses will no longer contain capitalization, punctuation, or inverse text normalization (for example, "five hundred" will not be normalized to 500).

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&skip_postprocessing=true

Here is an example response with skip_postprocessing enabled:

{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    {
      "type": "text",
      "value": "five",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "hundred",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "dollars",
      "ts": 2.23,
      "end_ts": 2.87,
      "confidence": 1.0
    }
  ]
}

Priority

attention

Only available for the English and Spanish languages and the machine_v2 transcriber.

You can configure what our engine should prioritize for your stream. The available choices are speed and accuracy.

  • speed is the default and means the engine will prioritize creating partial and final hypotheses more frequently to reduce latency.
  • accuracy will cause the engine to produce results less frequently but will greatly increase the accuracy of results.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&priority=accuracy

Maximum wait time for connection

You can provide the maximum time you are willing to wait for a WebSocket connection to become available by adding the max_connection_wait_seconds query parameter to your streaming request. The number of seconds provided must range from 60 to 600 seconds.

To handle the rare occasion that the API is unable to allocate a speech worker immediately, set this value to something greater than 60 seconds. This helps the API prioritize and assign requests to speech workers in the order the requests were received, and scale up faster to meet the total number of requested connections.

If this timeout is reached, the client will receive a WebSocket close message with error code 4013. This indicates that the API is unable to service the connection within the specified wait time and that the client should try again later.

attention

An increased connection latency does not mean an increased partial latency. Once connected, partial latencies will behave as normal. The client can buffer audio data during the connection wait period and send it as a burst of audio data messages after the connection to a worker is established. Audio will be processed faster than real-time and eventually partials will catch up to the client in real-time. The catch-up period is dependent on the amount of audio buffered, which is often a function of the buffered audio duration (the connection wait period).

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_connection_wait_seconds=300

Request stages

RTMP streams

Initial connection

All requests begin as an HTTP POST request. An RTMP streaming session is requested with the user’s access token as a Bearer authentication token. The request body is a JSON document with streaming job options.

Client --> Rev AI
POST /speechtotext/v1/live_stream/rtmp HTTP/1.1
Host: api.rev.ai
Accept: */*
Authorization: Bearer <REVAI_ACCESS_TOKEN>
Content-Type: application/json
Content-Length: <BODY_LENGTH>

{
  "metadata": "test",
  "filter_profanity": true,
  "detailed_partials": true,
  "remove_disfluencies": true,
  "transcriber": "machine_v2",
  "max_segment_duration_seconds": 20,
  "custom_vocabulary_id": "cv5ZqltdU0lFA4",
  "enable_speaker_switch": false,
  "skip_postprocessing": false,
  "priority": "accuracy"
}
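A sketch of constructing this POST request with Python's standard library follows; the options shown are a subset chosen for illustration, and the placeholder token must be replaced before actually sending:

```python
import json
import urllib.request

# A subset of streaming job options, for illustration.
options = {
    "metadata": "test",
    "filter_profanity": True,
    "max_segment_duration_seconds": 20,
}
body = json.dumps(options).encode("utf-8")

req = urllib.request.Request(
    "https://api.rev.ai/speechtotext/v1/live_stream/rtmp",
    data=body,
    headers={
        "Authorization": "Bearer <REVAI_ACCESS_TOKEN>",
        "Content-Type": "application/json",
    },
    method="POST",
)
# response = urllib.request.urlopen(req)  # requires a valid access token
```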

If authorization is successful, the client receives a JSON response which looks like the example response below:

{
  "ingestion_url":"rtmps://rtmp.rev.ai/streaming/v1",
  "stream_name":"wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60",
  "read_url":"wss://api.rev.ai/speechtotext/v1/read_stream?read_token=ZZ-132"
}

WebSocket reader connection

A WebSocket connection must be opened to the read_url in order to receive streaming results. Once this connection is opened, results from your RTMP audio stream are generated in the same response format as that for WebSocket streams.

warning

The WebSocket connection should be opened before the RTMP connection is opened in order to ensure no audio data is skipped.

RTMP audio submission

The RTMP audio can now be streamed to the ingestion_url endpoint provided in the initial response, using the provided stream_name as the stream name for that session.

For example, to stream RTMP audio using ffmpeg, use the following command, replacing the <AUDIO_FILE> placeholder with the path to your audio file and the <INGESTION_URL> and <STREAM_NAME> placeholders with the values from the initial JSON response.

ffmpeg -re -i <AUDIO_FILE> -c copy -f flv <INGESTION_URL>/<STREAM_NAME>
attention

When using ffmpeg, escape the & characters within the <STREAM_NAME> placeholder by placing a backslash (\) before each of them.
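The escaping mentioned above can be done programmatically; a sketch using the example stream_name from the earlier response (the <AUDIO_FILE> and <INGESTION_URL> placeholders are left as-is):

```python
# Example stream_name from the earlier JSON response.
stream_name = "wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60"

# Backslash-escape each & so the shell does not interpret it.
escaped = stream_name.replace("&", r"\&")
command = f"ffmpeg -re -i <AUDIO_FILE> -c copy -f flv <INGESTION_URL>/{escaped}"
```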

WebSocket protocol

Initial connection

All requests begin as an HTTP GET request. A WebSocket request is declared by including the header value Upgrade: websocket and Connection: Upgrade.

Client --> Rev AI
GET /speechtotext/v1/stream HTTP/1.1
Host: api.rev.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: Chxzu/uTUCmjkFH9d/8NTg==
Sec-WebSocket-Version: 13
Origin: http://api.rev.ai

If authorization is successful, the request is upgraded to a WebSocket connection.

Client <-- Rev AI
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: z0pcAwXZZRVlMcca8lmHCPzvrKU=

After the connection has been upgraded, the server will return a "connected" message. You must wait for this message before sending binary audio data. The response includes an id, which is the corresponding job identifier, as shown in the example below:

{
    "type": "connected",
    "id": s1d24ax2fd21
}
warning

If Rev AI does not currently have the capacity to handle the request, a WebSocket close message is returned with a status code of 4013. An HTTP/1.1 400 Bad Request response indicates that the request is not a WebSocket upgrade request.

Audio submission

WebSocket messages sent to Rev AI must be one of these two message types:

  • Binary: Audio data is transmitted as binary messages and should be sent in chunks of 250ms or more. Streams sending audio chunks smaller than 250ms may experience increased transcription latency. The format of the audio must match that specified in the content_type parameter.
  • Text: The client should send an End-Of-Stream ("EOS") text message to signal the end of audio data and thus gracefully close the WebSocket connection. On an EOS message, Rev AI returns a final hypothesis along with a WebSocket close message. Currently, this is the only supported text message.

WebSocket close messages are explicitly not supported as a client message type and will abruptly close the socket connection with a 1007 Invalid Payload error. Clients will not receive their final hypothesis in this case.

Any other text messages, including incorrectly capitalized messages such as "eos" and "Eos", are invalid and will also close the socket connection with a 1007 Invalid Payload error.
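For raw S16LE audio, a 250ms chunk works out to rate × channels × 2 bytes × 0.25 (8000 bytes at 16 kHz mono). A sketch of splitting a PCM buffer into binary messages of that size, under those assumed audio parameters:

```python
def audio_chunks(pcm_bytes, rate=16000, channels=1, sample_width=2,
                 chunk_ms=250):
    """Yield successive slices of raw PCM audio, each sized for one
    WebSocket binary message of chunk_ms milliseconds."""
    chunk_size = int(rate * channels * sample_width * chunk_ms / 1000)
    for start in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[start:start + chunk_size]
```

Each yielded chunk would be sent as one binary WebSocket message, followed by the "EOS" text message once the audio is exhausted.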
attention

See examples of the sequence of messages between a client and the Streaming Speech-to-Text API.