Requests

There are two ways to interact with the Streaming Speech-to-Text API:

  • WebSocket protocol
  • RTMP streams

All connections to this API start as either a WebSocket handshake HTTP request (for WebSocket streams) or an HTTP POST request (for RTMP streams).

On successful authorization, the client can start sending binary WebSocket messages containing audio data or an RTMP audio stream in one of our supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.

Request parameters

A WebSocket request to the Streaming Speech-to-Text API consists of the following parts:

Request parameter     Query parameter                         Required  Default
Base URL              URL (WebSocket) or read_url URL (RTMP)  Yes
Access token          access_token                            Yes       None
Content type          content_type                            Yes       None
Language              language                                No        en
Metadata              metadata                                No        None
Custom vocabulary     custom_vocabulary_id                    No        None
Profanity filter      filter_profanity                        No        false
Disfluencies          remove_disfluencies                     No        false
Delete after seconds  delete_after_seconds                    No        None
Detailed partials     detailed_partials                       No        false
Start timestamp       start_ts                                No        None

Access token

Clients must authenticate by including their Rev AI access token as a query parameter in their requests. If access_token is invalid or the query parameter is not present, the WebSocket connection will be closed with code 4001.
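As a sketch using only the Python standard library (the helper name is ours, not part of any Rev AI SDK), the token can be attached as a query parameter like so:

```python
from urllib.parse import urlencode

def stream_url(access_token, content_type):
    """Hypothetical helper: build the streaming URL with the required
    access_token query parameter. A missing or invalid token causes the
    server to close the WebSocket with code 4001."""
    base = "wss://api.rev.ai/speechtotext/v1/stream"
    query = urlencode(
        {"access_token": access_token, "content_type": content_type},
        safe=";=/",  # keep raw-audio sub-parameters readable, as in the docs
    )
    return f"{base}?{query}"

url = stream_url("MY_TOKEN", "audio/x-flac;rate=16000")
```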

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Content type

All requests must also contain a content_type query parameter, which describes the format of the audio data being sent. If you are submitting raw audio, Rev AI requires extra parameters as shown below. If the content type is invalid or not set, the WebSocket connection is closed with a 4002 close code.

Rev AI officially supports these content types:

  • audio/x-raw
  • audio/x-flac
  • audio/x-wav

RAW file content type

When content_type is audio/x-raw, you are required to provide additional information about the audio format:

Parameter (type)  Description                                               Allowed values                                          Required
layout (string)   Layout of channels within a buffer; not case-sensitive    "interleaved" (LRLRLRLR), "non-interleaved" (LLLLRRRR)  audio/x-raw only
rate (int)        Sample rate of the audio bytes                            Inclusive range from 8000 to 48000 Hz                   audio/x-raw only
format (string)   Format of the audio samples; case-sensitive               List of valid formats                                   audio/x-raw only
channels (int)    Number of audio channels that the audio samples contain   Inclusive range from 1 to 10 channels                   audio/x-raw only

These parameters follow the content_type value, delimited by semicolons (;). Each parameter is specified as parameter_name=parameter_value.
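For example, a hypothetical helper (not part of any Rev AI SDK) to assemble the content_type value for raw audio:

```python
def raw_content_type(layout="interleaved", rate=16000,
                     audio_format="S16LE", channels=1):
    """Assemble a content_type value for audio/x-raw, with each
    sub-parameter appended as name=value, delimited by semicolons."""
    if not 8000 <= rate <= 48000:
        raise ValueError("rate must be within 8000-48000 Hz")
    if not 1 <= channels <= 10:
        raise ValueError("channels must be within 1-10")
    return (f"audio/x-raw;layout={layout};rate={rate}"
            f";format={audio_format};channels={channels}")
```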

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Language

attention

Prices for languages other than the default are set independently by language. Please refer to your contract for pricing information. If you are not a contract customer, see Rev AI's published pricing page.

Specify the transcription language with the language query parameter. When the language is not provided, transcription will default to English. The language query parameter cannot be used along with the following options: filter_profanity, remove_disfluencies, and custom_vocabulary_id.

Language Language Code
English en
French fr
German de
Italian it
Japanese ja
Korean ko
Mandarin cmn
Portuguese pt
Spanish es

Additional requirements for content type:

  • content_type must be audio/x-raw or audio/x-flac
  • when providing raw audio, it must be formatted as S16LE
  • rate must be included, regardless of content_type
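A client-side sanity check for these restrictions might look like the sketch below (hypothetical helper; it assumes the restriction applies whenever language is explicitly provided):

```python
# Parameters the API rejects in combination with the language parameter.
INCOMPATIBLE_WITH_LANGUAGE = {
    "filter_profanity", "remove_disfluencies", "custom_vocabulary_id",
}

def check_language_params(params):
    """Raise if `language` is combined with a parameter the API rejects."""
    if "language" in params:
        clash = INCOMPATIBLE_WITH_LANGUAGE & set(params)
        if clash:
            raise ValueError(
                f"language cannot be combined with: {sorted(clash)}")
```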

Examples

# WebSocket protocol with audio/x-raw
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&language=<LANGUAGE_CODE>

# WebSocket protocol with audio/x-flac
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-flac;rate=16000&language=<LANGUAGE_CODE>

Metadata

Metadata to be associated with the job may be provided via the metadata query parameter.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Custom vocabulary

If your streaming session contains domain-specific words or phrases that may not be in our dictionary, you can improve the accuracy of your transcript by creating a custom vocabulary to include with your streaming session.

The custom vocabulary ID can be included as the custom_vocabulary_id query parameter in the request to the Streaming Speech-to-Text API. The API will then recognize your custom vocabulary terms whenever they are spoken.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-wav&custom_vocabulary_id=cv5ZqltdU0lFA4

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

attention

Learn more about using a custom vocabulary.

Profanity filter

Optionally, you can filter profanity by adding the filter_profanity=true query parameter to your streaming request. This setting is disabled by default. We filter approximately 600 profanities, which covers most use cases. If a transcribed word matches a word on this list, all characters of the word will be replaced by asterisks except for the first and last character.
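The masking rule described above can be sketched as follows (our illustrative helper, not Rev AI code; the behavior for one- and two-letter words is an assumption):

```python
def mask_profanity(word):
    """Keep the first and last character; replace the rest with asterisks."""
    if len(word) <= 2:
        return word  # assumption: nothing to mask between first and last
    return word[0] + "*" * (len(word) - 2) + word[-1]
```

Applied to a seven-letter profanity, this yields a value of the form "a*****e", matching the sample response below.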

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&filter_profanity=true

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Here is an example partial hypothesis response with filtered profanity:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 1.43,
  "elements": [
    {
      "type": "text",
      "value": "a*****e"
    }
  ]
}

Disfluencies

You can choose to remove disfluencies from showing up in the resulting transcript by adding the remove_disfluencies=true query parameter to your streaming request. This setting is optional and defaults to false.

attention

Currently, disfluencies are defined as 'ums' and 'uhs'. When remove_disfluencies is set to true, disfluencies will not appear in either partial or final hypotheses.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&remove_disfluencies=true

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Here is an example partial hypothesis response with disfluencies:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "uh"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}

Here is the partial hypothesis response for the same audio without disfluencies:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}

Delete after seconds

You can specify how many seconds after completion the job should be deleted by adding the delete_after_seconds query parameter to your streaming request. The number of seconds provided must be in the inclusive range from 0 to 2592000 seconds (30 days).

It may take up to 2 minutes after the scheduled time for the job to be deleted unless delete_after_seconds = 0. When delete_after_seconds = 0, the audio and transcript are immediately deleted and are never stored.

attention

The delete_after_seconds parameter is optional but, when provided, it overrides the auto delete preference set on your Rev AI account.
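As a quick sanity check of the allowed range (hypothetical helper, not Rev AI code):

```python
MAX_DELETE_AFTER_SECONDS = 30 * 24 * 60 * 60  # 30 days == 2592000 seconds

def validate_delete_after_seconds(seconds):
    """Ensure the value falls in the inclusive range 0..2592000."""
    if not 0 <= seconds <= MAX_DELETE_AFTER_SECONDS:
        raise ValueError("delete_after_seconds must be within 0..2592000")
    return seconds
```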

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&delete_after_seconds=SECONDS

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Detailed partials

You can choose to receive timestamps and confidence scores in the partial hypotheses by adding the query parameter detailed_partials=true. This setting is disabled by default.

The detailed_partials parameter allows usage of values that are otherwise only available in the final hypotheses. For example:

  • Timestamps in partials can be used to display transcribed words earlier.
  • Confidence scores offer the option to write custom logic for displaying the transcribed words.
warning

When using detailed_partials, you can expect a slight degradation of 1% in word error rate for the final hypothesis.

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&detailed_partials=true

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Here is an example partial hypothesis response without additional details:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "begins"
    }
  ]
}

Here is an example partial hypothesis response for the same audio with additional details:

{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life",
      "ts": 0.0,
      "end_ts": 1.5,
      "confidence": 0.83
    },
    {
      "type": "text",
      "value": "begins",
      "ts": 1.5,
      "end_ts": 1.98,
      "confidence": 0.7
    }
  ]
}
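Custom display logic of the kind mentioned above could, for example, show only words whose confidence clears a threshold (illustrative sketch; the helper name is ours):

```python
def confident_words(hypothesis, threshold=0.8):
    """Return the words in a detailed partial hypothesis whose
    confidence score meets the threshold."""
    return [
        el["value"]
        for el in hypothesis["elements"]
        if el.get("type") == "text" and el.get("confidence", 0.0) >= threshold
    ]
```

Applied to the detailed partial above, this keeps "life" (confidence 0.83) but drops "begins" (confidence 0.7).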

Start timestamp

You can provide a starting timestamp to offset all hypotheses timings by adding start_ts as a query parameter to the request. If provided, all output hypotheses will have their ts and end_ts offset by the amount of seconds provided in the start_ts parameter.

The start_ts parameter must be a positive double value.
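The server-side behavior is equivalent to adding start_ts to every timestamp in the output. A client-side illustration (our sketch, not Rev AI code):

```python
def offset_hypothesis(hypothesis, start_ts):
    """Shift ts/end_ts of a hypothesis and all its elements by start_ts
    seconds, mirroring what the server does when start_ts is provided."""
    def shift(obj):
        out = dict(obj)
        for key in ("ts", "end_ts"):
            if key in out:
                out[key] = round(out[key] + start_ts, 2)
        return out

    shifted = shift(hypothesis)
    shifted["elements"] = [shift(el) for el in hypothesis["elements"]]
    return shifted
```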

Example

# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&start_ts=60.5

# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Here is an example response without a specified start timestamp:

{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}

Here is an example response for the same audio with the start timestamp set to 60.5:

{
  "type": "final",
  "ts": 61.51,
  "end_ts": 63.7,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 61.54,
      "end_ts": 62.05,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 62.34,
      "end_ts": 62.65,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}

Request stages

RTMP streams

Initial connection

All requests begin as an HTTP POST request. An RTMP streaming session is requested with the user’s access token as a Bearer authentication token. The request body is a JSON document with streaming job options.

Client --> Rev AI
POST /speechtotext/v1/live_stream/rtmp HTTP/1.1
Host: api.rev.ai
Accept: */*
Authorization: Bearer <REVAI_ACCESS_TOKEN>
Content-Type: application/json
Content-Length: 117

{
  "metadata":"test",
  "filter_profanity": "true",
  "detailed_partials": "true",
  "remove_disfluencies": "true"
}

If authorization is successful, the client receives a JSON response which looks like the example response below:

{
  "ingestion_url":"rtmps://rtmp.rev.ai/streaming/v1",
  "stream_name":"wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60",
  "read_url":"wss://api.rev.ai/speechtotext/v1/read_stream?read_token=ZZ-132"
}

WebSocket reader connection

A WebSocket connection must be opened to the read_url URL in order to receive streaming results. Once this connection is opened, results from your RTMP audio stream will be generated in the same response format as that for WebSocket streams.

warning

The WebSocket connection should be opened before the RTMP connection is opened in order to ensure no audio data is skipped.

RTMP audio submission

The RTMP audio can now be streamed to the ingestion_url URL endpoint provided in the initial response using the provided stream_name as the stream name for that session.

For example, to stream RTMP audio using ffmpeg, use the following command, replacing the <AUDIO_FILE> placeholder with the path to your audio file and the <INGESTION_URL> and <STREAM_NAME> placeholders with the values from the initial JSON response.

ffmpeg -re -i <AUDIO_FILE> -c copy -f flv <INGESTION_URL>/<STREAM_NAME>
attention

When using ffmpeg, escape the & characters within the <STREAM_NAME> placeholder by placing a backslash (\) before each of them.
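When constructing the command programmatically, the escaping can be done like so (hypothetical helper, not part of any Rev AI SDK):

```python
def ffmpeg_target(ingestion_url, stream_name):
    """Join the ingestion URL and stream name, escaping '&' in the
    stream name so the target survives the shell when passed to ffmpeg."""
    return f"{ingestion_url}/" + stream_name.replace("&", r"\&")
```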

WebSocket protocol

Initial connection

All requests begin as an HTTP GET request. A WebSocket request is declared by including the header value Upgrade: websocket and Connection: Upgrade.

Client --> Rev AI
GET /speechtotext/v1/stream HTTP/1.1
Host: api.rev.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: Chxzu/uTUCmjkFH9d/8NTg==
Sec-WebSocket-Version: 13
Origin: http://api.rev.ai

If authorization is successful, the request is upgraded to a WebSocket connection.

Client <-- Rev AI
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: z0pcAwXZZRVlMcca8lmHCPzvrKU=

After the connection has been upgraded, the server returns a "connected" message. You must wait for this message before sending any binary audio data. The response includes an id, the corresponding job identifier, as shown in the example below:

{
    "type": "connected",
    "id": "s1d24ax2fd21"
}
warning

If Rev AI currently does not have the capacity to handle the request, a WebSocket close message is returned with a status code of 4013. An HTTP/1.1 400 Bad Request response indicates that the request is not a WebSocket upgrade request.
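Handling the handshake message might look like this sketch (json is from the Python standard library; the helper name is ours):

```python
import json

def parse_connected(message):
    """Validate the server's first text message and return the job id.
    No binary audio should be sent before this message arrives."""
    msg = json.loads(message)
    if msg.get("type") != "connected":
        raise RuntimeError(f"unexpected first message: {message!r}")
    return msg["id"]
```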

Audio submission

WebSocket messages sent to Rev AI must be one of these two WebSocket message types:

  • Binary: Audio data is transmitted as binary data and should be sent in chunks of 250ms or more. Streams sending audio chunks smaller than 250ms may experience increased transcription latency. The format of the audio must match that specified in the content_type parameter.
  • Text: The client should send an End-of-Stream ("EOS") text message to signal the end of audio data and gracefully close the WebSocket connection. On an EOS message, Rev AI returns a final hypothesis along with a WebSocket close message. Currently, this is the only supported text message type.

WebSocket close messages are explicitly not supported as a message type and will abruptly close the socket connection with a 1007 Invalid Payload error. Clients will not receive their final hypothesis in this case.

Any other text messages, including incorrectly capitalized variants such as "eos" and "Eos", are invalid and will also close the socket connection with a 1007 Invalid Payload error.
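For raw S16LE audio, the 250ms guidance translates into a minimum binary-message size that depends on the negotiated rate and channel count (illustrative calculation, not Rev AI code):

```python
def min_chunk_bytes(rate=16000, channels=1, bytes_per_sample=2, chunk_ms=250):
    """Bytes in chunk_ms of raw audio: rate x channels x bytes per sample."""
    return rate * channels * bytes_per_sample * chunk_ms // 1000

# 16 kHz mono S16LE: 16000 * 1 * 2 * 250 / 1000 = 8000 bytes per 250ms chunk.
# After the final chunk, send the text message "EOS" (exact capitalization)
# to close the stream gracefully and receive the final hypothesis.
```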
attention

See examples of the sequence of messages between a client and the Streaming Speech-to-Text API.