Requests
There are two ways to interact with the Streaming Speech-to-Text API:
- WebSocket protocol
- RTMP streams
All connections to this API start as either a WebSocket handshake HTTP request (for WebSocket streams) or an HTTP POST request (for RTMP streams).
On successful authorization, the client can start sending binary WebSocket messages containing audio data or an RTMP audio stream in one of our supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
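For orientation, here is a minimal Python sketch of the WebSocket flow, assuming the third-party websockets package and a 16000 Hz raw mono stream; the access token and file name are placeholders:

# Minimal sketch of the WebSocket flow using the "websockets" package
# (pip install websockets). Token and audio file are placeholders.
import asyncio
import json
import websockets

URL = (
    "wss://api.rev.ai/speechtotext/v1/stream"
    "?access_token=<REVAI_ACCESS_TOKEN>"
    "&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1"
)

async def stream(path: str) -> None:
    async with websockets.connect(URL) as ws:
        # Wait for the "connected" message before sending any audio.
        print(json.loads(await ws.recv()))
        with open(path, "rb") as audio:
            # Send binary chunks of at least 250ms; at 16 kHz, 16-bit mono
            # that is 16000 * 2 * 0.25 = 8000 bytes per chunk.
            while chunk := audio.read(8000):
                await ws.send(chunk)
        await ws.send("EOS")  # gracefully end the stream
        # Print hypotheses until the server closes the connection.
        async for message in ws:
            print(json.loads(message))

asyncio.run(stream("audio.raw"))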
Request parameters
A WebSocket request to the Streaming Speech-to-Text API consists of the following parts:
Request parameter | Query parameter | Required | Default |
---|---|---|---|
Base URL (WebSocket) or read_url URL (RTMP) | | Yes | |
Access token | access_token | Yes | None |
Content type | content_type | Yes | None |
Language | language | No | en |
Metadata | metadata | No | None |
Custom vocabulary | custom_vocabulary_id | No | None |
Profanity filter | filter_profanity | No | false |
Disfluencies | remove_disfluencies | No | false |
Delete after seconds | delete_after_seconds | No | None |
Detailed partials | detailed_partials | No | false |
Start timestamp | start_ts | No | None |
Access token
Clients must authenticate by including their Rev AI access token as the access_token query parameter in their requests. If access_token is invalid or the query parameter is not present, the WebSocket connection is closed with code 4001.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Content type
All requests must also contain a content_type query parameter. The content type describes the format of the audio data being sent. If you are submitting raw audio, Rev AI requires extra parameters as shown below. If the content type is invalid or not set, the WebSocket connection is closed with close code 4002.
Rev AI officially supports these content types:
- audio/x-raw (has additional requirements)
- audio/x-flac
- audio/x-wav
RAW file content type
You are required to provide additional information in content_type when content_type is audio/x-raw.
Parameter (type) | Description | Allowed values | Required |
---|---|---|---|
layout (string) | The layout of channels within a buffer. Possible values are "interleaved" (for LRLRLRLR) and "non-interleaved" (for LLLLRRRR). Not case-sensitive. | interleaved, non-interleaved | audio/x-raw only |
rate (int) | Sample rate of the audio bytes | Inclusive range from 8000 to 48000 Hz | audio/x-raw only |
format (string) | Format of the audio samples. Case-sensitive. See Allowed values column for valid values. | List of valid formats | audio/x-raw only |
channels (int) | Number of audio channels that the audio samples contain | Inclusive range from 1 to 10 channels | audio/x-raw only |
These parameters follow the content_type, delimited by semicolons (;). Each parameter should be specified in the format parameter_name=parameter_value.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
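The semicolon-delimited content_type value can also be assembled programmatically. A minimal Python sketch, assuming the standard library's urllib; the parameter values are illustrative:

# Sketch: assemble the semicolon-delimited content_type value and the full
# request URL for raw audio. Parameter names come from the table above.
from urllib.parse import quote

raw_params = {
    "layout": "interleaved",   # or "non-interleaved"
    "rate": 16000,             # 8000-48000 Hz inclusive
    "format": "S16LE",         # case-sensitive
    "channels": 1,             # 1-10 inclusive
}
content_type = "audio/x-raw;" + ";".join(f"{k}={v}" for k, v in raw_params.items())

url = (
    "wss://api.rev.ai/speechtotext/v1/stream"
    "?access_token=<REVAI_ACCESS_TOKEN>"
    f"&content_type={quote(content_type, safe='/;=')}"
)
print(url)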
Language
attention
Custom prices (other than the default) are set independently by language. Please refer to your contract for pricing information. If you are not a contract customer, pricing is found here.
Specify the transcription language with the language query parameter. When the language is not provided, transcription defaults to English. The language query parameter cannot be used along with the following options: filter_profanity, remove_disfluencies, and custom_vocabulary_id.
Language | Language Code |
---|---|
English | en |
French | fr |
German | de |
Italian | it |
Japanese | ja |
Korean | ko |
Mandarin | cmn |
Portuguese | pt |
Spanish | es |
Additional requirements for content type:
- content_type must be audio/x-raw or audio/x-flac
- when providing raw audio, it must be formatted as S16LE
- rate must be included, regardless of content_type
Examples
# WebSocket protocol with audio/x-raw
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&language=<LANGUAGE_CODE>
# WebSocket protocol with audio/x-flac
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-flac;rate=16000&language=<LANGUAGE_CODE>
Metadata
Metadata to be associated with the job may be provided via the metadata query parameter.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Custom vocabulary
If your streaming session contains domain-specific words or phrases that may not be in our dictionary, you can improve the accuracy of your transcript by creating a custom vocabulary to include with your streaming session.
The custom vocabulary id can be included as the custom_vocabulary_id query parameter in the request to the Streaming Speech-to-Text API. The API will then recognize your custom words and phrases whenever they are spoken.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-wav&custom_vocabulary_id=cv5ZqltdU0lFA4
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
attention
Learn more about using a custom vocabulary.
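If you create vocabularies programmatically, the id comes from Rev AI's separate Custom Vocabularies API. A hedged Python sketch, assuming the requests package and the /speechtotext/v1/vocabularies endpoint described in that separate documentation; the phrases are illustrative:

# Hedged sketch: create a custom vocabulary via the Custom Vocabularies API
# (a separate endpoint; see the custom vocabulary docs linked above), then
# pass the returned id as the custom_vocabulary_id query parameter.
import requests

resp = requests.post(
    "https://api.rev.ai/speechtotext/v1/vocabularies",
    headers={"Authorization": "Bearer <REVAI_ACCESS_TOKEN>"},
    json={"custom_vocabularies": [{"phrases": ["Rev AI", "Kubernetes"]}]},
)
resp.raise_for_status()
vocabulary_id = resp.json()["id"]  # e.g. "cv5ZqltdU0lFA4"
print(vocabulary_id)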
Profanity filter
Optionally, you can filter profanity by adding the filter_profanity=true query parameter to your streaming request. This setting is disabled by default. We filter approximately 600 profanities, which covers most use cases. If a transcribed word matches a word on this list, all characters of the word are replaced by asterisks except for the first and last character.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&filter_profanity=true
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Here is an example partial hypothesis response with filtered profanity:
{
"type": "partial",
"ts": 0.0,
"end_ts": 1.43,
"elements": [
{
"type": "text",
"value": "a*****e"
}
    ]
}
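The masking rule can be expressed directly. A small Python illustration; mask_profanity is a hypothetical helper name, and "airline" is a neutral stand-in with the same shape as the sample output:

# Illustration of the masking rule described above: every character except
# the first and last is replaced with an asterisk.
def mask_profanity(word: str) -> str:
    if len(word) <= 2:
        return word
    return word[0] + "*" * (len(word) - 2) + word[-1]

print(mask_profanity("airline"))  # -> "a*****e", same shape as the sample above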
Disfluencies
You can choose to remove disfluencies from the resulting transcript by adding the remove_disfluencies=true query parameter to your streaming request. This setting is optional and defaults to false.
attention
Currently, disfluencies are defined as 'ums' and 'uhs'. When remove_disfluencies is set to true, disfluencies will not appear in either partial or final hypotheses.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&remove_disfluencies=true
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Here is an example partial hypothesis response with disfluencies:
{
"type": "partial",
"ts": 0.0,
"end_ts": 3.24,
"elements": [
{
"type": "text",
"value": "life"
},
{
"type": "text",
"value": "uh"
},
{
"type": "text",
"value": "finds"
},
{
"type": "text",
"value": "a"
},
{
"type": "text",
"value": "way"
}
    ]
}
Here is the partial hypothesis response for the same audio without disfluencies:
{
"type": "partial",
"ts": 0.0,
"end_ts": 3.24,
"elements": [
{
"type": "text",
"value": "life"
},
{
"type": "text",
"value": "finds"
},
{
"type": "text",
"value": "a"
},
{
"type": "text",
"value": "way"
}
    ]
}
Delete after seconds
You can specify how many seconds after completion the job should be deleted by adding the delete_after_seconds query parameter to your streaming request. The number of seconds provided must range from 0 to 2592000 (30 days, i.e. 30 × 24 × 60 × 60 seconds).
It may take up to 2 minutes after the scheduled time for the job to be deleted unless delete_after_seconds = 0. When delete_after_seconds = 0, the audio and transcript are deleted immediately and are never stored.
attention
The delete_after_seconds parameter is optional but, when provided, it overrides the auto-delete preference set on your Rev AI account.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&delete_after_seconds=SECONDS
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Detailed partials
You can choose to receive timestamps and confidence scores in the partial hypotheses by adding the query parameter detailed_partials=true. This setting is disabled by default.
The detailed_partials parameter exposes values that are otherwise only available in the final hypotheses. For example:
- Timestamps in partials can be used to display transcribed words earlier.
- Confidence scores offer the option to write custom logic for displaying the transcribed words (see the sketch after the example responses below).
warning
When using detailed_partials, you can expect a slight degradation of 1% in word error rate for the final hypothesis.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&detailed_partials=true
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Here is an example partial hypothesis response without additional details:
{
"type": "partial",
"ts": 0.0,
"end_ts": 2.18,
"elements": [
{
"type": "text",
"value": "life"
},
{
"type": "text",
"value": "begins"
}
]
}
Here is an example partial hypothesis response for the same audio with additional details:
{
"type": "partial",
"ts": 0.0,
"end_ts": 2.18,
"elements": [
{
"type": "text",
"value": "life",
"ts": 0.0,
"end_ts": 1.5,
"confidence": 0.83
},
{
"type": "text",
"value": "begins",
"ts": 1.5,
"end_ts": 1.98,
"confidence": 0.7
}
]
}
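To illustrate the custom display logic mentioned above, here is a Python sketch that renders only words clearing a confidence threshold; the threshold value and helper name are assumptions:

# Sketch of custom display logic enabled by detailed_partials: only render
# words whose confidence clears a threshold (0.75 is an arbitrary choice).
CONFIDENCE_THRESHOLD = 0.75

def confident_words(hypothesis: dict, threshold: float = CONFIDENCE_THRESHOLD) -> list[str]:
    return [
        el["value"]
        for el in hypothesis["elements"]
        if el["type"] == "text" and el.get("confidence", 0.0) >= threshold
    ]

partial = {
    "type": "partial",
    "ts": 0.0,
    "end_ts": 2.18,
    "elements": [
        {"type": "text", "value": "life", "ts": 0.0, "end_ts": 1.5, "confidence": 0.83},
        {"type": "text", "value": "begins", "ts": 1.5, "end_ts": 1.98, "confidence": 0.7},
    ],
}
print(confident_words(partial))  # ['life'] -- "begins" (0.7) is held back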
Start timestamp
You can provide a starting timestamp to offset all hypothesis timings by adding start_ts as a query parameter to the request. If provided, all output hypotheses will have their ts and end_ts values offset by the number of seconds provided in the start_ts parameter.
The start_ts parameter must be a positive double value.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&start_ts=60.5
# RTMP streams
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Here is an example response without a specified start timestamp:
{
"type": "final",
"ts": 1.01,
"end_ts": 3.2,
"elements": [
{
"type": "text",
"value": "One",
"ts": 1.04,
"end_ts": 1.55,
"confidence": 1.0
},
{
"type": "punct",
"value": " "
},
{
"type": "text",
"value": "two",
"ts": 1.84,
"end_ts": 2.15,
"confidence": 1.0
},
{
"type": "punct",
"value": "."
}
]
}
Here is an example response with the start timestamp set to 60.5:
{
"type": "final",
"ts": 61.51,
"end_ts": 63.7,
"elements": [
{
"type": "text",
"value": "One",
"ts": 61.54,
"end_ts": 62.05,
"confidence": 1.0
},
{
"type": "punct",
"value": " "
},
{
"type": "text",
"value": "two",
"ts": 62.34,
"end_ts": 62.65,
"confidence": 1.0
},
{
"type": "punct",
"value": "."
}
]
}
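Since start_ts is a plain offset, clients can shift the reported timings back (or apply their own offset) themselves. A Python sketch with a hypothetical helper:

# The start_ts offset uniformly shifts every ts/end_ts in the output, so the
# un-offset timings can be recovered by subtracting it again.
START_TS = 60.5

def remove_offset(hypothesis: dict, start_ts: float = START_TS) -> dict:
    shifted = {
        **hypothesis,
        "ts": hypothesis["ts"] - start_ts,
        "end_ts": hypothesis["end_ts"] - start_ts,
    }
    shifted["elements"] = [
        {**el, "ts": el["ts"] - start_ts, "end_ts": el["end_ts"] - start_ts}
        if el["type"] == "text" else el  # "punct" elements carry no timings
        for el in hypothesis["elements"]
    ]
    return shifted

# 61.51 - 60.5 == 1.01, matching the first example above.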
Request stages
RTMP streams
Initial connection
All requests begin as an HTTP POST request. An RTMP streaming session is requested with the user's access token as a Bearer authentication token. The request body is a JSON document with streaming job options.
Client --> Rev AI
POST /speechtotext/v1/live_stream/rtmp HTTP/1.1
Host: api.rev.ai
Accept: */*
Authorization: Bearer <REVAI_ACCESS_TOKEN>
Content-Type: application/json
Content-Length: 19
{
"metadata":"test",
"filter_profanity": "true",
"detailed_partials": "true",
"remove_disfluencies": "true"
}
If authorization is successful, the client receives a JSON response which looks like the example response below:
{
"ingestion_url":"rtmps://rtmp.rev.ai/streaming/v1",
"stream_name":"wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60",
"read_url":"wss://api.rev.ai/speechtotext/v1/read_stream?read_token=ZZ-132"
}
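The same request can be issued from Python. A minimal sketch, assuming the requests package; the access token is a placeholder:

# Sketch of the RTMP session request shown above, using "requests".
import requests

resp = requests.post(
    "https://api.rev.ai/speechtotext/v1/live_stream/rtmp",
    headers={"Authorization": "Bearer <REVAI_ACCESS_TOKEN>"},
    json={
        "metadata": "test",
        "filter_profanity": "true",
        "detailed_partials": "true",
        "remove_disfluencies": "true",
    },
)
resp.raise_for_status()
session = resp.json()
print(session["ingestion_url"], session["stream_name"], session["read_url"])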
WebSocket reader connection
A WebSocket connection must be opened to the read_url URL in order to receive streaming results. Once this connection is opened, results from your RTMP audio stream are generated in the same response format as for WebSocket streams.
warning
The WebSocket connection should be opened before the RTMP connection is opened in order to ensure no audio data is skipped.
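A minimal Python sketch of the reader connection, again assuming the websockets package and the session object from the sketch above:

# Sketch: open the read_url WebSocket (before starting the RTMP stream) and
# print hypotheses as they arrive.
import asyncio
import json
import websockets

async def read_results(read_url: str) -> None:
    async with websockets.connect(read_url) as ws:
        async for message in ws:
            print(json.loads(message))

asyncio.run(read_results(session["read_url"]))  # "session" from the sketch above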
RTMP audio submission
The RTMP audio can now be streamed to the ingestion_url endpoint provided in the initial response, using the provided stream_name as the stream name for that session.
For example, to stream RTMP audio using ffmpeg, use the following command, replacing the <AUDIO_FILE> placeholder with the path to your audio file and the <INGESTION_URL> and <STREAM_NAME> placeholders with the values from the initial JSON response.
ffmpeg -re -i <AUDIO_FILE> -c copy -f flv <INGESTION_URL>/<STREAM_NAME>
attention
When using ffmpeg, escape the & characters within the <STREAM_NAME> placeholder by placing a backslash (\) before each of them.
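Alternatively, launching ffmpeg without a shell avoids the escaping issue entirely. A Python sketch, assuming the session object from the earlier RTMP sketch; <AUDIO_FILE> remains a placeholder:

# No shell parses the argument list here, so the "&" characters in
# stream_name need no escaping.
import subprocess

subprocess.run(
    ["ffmpeg", "-re", "-i", "<AUDIO_FILE>", "-c", "copy", "-f", "flv",
     f"{session['ingestion_url']}/{session['stream_name']}"],
    check=True,
)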
WebSocket protocol
Initial connection
All requests begin as an HTTP GET request. A WebSocket request is declared by including the header values Upgrade: websocket and Connection: Upgrade.
Client --> Rev AI
GET /speechtotext/v1/stream HTTP/1.1
Host: api.rev.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: Chxzu/uTUCmjkFH9d/8NTg==
Sec-WebSocket-Version: 13
Origin: http://api.rev.ai
If authorization is successful, the request is upgraded to a WebSocket connection.
Client <-- Rev AI
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: z0pcAwXZZRVlMcca8lmHCPzvrKU=
After the connection has been upgraded, the server returns a "connected" message. You must wait for this connected message before sending binary audio data. The response includes an id, which is the corresponding job identifier, as shown in the example below:
{
"type": "connected",
"id": s1d24ax2fd21
}
warning
If Rev AI currently does not have the capacity to handle the request, a WebSocket close message is returned with a status code of 4013. An HTTP/1.1 400 Bad Request response indicates that the request is not a WebSocket upgrade request.
Audio submission
WebSocket messages sent to Rev AI must be of one of these two WebSocket message types:
Message type | Message requirements | Notes |
---|---|---|
Binary | Audio data is transmitted as binary data and should be sent in chunks of 250ms or more. Streams sending audio chunks that are less than 250ms in size may experience increased transcription latency. | The format of the audio must match that specified in the content_type parameter. |
Text | The client should send an End-of-Stream ("EOS") text message to signal the end of audio data and thus gracefully close the WebSocket connection. On an EOS message, Rev AI returns a final hypothesis along with a WebSocket close message. | Currently, only this one text message type is supported. WebSocket close type messages are explicitly not supported as a message type and will abruptly close the socket connection with a 1007 Invalid Payload error; clients will not receive their final hypothesis in this case. Any other text messages, including incorrectly capitalized messages such as "eos" and "Eos", are invalid and will also close the socket connection with a 1007 Invalid Payload error. |
attention
See examples of the sequence of messages between a client and the Streaming Speech-to-Text API.