Requests
There are two ways to interact with the Streaming Speech-to-Text API:
- WebSocket protocol
- RTMP streams
All connections to this API start as either a WebSocket handshake HTTP request (for WebSocket streams) or an HTTP POST request (for RTMP streams).
On successful authorization, the client can start sending binary WebSocket messages containing audio data or an RTMP audio stream in one of our supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
Request parameters
A WebSocket request to the Streaming Speech-to-Text API consists of the following parts:
Request parameter | Query parameter | Required | Default
---|---|---|---
Base URL (WebSocket) or `read_url` URL (RTMP) | | Yes |
Access Token | `access_token` | Yes | None
Content Type | `content_type` | Yes | None
Language | `language` | No | `en`
Metadata | `metadata` | No | None
Custom Vocabulary | `custom_vocabulary_id` | No | None
Profanity Filter | `filter_profanity` | No | `false`
Disfluencies | `remove_disfluencies` | No | `false`
Delete After Seconds | `delete_after_seconds` | No | None
Detailed Partials | `detailed_partials` | No | `false`
Start Timestamp | `start_ts` | No | None
Maximum segment duration seconds | `max_segment_duration_seconds` | No | None
Transcriber | `transcriber` | No | See transcriber section
Speaker switch detection | `enable_speaker_switch` | No | `false`
Skip Post-processing | `skip_postprocessing` | No | `false`
Priority | `priority` | No | `speed`
Maximum wait time for connection | `max_connection_wait_seconds` | No | `60`
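The full request URL is the base URL with these parameters appended as a query string. Here is a minimal sketch of assembling one in Python; the parameter values are illustrative and `<REVAI_ACCESS_TOKEN>` is a placeholder for your token.

# Python sketch: assembling a streaming request URL
from urllib.parse import quote

base_url = "wss://api.rev.ai/speechtotext/v1/stream"

# content_type carries its own semicolon-delimited sub-parameters for raw audio
params = {
    "access_token": "<REVAI_ACCESS_TOKEN>",
    "content_type": "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
    "metadata": quote("my first stream"),  # URL-encode free-form values
    "filter_profanity": "true",
}

url = base_url + "?" + "&".join(f"{name}={value}" for name, value in params.items())
print(url)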
Access token
Clients must authenticate by including their Rev AI access token as a query parameter in their requests. If `access_token` is invalid or the query parameter is not present, the WebSocket connection will be closed with code `4001`.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1
Content type
All requests must also contain a `content_type` query parameter. The content type describes the format of the audio data being sent. If you are submitting raw audio, Rev AI requires extra parameters as shown below. If the content type is invalid or not set, the WebSocket connection is closed with a `4002` close code.
Rev AI officially supports these content types:
- `audio/x-raw` (has additional requirements)
- `audio/x-flac`
- `audio/x-wav`
RAW file content type
You are required to provide additional information in `content_type` when it is `audio/x-raw`.
Parameter (type) | Description | Allowed Values | Required |
---|---|---|---|
layout (string) | The layout of channels within a buffer. Possible values are "interleaved" (for LRLRLRLR) and "non-interleaved" (LLLLRRRR). Not case-sensitive | interleaved,non-interleaved | audio/x-raw only |
rate (int) | Sample rate of the audio bytes | Inclusive Range from 8000-48000Hz | audio/x-raw only |
format (string) | Format of the audio samples. Case-sensitive. See the Allowed Values column for valid values | List of valid formats | audio/x-raw only |
channels (int) | Number of audio channels that the audio samples contain | Inclusive range from 1-10 channels | audio/x-raw only |
These parameters follow the `content_type`, delimited by semicolons (`;`). Each parameter should be specified in the format `parameter_name=parameter_value`.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
Language
attention
Prices other than the default are set independently by language. Please refer to your contract for pricing information. If you are not a contract customer, pricing can be found here.
Specify the transcription language with the `language` query parameter. When the language is not provided, transcription defaults to English. The `language` query parameter cannot be used along with the following options: `filter_profanity`, `remove_disfluencies`, and `custom_vocabulary_id`.
Language | Language Code |
---|---|
English | en |
French | fr |
German | de |
Italian | it |
Japanese | ja |
Korean | ko |
Mandarin | cmn |
Portuguese | pt |
Spanish | es |
Additional requirements for content type:
- `content_type` must be `audio/x-raw` or `audio/x-flac`
- when providing raw audio, it must be formatted as `S16LE`
- `rate` must be included, regardless of `content_type`
Examples
# WebSocket protocol with audio/x-raw
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&language=<LANGUAGE_CODE>
# WebSocket protocol with audio/x-flac
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-flac;rate=16000&language=<LANGUAGE_CODE>
Metadata
Metadata to be associated with the job may be provided via the `metadata` query parameter.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
Custom vocabulary
If your streaming session contains domain-specific words or phrases that may not be in our dictionary, you can improve the accuracy of your transcript by creating a custom vocabulary to include with your streaming session.
The custom vocabulary id can be included as the `custom_vocabulary_id` query parameter in the request to the Streaming Speech-to-Text API. The API will then be able to recognize your custom vocabulary terms whenever they are spoken.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-wav&custom_vocabulary_id=cv5ZqltdU0lFA4
attention
Learn more about using a custom vocabulary.
Profanity filter
Optionally, you can filter profanity by adding the `filter_profanity=true` query parameter to your streaming request. This setting is disabled by default. We filter approximately 600 profanities, which covers most use cases. If a transcribed word matches a word on this list, all characters of the word will be replaced by asterisks except for the first and last character.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&filter_profanity=true
Here is an example partial hypothesis response with filtered profanity:
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 1.43,
  "elements": [
    {
      "type": "text",
      "value": "a*****e"
    }
  ]
}
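The masking rule can be reproduced with a few lines of Python. This is an illustration of the behavior described above, not Rev AI's actual implementation:

# Python sketch: the first-and-last-character masking rule
def mask_profanity(word: str) -> str:
    if len(word) <= 2:
        return word
    return word[0] + "*" * (len(word) - 2) + word[-1]

print(mask_profanity("profanity"))  # prints "p*******y"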
Disfluencies
You can choose to remove disfluencies from the resulting transcript by adding the `remove_disfluencies=true` query parameter to your streaming request. This setting is optional and defaults to `false`.
attention
Currently, disfluencies are defined as 'ums' and 'uhs'. When `remove_disfluencies` is set to `true`, disfluencies will not appear in either partial or final hypotheses.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&remove_disfluencies=true
Here is an example partial hypothesis response with disfluencies:
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "uh"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}
Here is the partial hypothesis response for the same audio without disfluencies:
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "finds"
    },
    {
      "type": "text",
      "value": "a"
    },
    {
      "type": "text",
      "value": "way"
    }
  ]
}
Delete after seconds
You can specify how many seconds after completion the job should be deleted by adding the `delete_after_seconds` query parameter to your streaming request. The number of seconds provided must range from `0` to `2592000` seconds (30 days).
It may take up to 2 minutes after the scheduled time for the job to be deleted unless `delete_after_seconds=0`. When `delete_after_seconds=0`, the audio and transcript are deleted immediately and are never stored.
attention
The `delete_after_seconds` parameter is optional but, when provided, it overrides the auto-delete preference set on your Rev AI account.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&delete_after_seconds=SECONDS
Detailed partials
You can choose to receive timestamps and confidence scores in the partial hypotheses by adding the query parameter `detailed_partials=true`. This setting is disabled by default.
The `detailed_partials` parameter enables usage of values that are otherwise only available in the final hypotheses. For example:
- Timestamps in partials can be used to display transcribed words earlier.
- Confidence scores offer the option to write custom logic for displaying the transcribed words (see the sketch after the examples below).
warning
When using `detailed_partials`, you can expect a slight degradation of 1% in word error rate for the final hypothesis.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&detailed_partials=true
Here is an example partial hypothesis response without additional details:
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life"
    },
    {
      "type": "text",
      "value": "begins"
    }
  ]
}
Here is an example partial hypothesis response for the same audio with additional details:
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    {
      "type": "text",
      "value": "life",
      "ts": 0.0,
      "end_ts": 1.5,
      "confidence": 0.83
    },
    {
      "type": "text",
      "value": "begins",
      "ts": 1.5,
      "end_ts": 1.98,
      "confidence": 0.7
    }
  ]
}
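As an example of the custom display logic mentioned above, here is a sketch that only renders words whose confidence clears a threshold; the threshold value is an arbitrary assumption:

# Python sketch: filtering detailed partials by confidence
CONFIDENCE_THRESHOLD = 0.75  # arbitrary; tune for your use case

def confident_words(hypothesis: dict) -> list[str]:
    return [
        element["value"]
        for element in hypothesis["elements"]
        if element["type"] == "text"
        and element.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD
    ]

partial = {
    "type": "partial",
    "ts": 0.0,
    "end_ts": 2.18,
    "elements": [
        {"type": "text", "value": "life", "ts": 0.0, "end_ts": 1.5, "confidence": 0.83},
        {"type": "text", "value": "begins", "ts": 1.5, "end_ts": 1.98, "confidence": 0.7},
    ],
}
print(" ".join(confident_words(partial)))  # prints "life"; 0.7 is below the threshold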
Start timestamp
You can provide a starting timestamp to offset all hypothesis timings by adding `start_ts` as a query parameter to the request. If provided, all output hypotheses will have their `ts` and `end_ts` values offset by the number of seconds provided in the `start_ts` parameter.
The `start_ts` parameter must be a positive double value.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&start_ts=60.5
Here is an example response without a specified start timestamp:
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}
Here is an example response with the start timestamp set to `60.5`:
{
  "type": "final",
  "ts": 61.51,
  "end_ts": 63.7,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 61.54,
      "end_ts": 62.05,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 62.34,
      "end_ts": 62.65,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}
Maximum segment duration seconds
You can provide the maximum time you are willing to wait between final hypotheses by adding the `max_segment_duration_seconds` query parameter to your streaming request. The number of seconds provided must range from `5` to `30` seconds.
attention
This parameter potentially changes the amount of context our engine has when creating final hypotheses and therefore has a minor effect on word error rate. Higher values correlate with fewer errors in transcription.
warning
The `max_segment_duration_seconds` value is not exact. Because words can fall on the boundary of final hypotheses, you should expect final hypotheses to be up to 0.5 seconds longer than your specified value. For example, if you specify `max_segment_duration_seconds=5`, final hypotheses may be up to 5.5 seconds in length.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_segment_duration_seconds=15
The output format doesn't change as a result of this parameter.
Transcriber
You can define a specific transcription model to use for your stream with the `transcriber` option.
The output format does not change, regardless of which transcriber is selected.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2
Speaker switch detection
warning
This option is only available for streams using the `transcriber=machine_v2` option.
You can choose to enable speaker switch detection by adding `enable_speaker_switch=true` to your request. Doing so will add a new field named `speaker_id` to final hypotheses. Whenever the system detects that the active speaker in the audio has changed, the `speaker_id` field will be incremented to indicate this change.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&enable_speaker_switch=true
Here is an example response with speaker switch detection enabled:
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "speaker_id": 1000,
  "elements": [
    {
      "type": "text",
      "value": "One",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "two",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": "."
    }
  ]
}
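One way to use the `speaker_id` field is to group final hypotheses by speaker. A minimal sketch, assuming the client has collected final hypotheses into a list:

# Python sketch: grouping final hypotheses by speaker_id
from collections import defaultdict

def group_by_speaker(finals: list[dict]) -> dict[int, list[str]]:
    by_speaker: dict[int, list[str]] = defaultdict(list)
    for final in finals:
        text = "".join(element["value"] for element in final["elements"])
        by_speaker[final["speaker_id"]].append(text)
    return dict(by_speaker)

finals = [
    {"speaker_id": 1000, "elements": [{"type": "text", "value": "One"}, {"type": "punct", "value": "."}]},
    {"speaker_id": 1001, "elements": [{"type": "text", "value": "Two"}, {"type": "punct", "value": "."}]},
]
print(group_by_speaker(finals))  # {1000: ['One.'], 1001: ['Two.']}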
Skip Post-processing
Only available for English and Spanish languages
You can choose to skip post-processing operations, such as inverse text normalization (ITN), casing, and punctuation, by adding `skip_postprocessing=true` to your request.
Doing so will result in a small decrease in latency; however, your final hypotheses will no longer contain capitalization, punctuation, or inverse text normalization (for example, `five hundred` will not be normalized to `500`).
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&skip_postprocessing=true
Here is an example response with `skip_postprocessing` enabled:
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    {
      "type": "text",
      "value": "five",
      "ts": 1.04,
      "end_ts": 1.55,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "hundred",
      "ts": 1.84,
      "end_ts": 2.15,
      "confidence": 1.0
    },
    {
      "type": "punct",
      "value": " "
    },
    {
      "type": "text",
      "value": "dollars",
      "ts": 2.23,
      "end_ts": 2.87,
      "confidence": 1.0
    }
  ]
}
Priority
Only available for English and Spanish languages
Only available for the `machine_v2` transcriber
You can configure what our engine should prioritize for your stream. The available choices are `speed` and `accuracy`.
- `speed` is the default and means the engine will prioritize creating partial and final hypotheses more frequently to reduce latency.
- `accuracy` will cause the engine to produce results less frequently but will greatly increase the accuracy of results.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&priority=accuracy
Maximum wait time for connection
You can provide the maximum time you are willing to wait for a WebSocket connection to become available by adding the `max_connection_wait_seconds` query parameter to your streaming request. The number of seconds provided must range from `60` to `600` seconds.
To handle the rare occasion that the API is unable to allocate a speech worker immediately, set this value to something greater than 60 seconds. This helps prioritize and assign requests to speech workers in the order in which they were received, and lets the service scale up faster to meet the total number of requested connections.
If this timeout is reached, the client will receive a WebSocket close message with error code `4013`. This indicates that the API is unable to service the connection within the specified wait time and that the client should try again later.
attention
An increased connection latency does not mean an increased partial latency. Once connected, partial latencies will behave as normal. The client can buffer audio data during the connection wait period and send it as a burst of audio data messages once the connection to a worker is established (sketched after the example below). Audio is processed faster than real time, so partials eventually catch up to the client in real time. The catch-up period depends on the amount of audio buffered, which is typically determined by the length of the connection wait.
Example
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_connection_wait_seconds=300
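The buffer-and-burst approach described in the note above can be sketched as follows. The three arguments are hypothetical stand-ins for your client's audio source and WebSocket connection:

# Python sketch: buffer audio while waiting for a worker, then burst it
from typing import Callable, Iterator

def buffer_then_burst(
    next_chunk: Iterator[bytes],           # yields successive 250ms audio chunks
    is_connected: Callable[[], bool],      # True once the "connected" message arrives
    send_binary: Callable[[bytes], None],  # sends one binary WebSocket message
) -> None:
    backlog: list[bytes] = []
    # Keep capturing while the API is still allocating a speech worker
    while not is_connected():
        backlog.append(next(next_chunk))
    # Burst the backlog; audio is processed faster than real time,
    # so partials eventually catch up to the live stream
    for chunk in backlog:
        send_binary(chunk)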
Request stages
RTMP streams
Initial connection
All requests begin as an HTTP `POST` request. An RTMP streaming session is requested with the user's access token as a Bearer authentication token. The request body is a JSON document with streaming job options.
Client --> Rev AI
POST /speechtotext/v1/live_stream/rtmp HTTP/1.1
Host: api.rev.ai
Accept: */*
Authorization: Bearer <REVAI_ACCESS_TOKEN>
Content-Type: application/json
Content-Length: <CONTENT_LENGTH>
{
  "metadata": "test",
  "filter_profanity": "true",
  "detailed_partials": "true",
  "remove_disfluencies": "true",
  "language": "fr",
  "transcriber": "machine",
  "max_segment_duration_seconds": 20,
  "custom_vocabulary_id": "cv5ZqltdU0lFA4",
  "enable_speaker_switch": false,
  "skip_postprocessing": false,
  "priority": "accuracy"
}
If authorization is successful, the client receives a JSON response like the example below:
{
  "ingestion_url": "rtmps://rtmp.rev.ai/streaming/v1",
  "stream_name": "wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60",
  "read_url": "wss://api.rev.ai/speechtotext/v1/read_stream?read_token=ZZ-132"
}
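For illustration, here is a sketch of the initial request in Python using the third-party `requests` library (an assumption; any HTTP client works):

# Python sketch: requesting an RTMP streaming session
import requests

response = requests.post(
    "https://api.rev.ai/speechtotext/v1/live_stream/rtmp",
    headers={"Authorization": "Bearer <REVAI_ACCESS_TOKEN>"},
    json={"metadata": "test"},
)
response.raise_for_status()
session = response.json()
print(session["ingestion_url"], session["stream_name"], session["read_url"])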
WebSocket reader connection
A WebSocket connection must be opened to the `read_url` URL in order to receive streaming results. Once this connection is opened, results from your RTMP audio stream will be generated in the same response format as for WebSocket streams.
warning
The WebSocket connection should be opened before the RTMP connection is opened in order to ensure no audio data is skipped.
RTMP audio submission
The RTMP audio can now be streamed to the `ingestion_url` endpoint provided in the initial response, using the provided `stream_name` as the stream name for that session.
For example, to stream RTMP audio using `ffmpeg`, use the following command, replacing the `<AUDIO_FILE>` placeholder with the path to your audio file and the `<INGESTION_URL>` and `<STREAM_NAME>` placeholders with the values from the initial JSON response.
ffmpeg -re -i <AUDIO_FILE> -c copy -f flv <INGESTION_URL>/<STREAM_NAME>
attention
When using `ffmpeg`, escape the `&` characters within the `<STREAM_NAME>` placeholder by placing a backslash (`\`) before each of them.
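If you launch `ffmpeg` from code instead of a shell, the escaping concern goes away. A sketch in Python, where `input.wav` is a hypothetical audio file and the other values come from the initial JSON response; passing the arguments as a list bypasses the shell, so the `&` characters need no backslashes:

# Python sketch: launching ffmpeg without shell escaping
import subprocess

ingestion_url = "rtmps://rtmp.rev.ai/streaming/v1"
stream_name = "wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60"

subprocess.run(
    ["ffmpeg", "-re", "-i", "input.wav", "-c", "copy", "-f", "flv",
     f"{ingestion_url}/{stream_name}"],
    check=True,
)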
WebSocket protocol
Initial connection
All requests begin as an HTTP `GET` request. A WebSocket request is declared by including the header values `Upgrade: websocket` and `Connection: Upgrade`.
Client --> Rev AI
GET /speechtotext/v1/stream HTTP/1.1
Host: api.rev.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: Chxzu/uTUCmjkFH9d/8NTg==
Sec-WebSocket-Version: 13
Origin: http://api.rev.ai
If authorization is successful, the request is upgraded to a WebSocket connection.
Client <-- Rev AI
HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: z0pcAwXZZRVlMcca8lmHCPzvrKU=
After the connection has been upgraded, the server will return a `"connected"` message. You must wait for this connected message before sending binary audio data. The response includes an `id`, which is the corresponding job identifier, as shown in the example below:
{
  "type": "connected",
  "id": "s1d24ax2fd21"
}
warning
If Rev AI currently does not have the capacity to handle the request, a WebSocket close message is returned with status code `4013`. An `HTTP/1.1 400 Bad Request` response indicates that the request is not a WebSocket upgrade request.
Audio submission
WebSocket messages sent to Rev AI must be of one of these two WebSocket message types:
Message type | Message requirements | Notes
---|---|---
Binary | Audio data is transmitted as binary data and should be sent in chunks of 250ms or more. Streams sending audio chunks smaller than 250ms may experience increased transcription latency. | The format of the audio must match that specified in the `content_type` parameter.
Text | The client should send an End-of-Stream (`"EOS"`) text message to signal the end of audio data and gracefully close the WebSocket connection. On an `EOS` message, Rev AI will return a final hypothesis along with a WebSocket close message. | Currently, only this one text message type is supported. WebSocket close messages are explicitly not supported as a message type and will abruptly close the socket connection with a `1007 Invalid Payload` error; clients will not receive their final hypothesis in this case. Any other text messages, including incorrectly capitalized messages such as `"eos"` and `"Eos"`, are invalid and will also close the socket connection with a `1007 Invalid Payload` error.
attention
See examples of the sequence of messages between a client and the Streaming Speech-to-Text API.
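To tie the stages together, here is a minimal end-to-end sketch using the third-party `websockets` library (an assumption; any WebSocket client works). `audio.raw` is a hypothetical file of S16LE 16kHz mono audio, and the pacing simulates a live source:

# Python sketch: connect, wait for "connected", stream audio, send EOS
import asyncio
import json
import websockets

URL = (
    "wss://api.rev.ai/speechtotext/v1/stream"
    "?access_token=<REVAI_ACCESS_TOKEN>"
    "&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1"
)
CHUNK_BYTES = 8000  # 250ms of 16kHz, 16-bit, mono audio

async def send_audio(ws) -> None:
    with open("audio.raw", "rb") as audio:
        while chunk := audio.read(CHUNK_BYTES):
            await ws.send(chunk)       # binary message: audio data
            await asyncio.sleep(0.25)  # pace at real time
    await ws.send("EOS")               # text message: graceful end of stream

async def receive_hypotheses(ws) -> None:
    async for message in ws:           # iterates until the server closes
        print(json.loads(message))     # partial and final hypotheses

async def stream() -> None:
    async with websockets.connect(URL) as ws:
        # Wait for the "connected" message before sending any audio
        connected = json.loads(await ws.recv())
        assert connected["type"] == "connected"
        await asyncio.gather(send_audio(ws), receive_hypotheses(ws))

asyncio.run(stream())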