Overview
There are two ways to interact with the Streaming Speech-to-Text API:
- WebSocket protocol
- RTMP streams
attention
For a non-streaming solution, refer to the Asynchronous Speech-to-Text API documentation.
WebSocket protocol
API endpoint
All connections to Rev AI's Streaming Speech-to-Text API start as a WebSocket handshake HTTP request to wss://api.rev.ai/speechtotext/v1/stream. On successful authorization, the client can start sending binary WebSocket messages containing audio data in one of the supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
warning
The base URL is different from the base URL for the Asynchronous Speech-to-Text API.
Example
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
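The example URL above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_stream_url` is hypothetical, and leaving `;`, `=` and `/` unescaped in parameter values is an assumption made to mirror the example URL rather than a documented requirement.

```python
from urllib.parse import quote, urlencode

STREAM_URL = "wss://api.rev.ai/speechtotext/v1/stream"


def _to_text(value):
    # Booleans are rendered in lowercase ("true"/"false") instead of Python's "True"/"False".
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_stream_url(access_token, content_type, **options):
    """Hypothetical helper: build the streaming URL from the two required parameters plus options."""
    params = {"access_token": access_token, "content_type": content_type}
    # Optional parameters (language, metadata, ...) are skipped when set to None.
    params.update({key: _to_text(val) for key, val in options.items() if val is not None})
    # Assumption: ';', '=' and '/' stay literal so the content type matches the example above.
    query = urlencode(params, quote_via=quote, safe=";=/")
    return f"{STREAM_URL}?{query}"


if __name__ == "__main__":
    print(build_stream_url(
        "<REVAI_ACCESS_TOKEN>",
        "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
        language="en",
    ))
```

Any optional parameter from the request table can be passed as a keyword argument, for example `filter_profanity=True`.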
Requests
A WebSocket request consists of the following parts:
Request parameter | Name | Required | Default |
---|---|---|---|
Base URL | | Yes | |
Access token | access_token | Yes | None |
Content type | content_type | Yes | None |
Language | language | No | en |
Metadata | metadata | No | None |
Custom vocabulary | custom_vocabulary_id | No | None |
Profanity filter | filter_profanity | No | false |
Disfluencies | remove_disfluencies | No | false |
Delete after seconds | delete_after_seconds | No | None |
Detailed partials | detailed_partials | No | false |
Start timestamp | start_ts | No | None |
Learn more about request parameters.
Responses
All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis. The JSON contains a type property that indicates what kind of response the message is.
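A client typically branches on the type property of each incoming text message. The sketch below shows one way to do that; the function name `route_message` and the handler dictionary are hypothetical, and no fields other than type are assumed to exist in the payload.

```python
import json


def route_message(raw, handlers):
    """Parse one text message and pass it to the handler registered for its type.

    Returns the message type if a handler ran, or None if no handler was registered.
    """
    message = json.loads(raw)
    kind = message.get("type")
    handler = handlers.get(kind)
    if handler is None:
        return None
    handler(message)
    return kind


if __name__ == "__main__":
    finals = []
    handlers = {
        "partial": lambda msg: None,  # partial hypotheses may be revised later
        "final": finals.append,       # final hypotheses are collected
    }
    route_message('{"type": "final"}', handlers)
    print(len(finals))
```

Registering separate handlers keeps the partial and final paths apart, which is useful because only final hypotheses are usually stored.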
API limits
The following limits are in place for the Streaming Speech-to-Text API:
- Streaming concurrency limit is 10.
- Time limit per stream is 3 hours.
When your stream approaches the 3-hour limit, you should initialize a new concurrent WebSocket connection. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new WebSocket and begin streaming audio to it.
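The handover near the 3-hour limit can be reduced to two checks: when to open the replacement connection, and when it is safe to switch. The following is a sketch only; the function names and the 5-minute safety margin are assumptions, not values from this documentation.

```python
import json

STREAM_LIMIT_SECONDS = 3 * 60 * 60  # time limit per stream: 3 hours


def should_open_replacement(elapsed_seconds, margin_seconds=300):
    """True once the stream is within margin_seconds of the limit (the margin is an assumed value)."""
    return elapsed_seconds >= STREAM_LIMIT_SECONDS - margin_seconds


def is_connected_message(raw):
    """True if a text message from the new socket is the "connected" type message."""
    return json.loads(raw).get("type") == "connected"


if __name__ == "__main__":
    print(should_open_replacement(2 * 60 * 60))       # two hours in
    print(should_open_replacement(3 * 60 * 60 - 60))  # one minute before the limit
    print(is_connected_message('{"type": "connected"}'))
```

Audio should keep flowing to the old socket until `is_connected_message` returns true for the new one, so that no audio is dropped during the switch.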
attention
The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.
RTMP streams
API endpoint
warning
The base URL is different from the base URL for the Asynchronous Speech-to-Text API.
All Real-Time Messaging Protocol (RTMP) streaming connections to Rev AI's Streaming Speech-to-Text API start as a POST HTTP request to https://api.rev.ai/speechtotext/v1/live_stream/rtmp with the user's access token as a Bearer authentication token. Users should include their intended job options in this HTTP POST request.
On successful authorization, the API returns a JSON object containing read_url and ingestion_url URL endpoints and a stream_name value for the session. The ingestion_url URL will have the correct query parameters and values for the job as specified by the user.
The client can now make a WebSocket connection to the read_url to receive streaming results and then begin streaming audio to the RTMP ingestion_url provided in the response, using the provided stream_name as the stream name for that session. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
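The session setup described above can be sketched with the Python standard library. The helper names are hypothetical, and sending the job options as a JSON request body is an assumption; the source only states that the options belong in the POST request. The request is built but not sent.

```python
import json
from urllib.request import Request

RTMP_URL = "https://api.rev.ai/speechtotext/v1/live_stream/rtmp"


def build_rtmp_request(access_token, job_options):
    """Build (but do not send) the POST request that starts an RTMP session."""
    # Assumption: job options are sent as a JSON body.
    body = json.dumps(job_options).encode("utf-8")
    return Request(
        RTMP_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def parse_rtmp_session(response_body):
    """Extract the three session values named above from the JSON response."""
    payload = json.loads(response_body)
    return payload["read_url"], payload["ingestion_url"], payload["stream_name"]


if __name__ == "__main__":
    request = build_rtmp_request("<REVAI_ACCESS_TOKEN>", {"language": "en"})
    print(request.get_method(), request.full_url)
```

The request can be sent with `urllib.request.urlopen(request)`, after which the response body is passed to `parse_rtmp_session`.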
Example
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Requests
A WebSocket request consists of the following parts:
Request parameter | Name | Required | Default |
---|---|---|---|
read_url URL | | Yes | |
Access token | access_token | Yes | None |
Content type | content_type | Yes | None |
Language | language | No | en |
Metadata | metadata | No | None |
Custom vocabulary | custom_vocabulary_id | No | None |
Profanity filter | filter_profanity | No | false |
Disfluencies | remove_disfluencies | No | false |
Delete after seconds | delete_after_seconds | No | None |
Detailed partials | detailed_partials | No | false |
Start timestamp | start_ts | No | None |
Learn more about request parameters.
Responses
All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis. The JSON contains a type property that indicates what kind of response the message is.
API limits
The following limits are in place for the Streaming Speech-to-Text API:
- Streaming concurrency limit is 10.
- Time limit per stream is 3 hours.
When your stream approaches the 3-hour limit, you must request and obtain a new ingestion_url, read_url and stream_name. You should then initialize a new concurrent WebSocket connection to the new read_url endpoint. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new ingestion_url endpoint and begin streaming RTMP audio to it.
attention
The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.
Formats
Although the Streaming Speech-to-Text API technically supports all the formats supported by FFmpeg, it is recommended to send audio streams as raw audio, FLAC or WAV, because other formats can result in slightly increased latency and inconsistent results.
Error codes
WebSocket close messages carry a status code that signals why the socket connection was closed. See RFC 6455 for the list of pre-defined codes.
In addition to these error codes, the following table defines Rev AI custom error codes in the 4xxx range. Some errors can be resolved simply by retrying the request. The table indicates which errors are likely to be resolved with successive retries.
Error Code | Description | Retry? |
---|---|---|
4001 | Unauthorized. Returned when the provided access token is invalid. | No |
4002 | Bad request. Returned when the connection’s content-type is invalid, metadata contains too many characters, or the custom vocabulary does not exist with that id. | No |
4003 | Insufficient credits. Returned when the client does not have enough credits to continue the streaming session. | No |
4010 | Server shutting down. The connection was terminated due to the server shutting down. | Yes |
4013 | No instance available. No available streaming instances were found. User should attempt to retry the connection later. | Yes |
4029 | Too many requests. The number of concurrent connections exceeded the limit. Contact customer support to increase it. | No |
attention
It is recommended that the maximum number of retries be limited to 5 attempts per request.
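The retry guidance in the table and the 5-attempt recommendation can be combined into a single check. This is a minimal sketch; the names `RETRYABLE_CLOSE_CODES` and `should_retry` are hypothetical, and any delay between attempts is left to the caller.

```python
# Close codes marked "Yes" in the Retry? column of the table above.
RETRYABLE_CLOSE_CODES = {4010, 4013}

# Recommended maximum number of retries per request.
MAX_RETRIES = 5


def should_retry(close_code, retries_so_far):
    """True if the close code is retryable and fewer than MAX_RETRIES retries were made."""
    return close_code in RETRYABLE_CLOSE_CODES and retries_so_far < MAX_RETRIES


if __name__ == "__main__":
    print(should_retry(4013, 0))  # no instance available, first retry
    print(should_retry(4001, 0))  # unauthorized is not retryable
    print(should_retry(4010, 5))  # retry budget exhausted
```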
Billing
For billing purposes, streams are charged per second, with a minimum charge of 15 seconds.
Here are some examples:
- 4-second streams are charged as 15 seconds.
- 14.1-second streams are charged as 15 seconds.
- 15-second streams are charged as 15 seconds.
- 16-second streams are charged as 16 seconds.
- 16.1-second streams are charged as 17 seconds.
- 22.7-second streams are charged as 23 seconds.
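The examples above are consistent with rounding the stream duration up to the next whole second and applying a 15-second minimum. The rounding rule is inferred from the examples rather than stated explicitly, and the function name `billable_seconds` is hypothetical.

```python
import math

MINIMUM_BILLED_SECONDS = 15


def billable_seconds(duration_seconds):
    """Round the duration up to a whole second, then apply the 15-second minimum."""
    return max(MINIMUM_BILLED_SECONDS, math.ceil(duration_seconds))


if __name__ == "__main__":
    for duration in (4, 14.1, 15, 16, 16.1, 22.7):
        print(duration, billable_seconds(duration))
```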