Overview
There are two ways to interact with the Streaming Speech-to-Text API:
- WebSocket protocol
- RTMP streams
Attention: For a non-streaming solution, refer to the Asynchronous Speech-to-Text API documentation.
WebSocket protocol
API endpoint
All connections to Rev AI's Streaming Speech-to-Text API start as a WebSocket handshake HTTP request to wss://api.rev.ai/speechtotext/v1/stream. On successful authorization, the client can start sending binary WebSocket messages containing audio data in one of the supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
Warning: The base URL is different from the base URL for the Asynchronous Speech-to-Text API.
Example
wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
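The connection URL in the example can be assembled programmatically. The following is a minimal sketch, not an official client: the helper name `build_stream_url` is hypothetical, and leaving `;`, `=` and `/` unescaped in the content type is an assumption made only to mirror the example URL shown here.

```python
from urllib.parse import urlencode

# Base URL taken from the documentation above.
BASE_URL = "wss://api.rev.ai/speechtotext/v1/stream"


def build_stream_url(access_token, content_type, **options):
    """Build a streaming URL (hypothetical helper, not part of any Rev AI SDK).

    Optional parameters such as language or metadata are passed as keyword
    arguments; options set to None are left out of the query string.
    """
    params = {"access_token": access_token, "content_type": content_type}
    params.update({key: value for key, value in options.items() if value is not None})
    # Assumption: keep ';', '=' and '/' literal so the content type reads
    # like the documented example; all other reserved characters are escaped.
    return BASE_URL + "?" + urlencode(params, safe=";=/")


url = build_stream_url(
    "<REVAI_ACCESS_TOKEN>",
    "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
    metadata="demo",
)
print(url)
```

The resulting string can be handed to any WebSocket client library to open the connection.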
Requests
A WebSocket request consists of the following parts:
Request parameter | Name | Required | Default |
---|---|---|---|
Base URL | | Yes | |
Access token | access_token | Yes | None |
Content type | content_type | Yes | None |
Language | language | No | en |
Metadata | metadata | No | None |
Custom vocabulary | custom_vocabulary_id | No | None |
Profanity filter | filter_profanity | No | false |
Disfluencies | remove_disfluencies | No | false |
Delete after seconds | delete_after_seconds | No | None |
Detailed partials | detailed_partials | No | false |
Start timestamp | start_ts | No | None |
Maximum Segment Duration Seconds | max_segment_duration_seconds | No | None |
Transcriber | transcriber | No | See transcriber section |
Speaker Switch | enable_speaker_switch | No | false |
Skip Post-processing | skip_postprocessing | No | false |
Priority | priority | No | speed |
Maximum wait time for connection | max_connection_wait_seconds | No | 60 |
Learn more about request parameters.
Responses
All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis. The JSON will contain a type property which indicates what kind of response the message is.
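A client typically branches on the type property of each incoming text message. The sketch below illustrates one way to do that; the class name `TranscriptCollector` is hypothetical, and the literal type values "connected", "partial" and "final" are assumptions inferred from the wording on this page rather than a confirmed schema. No assumption is made about the other fields in each message, so the whole decoded object is kept.

```python
import json


class TranscriptCollector:
    """Tracks streaming responses by their `type` property (illustrative only)."""

    def __init__(self):
        self.connected = False
        self.latest_partial = None  # most recent partial hypothesis, if any
        self.finals = []            # final hypotheses in arrival order

    def handle(self, raw_text):
        """Decode one text message and record it according to its type."""
        message = json.loads(raw_text)
        kind = message.get("type")
        if kind == "connected":      # assumed value
            self.connected = True
        elif kind == "partial":      # assumed value
            self.latest_partial = message
        elif kind == "final":        # assumed value
            self.finals.append(message)
            self.latest_partial = None  # a final supersedes the pending partial
        return kind


collector = TranscriptCollector()
collector.handle('{"type": "connected"}')
collector.handle('{"type": "partial"}')
collector.handle('{"type": "final"}')
print(collector.connected, len(collector.finals))
```

Messages with an unrecognized type are decoded and returned but otherwise ignored, which keeps the handler tolerant of message kinds not covered here.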
API limits
The following limits are in place for the Streaming Speech-to-Text API:
- Streaming concurrency limit is 10.
- Time limit per stream is 3 hours.
When your stream approaches the 3-hour limit, you should initialize a new concurrent WebSocket connection. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new WebSocket and begin streaming audio to it.
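The hand-over described above can be reduced to a small decision function. This is a sketch under stated assumptions: the function name `next_action`, the action labels, and the 5-minute safety margin are all hypothetical choices, since this page does not say how close to the limit a new connection should be opened.

```python
# 3-hour limit per stream, as listed in the API limits.
STREAM_LIMIT_SECONDS = 3 * 60 * 60


def next_action(elapsed_seconds, replacement_connected, margin_seconds=300):
    """Decide what to do with a long-running stream.

    elapsed_seconds: time since the current connection was established.
    replacement_connected: True once the new socket has delivered its
        "connected" message.
    margin_seconds: assumed safety margin before the limit (not from the docs).
    """
    if elapsed_seconds < STREAM_LIMIT_SECONDS - margin_seconds:
        return "keep_streaming"
    if not replacement_connected:
        # Open the second connection but keep sending audio on the old one.
        return "open_replacement"
    return "switch_to_replacement"


print(next_action(60, False))
print(next_action(STREAM_LIMIT_SECONDS - 120, False))
print(next_action(STREAM_LIMIT_SECONDS - 120, True))
```

Because the old connection stays open until the replacement is ready, both count toward the concurrency limit for a short time.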
Attention: The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.
RTMP streams
Warning: RTMP streams are not supported in a HIPAA context.
API endpoint
Warning: The base URL is different from the base URL for the Asynchronous Speech-to-Text API.
All Real-Time Messaging Protocol (RTMP) streaming connections to Rev AI's Streaming Speech-to-Text API start as a POST HTTP request to https://api.rev.ai/speechtotext/v1/live_stream/rtmp with the user's access token as a Bearer authentication token. Users should include their intended job options in this HTTP POST request.
On successful authorization, the API returns a JSON object containing read_url and ingestion_url URL endpoints and a stream_name value for the session. The ingestion_url URL will have the correct query parameters and values for the job as specified by the user.
The client can now make a WebSocket connection to the read_url to receive streaming results and then begin streaming audio to the RTMP ingestion_url provided in the response using the provided stream_name as the stream name for that session. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
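The first step of this flow, the authenticated POST, might be prepared as in the sketch below. It only builds the request and parses a response; it does not contact the API. The helper names are hypothetical, and sending the job options as a JSON body is an assumption: this page says only that the options belong in the POST request, not how they are encoded.

```python
import json
import urllib.request

# Endpoint taken from the documentation above.
RTMP_ENDPOINT = "https://api.rev.ai/speechtotext/v1/live_stream/rtmp"


def build_rtmp_session_request(access_token, job_options):
    """Prepare (but do not send) the POST that starts an RTMP session."""
    # Assumption: job options are encoded as a JSON request body.
    body = json.dumps(job_options).encode("utf-8")
    return urllib.request.Request(
        RTMP_ENDPOINT,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


def parse_rtmp_session(response_text):
    """Extract the three documented fields from the JSON response."""
    payload = json.loads(response_text)
    return {key: payload[key] for key in ("read_url", "ingestion_url", "stream_name")}


request = build_rtmp_session_request("<REVAI_ACCESS_TOKEN>", {"language": "en"})
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would send it; the response body can then go through `parse_rtmp_session` to obtain the URLs for the WebSocket reader and the RTMP encoder.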
Example
wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>
Requests
A WebSocket request consists of the following parts:
Request parameter | Name | Required | Default |
---|---|---|---|
read_url URL | | Yes | |
Access token | access_token | Yes | None |
Content type | content_type | Yes | None |
Language | language | No | en |
Metadata | metadata | No | None |
Custom vocabulary | custom_vocabulary_id | No | None |
Profanity filter | filter_profanity | No | false |
Disfluencies | remove_disfluencies | No | false |
Delete after seconds | delete_after_seconds | No | None |
Detailed partials | detailed_partials | No | false |
Start timestamp | start_ts | No | None |
Maximum Segment Duration Seconds | max_segment_duration_seconds | No | None |
Transcriber | transcriber | No | See transcriber section |
Speaker Switch | enable_speaker_switch | No | false |
Skip Post-processing | skip_postprocessing | No | false |
Priority | priority | No | speed |
Maximum wait time for connection | max_connection_wait_seconds | No | 60 |
Learn more about request parameters.
Responses
All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis. The JSON will contain a type property which indicates what kind of response the message is.
API limits
The following limits are in place for the Streaming Speech-to-Text API:
- Streaming concurrency limit is 10.
- Time limit per stream is 3 hours.
When your stream approaches the 3-hour limit, you must request and obtain a new ingestion_url, read_url and stream_name. You should then initialize a new concurrent WebSocket connection to the new read_url endpoint. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new ingestion_url endpoint and begin streaming RTMP audio to it.
Attention: The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.
Formats
Although the Streaming Speech-to-Text API technically supports all the formats supported by FFmpeg, it is recommended to send audio streams as raw audio, FLAC, or WAV, because other formats can result in slightly increased latency and inconsistent results.
HIPAA compliance
The API supports HIPAA-compliant processing. However, this feature is not activated by default and must be explicitly activated at account level. Learn more about Rev AI's HIPAA compliance and how to HIPAA-enable a Rev AI user account.
The API has the following limitations in a HIPAA context:
- RTMP streams are not supported.
Error codes
WebSocket close messages have a range of default error codes that signal why the socket connection was closed. See RFC 6455 for the pre-defined error codes.
In addition to these error codes, the following table defines Rev AI custom error codes in the 4xxx range. Some errors can be resolved simply by retrying the request. The table indicates which errors are likely to be resolved with successive retries.
Error Code | Description | Retry? |
---|---|---|
4001 | Unauthorized. Returned when the provided access token is invalid. | No |
4002 | Bad request. Returned when the connection’s content-type is invalid, the metadata contains too many characters, or no custom vocabulary exists with that id. | No |
4003 | Insufficient credits. Returned when the client does not have enough credits to continue the streaming session. | No |
4010 | Server shutting down. The connection was terminated due to the server shutting down. | Yes |
4013 | No instance available. No available streaming instances were found. User should attempt to retry the connection later. | Yes |
4029 | Too many requests. The number of concurrent connections exceeded the limit. Contact customer support to increase it. | No |
Attention: It is recommended that the maximum number of retries be limited to 5 attempts per request.
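The retry guidance in the table can be captured in a small lookup. This is a minimal sketch: the names `REVAI_CLOSE_CODES` and `should_retry` are hypothetical, and the function only decides whether to retry; how long to wait between attempts is not specified on this page and is left to the caller.

```python
# Custom close codes and their retry guidance, copied from the table above.
REVAI_CLOSE_CODES = {
    4001: ("Unauthorized", False),
    4002: ("Bad request", False),
    4003: ("Insufficient credits", False),
    4010: ("Server shutting down", True),
    4013: ("No instance available", True),
    4029: ("Too many requests", False),
}

# Recommended cap on retries per request.
MAX_RETRIES = 5


def should_retry(close_code, retries_so_far):
    """Return True if reconnecting is worthwhile for this close code.

    Codes outside the table (for example standard RFC 6455 codes) are
    treated as non-retryable here, which is an assumption of this sketch.
    """
    _, retryable = REVAI_CLOSE_CODES.get(close_code, ("Unknown", False))
    return retryable and retries_so_far < MAX_RETRIES


print(should_retry(4013, 0))  # retryable code, no retries used yet
print(should_retry(4001, 0))  # invalid token: retrying will not help
print(should_retry(4010, 5))  # retry budget exhausted
```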
Billing
For billing purposes we track two values during each stream: stream duration and audio duration. At the end of each stream you will be charged for the larger of the two, rounded up to the nearest second, with an absolute minimum of 15 seconds.
Audio duration (AD) refers to the number of seconds of audio that have been sent over the WebSocket. Stream duration (SD) refers to the number of real-world seconds that have passed since the WebSocket connection was established.
Here are some examples:
- AD: 4.1 seconds, SD: 4.1 seconds. Charged as 15 seconds.
- AD: 14.1 seconds, SD: 14.1 seconds. Charged as 15 seconds.
- AD: 15 seconds, SD: 15 seconds. Charged as 15 seconds.
- AD: 15 seconds, SD: 16 seconds. Charged as 16 seconds.
- AD: 16.1 seconds, SD: 16.1 seconds. Charged as 17 seconds.
- AD: 24.7 seconds, SD: 14 seconds. Charged as 25 seconds.
- AD: 14 seconds, SD: 24 seconds. Charged as 24 seconds.
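The billing rule can be written as a one-line calculation: take the larger of the two durations, round it up to a whole second, and apply the 15-second minimum. The function name `billed_seconds` is hypothetical, and the sketch is for estimating charges only; it is not Rev AI's billing code.

```python
import math

# Absolute minimum charge per stream, in seconds.
MINIMUM_BILLED_SECONDS = 15


def billed_seconds(audio_duration, stream_duration):
    """Estimate the billed duration of one stream, in whole seconds."""
    longer = max(audio_duration, stream_duration)
    return max(math.ceil(longer), MINIMUM_BILLED_SECONDS)


print(billed_seconds(4.1, 4.1))    # below the minimum, so 15
print(billed_seconds(15, 16))      # stream duration is larger, so 16
print(billed_seconds(16.1, 16.1))  # rounded up, so 17
print(billed_seconds(14, 24))      # stream duration is larger, so 24
```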