Overview

There are two ways to interact with the Streaming Speech-to-Text API:

  • WebSocket protocol
  • RTMP streams

attention

For a non-streaming solution, refer to the Asynchronous Speech-to-Text API documentation.

WebSocket protocol

API endpoint

All connections to Rev AI's Streaming Speech-to-Text API start as a WebSocket handshake HTTP request to wss://api.rev.ai/speechtotext/v1/stream. On successful authorization, the client can start sending binary WebSocket messages containing audio data in one of the supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
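
As a rough illustration of the binary messages described above, the following sketch splits a raw audio buffer into fixed-duration chunks, each of which could be sent as one binary WebSocket message. The function name chunk_raw_audio and the 100 ms chunk duration are assumptions made for illustration and are not values taken from the Rev AI documentation. The default rate, sample width and channel count correspond to 16 kHz, 16-bit (S16LE), mono raw audio.

```python
from typing import Iterator


def chunk_raw_audio(
    audio: bytes,
    rate: int = 16000,       # samples per second
    sample_width: int = 2,   # bytes per sample (S16LE is 16-bit)
    channels: int = 1,       # mono
    chunk_ms: int = 100,     # assumed chunk duration, not a documented value
) -> Iterator[bytes]:
    """Yield consecutive chunks of raw PCM audio, one per binary message."""
    chunk_size = rate * sample_width * channels * chunk_ms // 1000
    # Keep every chunk aligned to whole sample frames.
    frame_size = sample_width * channels
    chunk_size -= chunk_size % frame_size
    if chunk_size <= 0:
        raise ValueError("chunk_ms is too small for this audio format")
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]
```

With the defaults, one second of audio (32,000 bytes) yields ten chunks of 3,200 bytes each. A real client would pass each chunk to its WebSocket library's binary send call.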

warning

The base URL is different from the base URL for the Asynchronous Speech-to-Text API.

Example

wss://api.rev.ai/speechtotext/v1/stream?access_token=<REVAI_ACCESS_TOKEN>&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=<METADATA>
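
A URL of this shape can be assembled programmatically. The sketch below is one possible way to do it with the Python standard library; the helper name build_stream_url is hypothetical, and keeping '/', ';' and '=' unescaped simply mirrors the example URL above rather than any documented requirement.

```python
from urllib.parse import quote, urlencode

STREAM_URL = "wss://api.rev.ai/speechtotext/v1/stream"


def build_stream_url(access_token: str, content_type: str, **options: str) -> str:
    """Return the streaming endpoint URL with its query string attached."""
    params = {"access_token": access_token, "content_type": content_type}
    # Optional parameters (language, metadata, ...) are passed through as given.
    params.update(options)
    # Leave '/', ';' and '=' readable, as in the example URL;
    # other reserved characters are percent-encoded.
    query = urlencode(params, safe="/;=", quote_via=quote)
    return f"{STREAM_URL}?{query}"


url = build_stream_url(
    "<REVAI_ACCESS_TOKEN>",
    "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
    language="en",
)
```

Note that the placeholder token in this usage example is percent-encoded because it contains angle brackets; a real access token would replace it.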

Requests

A WebSocket request consists of the following parts:

Request parameter                 Name                          Required  Default
Base URL                                                        Yes
Access token                      access_token                  Yes       None
Content type                      content_type                  Yes       None
Language                          language                      No        en
Metadata                          metadata                      No        None
Custom vocabulary                 custom_vocabulary_id          No        None
Profanity filter                  filter_profanity              No        false
Disfluencies                      remove_disfluencies           No        false
Delete after seconds              delete_after_seconds          No        None
Detailed partials                 detailed_partials             No        false
Start timestamp                   start_ts                      No        None
Maximum Segment Duration Seconds  max_segment_duration_seconds  No        None
Transcriber                       transcriber                   No        See transcriber section
Speaker Switch                    enable_speaker_switch         No        false
Skip Post-processing              skip_postprocessing           No        false
Priority                          priority                      No        speed
Maximum wait time for connection  max_connection_wait_seconds   No        60

Learn more about request parameters.

Responses

All transcript responses from the Streaming Speech-to-Text API are text messages containing serialized JSON. A transcript response is in one of two states: partial hypothesis or final hypothesis. Each JSON message contains a type property that indicates what kind of response it is.
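
A client typically branches on the type property of each message. The sketch below shows one way to do that; the literal values "partial" and "final" are assumptions about how the two hypothesis states are labelled, while "connected" is the message type mentioned under API limits. The function name handle_message is hypothetical.

```python
import json


def handle_message(raw: str) -> str:
    """Parse one text message and report how it would be handled."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "connected":
        return "connected"
    if kind == "partial":   # assumed value for a partial hypothesis
        return "partial hypothesis (may still change)"
    if kind == "final":     # assumed value for a final hypothesis
        return "final hypothesis"
    # Anything else is reported rather than silently dropped.
    return f"unhandled message type: {kind!r}"
```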

Learn more about responses.

API limits

The following limits are in place for the Streaming Speech-to-Text API:

  • Streaming concurrency limit is 10.
  • Time limit per stream is 3 hours.

When your stream approaches the 3-hour limit, you should initialize a new concurrent WebSocket connection. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new WebSocket and begin streaming audio to it.
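
The handover described above could be driven by a simple elapsed-time check, sketched below. The 5-minute safety margin and both function names are assumptions chosen for illustration; only the 3-hour limit and the "connected" message type come from this page.

```python
import time
from typing import Optional

STREAM_LIMIT_SECONDS = 3 * 60 * 60   # 3-hour limit per stream
ROLLOVER_MARGIN_SECONDS = 5 * 60     # assumed safety margin, not a documented value


def should_start_replacement(started_at: float, now: Optional[float] = None) -> bool:
    """Return True once the stream is within the margin of the time limit."""
    if now is None:
        now = time.monotonic()
    elapsed = now - started_at
    return elapsed >= STREAM_LIMIT_SECONDS - ROLLOVER_MARGIN_SECONDS


def can_switch(first_message_type: str) -> bool:
    """The replacement connection is ready once it reports "connected"."""
    return first_message_type == "connected"
```

A client would record time.monotonic() when the first connection is established, poll should_start_replacement periodically, open the second WebSocket when it returns True, and redirect audio only after can_switch is satisfied.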

attention

The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.

RTMP streams

warning

RTMP streams are not supported in a HIPAA context.

API endpoint

warning

The base URL is different from the base URL for the Asynchronous Speech-to-Text API.

All Real-Time Messaging Protocol (RTMP) streaming connections to Rev AI's Streaming Speech-to-Text API start as a POST HTTP request to https://api.rev.ai/speechtotext/v1/live_stream/rtmp with the user's access token as a Bearer authentication token. Users should include their intended job options in this HTTP POST request.

On successful authorization, the API returns a JSON object containing read_url and ingestion_url URL endpoints and a stream_name value for the session. The ingestion_url URL will have the correct query parameters and values for the job as specified by the user.

The client can now open a WebSocket connection to the read_url to receive streaming results, then begin streaming audio to the RTMP ingestion_url from the response, using the provided stream_name as the stream name for that session. As speech is detected, Rev AI returns hypotheses of the recognized speech content.
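
The session setup described above can be sketched with the Python standard library as follows. Sending the job options as a JSON body is an assumption (this page only says the options belong in the POST request), and the names RtmpSession, build_session_request and parse_session_response are hypothetical. The request is built but not sent, so the sketch runs without network access.

```python
import json
import urllib.request
from dataclasses import dataclass

RTMP_SESSION_URL = "https://api.rev.ai/speechtotext/v1/live_stream/rtmp"


@dataclass
class RtmpSession:
    read_url: str
    ingestion_url: str
    stream_name: str


def build_session_request(access_token: str, options: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an RTMP session."""
    return urllib.request.Request(
        RTMP_SESSION_URL,
        data=json.dumps(options).encode("utf-8"),  # JSON body is an assumption
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def parse_session_response(body: str) -> RtmpSession:
    """Extract the three documented fields from the JSON response."""
    data = json.loads(body)
    return RtmpSession(data["read_url"], data["ingestion_url"], data["stream_name"])
```

The request could then be sent with urllib.request.urlopen, and the decoded response body passed to parse_session_response before connecting to read_url and streaming to ingestion_url.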

Example

wss://api.rev.ai/speechtotext/v1/read_stream?read_token=<GENERATED_READ_TOKEN>

Requests

A WebSocket request consists of the following parts:

Request parameter                 Name                          Required  Default
read_url URL                                                    Yes
Access token                      access_token                  Yes       None
Content type                      content_type                  Yes       None
Language                          language                      No        en
Metadata                          metadata                      No        None
Custom vocabulary                 custom_vocabulary_id          No        None
Profanity filter                  filter_profanity              No        false
Disfluencies                      remove_disfluencies           No        false
Delete after seconds              delete_after_seconds          No        None
Detailed partials                 detailed_partials             No        false
Start timestamp                   start_ts                      No        None
Maximum Segment Duration Seconds  max_segment_duration_seconds  No        None
Transcriber                       transcriber                   No        See transcriber section
Speaker Switch                    enable_speaker_switch         No        false
Skip Post-processing              skip_postprocessing           No        false
Priority                          priority                      No        speed
Maximum wait time for connection  max_connection_wait_seconds   No        60

Learn more about request parameters.

Responses

All transcript responses from the Streaming Speech-to-Text API are text messages containing serialized JSON. A transcript response is in one of two states: partial hypothesis or final hypothesis. Each JSON message contains a type property that indicates what kind of response it is.

Learn more about responses.

API limits

The following limits are in place for the Streaming Speech-to-Text API:

  • Streaming concurrency limit is 10.
  • Time limit per stream is 3 hours.

When your stream approaches the 3-hour limit, you must request and obtain a new ingestion_url, read_url and stream_name. You should then initialize a new concurrent WebSocket connection to the new read_url endpoint. Once your WebSocket connection is accepted and the "connected" type message is received, you can switch to the new ingestion_url endpoint and begin streaming RTMP audio to it.

attention

The concurrency limit is configurable by Rev AI support. To adjust this limit, contact the support team at support@rev.ai.

Formats

Although the Streaming Speech-to-Text API technically supports all the formats supported by FFmpeg, it is recommended to send audio streams as raw audio, FLAC or WAV as other formats can result in slightly increased latency and inconsistent results.

HIPAA compliance

The API supports HIPAA-compliant processing. However, this feature is not activated by default and must be explicitly activated at account level. Learn more about Rev AI's HIPAA compliance and how to HIPAA-enable a Rev AI user account.

The API has the following limitations in HIPAA context:

  1. RTMP streams are not supported.

Error codes

WebSocket close messages carry one of a range of default error codes that signal why the socket connection was closed. See RFC 6455 for the pre-defined error codes.

In addition to these error codes, the following table defines Rev AI custom error codes in the 4xxx range. Some errors can be resolved simply by retrying the request. The table indicates which errors are likely to be resolved with successive retries.

Error code  Retry?  Description
4001        No      Unauthorized. Returned when the provided access token is invalid.
4002        No      Bad request. Returned when the connection’s content-type is invalid, metadata contains too many characters or the custom vocabulary does not exist with that id.
4003        No      Insufficient credits. Returned when the client does not have enough credits to continue the streaming session.
4010        Yes     Server shutting down. The connection was terminated due to the server shutting down.
4013        Yes     No instance available. No available streaming instances were found. User should attempt to retry the connection later.
4029        No      Too many requests. The number of concurrent connections exceeded the limit. Contact customer support to increase it.

attention

It is recommended that the maximum number of retries be limited to 5 attempts per request.
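
The retry guidance above can be captured in a small policy function. In the sketch below, the retryable codes and the limit of 5 attempts come from this section; the exponential backoff, its base and cap values, and the function names are assumptions added for illustration. Standard RFC 6455 close codes are not covered here and are treated as non-retryable.

```python
RETRYABLE_CODES = {4010, 4013}   # marked as retryable in the error table
MAX_RETRIES = 5                  # recommended maximum per request


def should_retry(close_code: int, attempts_so_far: int) -> bool:
    """Decide whether to reconnect after the socket closed with close_code."""
    if attempts_so_far >= MAX_RETRIES:
        return False
    # 4001, 4002, 4003 and 4029 are not resolved by retrying.
    return close_code in RETRYABLE_CODES


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Exponential backoff delay; base and cap are assumed values."""
    return min(cap, base * 2 ** attempt)
```

For example, a connection closed with 4013 on the first attempt would be retried after backoff_seconds(0) seconds, whereas a 4029 close would be surfaced to the caller immediately.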

Billing

For billing purposes we track two values during each stream: stream duration and audio duration. At the end of each stream you will be charged for the larger of the two, rounded up to the nearest second, with an absolute minimum of 15 seconds.

Audio duration (AD) refers to the number of seconds of audio that have been sent over the WebSocket. Stream duration (SD) refers to the number of real world seconds which have passed since the WebSocket connection was established.

Here are some examples:

  • AD: 4.1 seconds, SD: 4.1 seconds. Charged as 15 seconds.
  • AD: 14.1 seconds, SD: 14.1 seconds. Charged as 15 seconds.
  • AD: 15 seconds, SD: 15 seconds. Charged as 15 seconds.
  • AD: 15 seconds, SD: 16 seconds. Charged as 16 seconds.
  • AD: 16.1 seconds, SD: 16.1 seconds. Charged as 17 seconds.
  • AD: 24.7 seconds, SD: 14 seconds. Charged as 25 seconds.
  • AD: 14 seconds, SD: 24 seconds. Charged as 24 seconds.
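
The billing rule can be written as a one-line calculation. The sketch below applies the rule exactly as stated (the larger of the two durations, rounded up to a whole second, with a 15-second minimum); the function name billed_seconds is hypothetical and the code is not Rev AI's billing implementation.

```python
import math

MINIMUM_BILLED_SECONDS = 15


def billed_seconds(audio_duration: float, stream_duration: float) -> int:
    """Return the number of seconds charged for one stream."""
    longer = max(audio_duration, stream_duration)
    # Round up to a whole second, then apply the 15-second minimum.
    return max(MINIMUM_BILLED_SECONDS, math.ceil(longer))
```

For instance, billed_seconds(4.1, 4.1) gives 15, billed_seconds(15, 16) gives 16, and billed_seconds(16.1, 16.1) gives 17.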

Learn more about billing and credits.