# Requests

There are two ways to interact with the Streaming Speech-to-Text API:

- WebSocket protocol
- RTMP streams

All connections to this API start as either a WebSocket handshake HTTP request (for WebSocket streams) or an HTTP POST request (for RTMP streams). On successful authorization, the client can start sending binary WebSocket messages containing audio data or an RTMP audio stream in one of our supported formats. As speech is detected, Rev AI returns hypotheses of the recognized speech content.

## Request parameters

A request to the Streaming Speech-to-Text API consists of the following parts:

| Option | Request parameter | Required | Default |
| --- | --- | --- | --- |
| Base URL (WebSocket) or `read_url` URL (RTMP) | | Yes | |
| [Access Token](#access-token) | `access_token` | Yes | None |
| [Content Type](#content-type) | `content_type` | Yes | None |
| [Language](#language) | `language` | No | `en` |
| [Metadata](#metadata) | `metadata` | No | None |
| [Custom Vocabulary](#custom-vocabulary) | `custom_vocabulary_id` | No | None |
| [Profanity Filter](#profanity-filter) | `filter_profanity` | No | `false` |
| [Disfluencies](#disfluencies) | `remove_disfluencies` | No | `false` |
| [Delete After Seconds](#delete-after-seconds) | `delete_after_seconds` | No | None |
| [Detailed Partials](#detailed-partials) | `detailed_partials` | No | `false` |
| [Start Timestamp](#start-timestamp) | `start_ts` | No | None |
| [Maximum segment duration seconds](#maximum-segment-duration-seconds) | `max_segment_duration_seconds` | No | None |
| [Transcriber](#transcriber) | `transcriber` | No | See transcriber section |
| [Speaker switch detection](#speaker-switch-detection) | `enable_speaker_switch` | No | `false` |
| [Skip Post-processing](#skip-post-processing) | `skip_postprocessing` | No | `false` |
| [Priority](#priority) | `priority` | No | `speed` |
| [Maximum wait time for connection](#maximum-wait-time-for-connection) | `max_connection_wait_seconds` | No | `60` |

### Access token

Clients must authenticate by including their [Rev AI access token](https://www.rev.ai/access-token) as a query parameter in their requests. If `access_token` is invalid or the query parameter is not present, the WebSocket connection will be closed with code `4001`.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1
```

### Content type

All requests must also contain a `content_type` query parameter. The content type describes the format of the audio data being sent. If you are submitting raw audio, Rev AI requires extra parameters as shown below. If the content type is invalid or not set, the WebSocket connection is closed with close code `4002`.

Rev AI officially supports these content types:

- `audio/x-raw` (has [additional requirements](#raw-file-content-type))
- `audio/x-flac`
- `audio/x-wav`

#### RAW file content type

You are required to provide additional information in `content_type` when `content_type` is `audio/x-raw`.

| Parameter (type) | Description | Allowed values | Required |
| --- | --- | --- | --- |
| layout (string) | The layout of channels within a buffer. Possible values are "interleaved" (for LRLRLRLR) and "non-interleaved" (LLLLRRRR). Not case-sensitive | interleaved, non-interleaved | audio/x-raw only |
| rate (int) | Sample rate of the audio bytes | Inclusive range from 8000 to 48000 Hz | audio/x-raw only |
| format (string) | Format of the audio samples. Case-sensitive. See `Allowed values` column for valid values | [List of valid formats](https://gstreamer.freedesktop.org/documentation/additional/design/mediatype-audio-raw.html?gi-language=c#formats) | audio/x-raw only |
| channels (int) | Number of audio channels that the audio samples contain | Inclusive range from 1 to 10 channels | audio/x-raw only |

These parameters follow the `content_type`, delimited by semicolons (`;`). Each parameter should be specified in the format `parameter_name=parameter_value`.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1
```
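When building these URLs programmatically, note that the raw-audio parameters are part of the `content_type` *value*, separated by semicolons, while distinct request parameters are separated by ampersands. Here is a minimal sketch of assembling the query string in Python; it assumes your WebSocket client accepts a percent-encoded query string, which is the standard URL form.

```python
# Sketch: build a streaming URL for raw audio (16 kHz, mono, S16LE).
# REVAI_ACCESS_TOKEN is a placeholder for your real token.
from urllib.parse import urlencode

BASE_URL = "wss://api.rev.ai/speechtotext/v1/stream"

params = {
    "access_token": "REVAI_ACCESS_TOKEN",
    # Raw-audio options live inside the content_type value itself:
    "content_type": "audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1",
}

# urlencode percent-encodes the semicolons and slash within content_type.
url = f"{BASE_URL}?{urlencode(params)}"
print(url)
```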
### Language

Prices for languages other than the default are set independently by language. Please refer to your contract for pricing information. If you are not a contract customer, pricing is listed [here](https://rev.ai/pricing).

Specify the transcription language with the `language` query parameter. When the language is not provided, transcription will default to English.

The `language` query parameter cannot be used along with the following options: `filter_profanity`, `remove_disfluencies`, and `custom_vocabulary_id`.

| Language | Language Code |
| --- | --- |
| English | `en` |
| French | `fr` |
| German | `de` |
| Italian | `it` |
| Japanese | `ja` |
| Korean | `ko` |
| Mandarin | `cmn` |
| Portuguese | `pt` |
| Spanish | `es` |

Additional requirements for the content type when using this parameter:

- `content_type` must be `audio/x-raw` or `audio/x-flac`
- when providing raw audio, it must be formatted as `S16LE`
- `rate` must be included, regardless of content type

#### Examples

```bash
# WebSocket protocol with audio/x-raw
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&language=LANGUAGE_CODE

# WebSocket protocol with audio/x-flac
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-flac;rate=16000&language=LANGUAGE_CODE
```

### Metadata

Metadata to be associated with the job may be provided via the `metadata` query parameter.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&metadata=METADATA
```

### Custom vocabulary

If your streaming session contains domain-specific words or phrases that may not be in our dictionary, you can improve the accuracy of your transcript by creating a custom vocabulary to include with your streaming session. The custom vocabulary `id` can be included as a query parameter in the request to the Streaming Speech-to-Text API. The API will then be able to recognize your custom words and phrases whenever they are spoken.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-wav&custom_vocabulary_id=cv5ZqltdU0lFA4
```

Learn more about [using a custom vocabulary](/api/custom-vocabulary/get-started).
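Custom vocabularies are created ahead of time through a separate API, and the resulting `id` is what you pass here. As a hedged sketch only, the request below assumes the endpoint and body shape described in the custom vocabulary guide linked above; verify both against that guide before relying on them.

```python
# Sketch: create a custom vocabulary and capture its id for use as the
# custom_vocabulary_id query parameter. Endpoint and body shape are
# assumptions drawn from the custom vocabulary guide.
import requests

response = requests.post(
    "https://api.rev.ai/speechtotext/v1/vocabularies",
    headers={"Authorization": "Bearer REVAI_ACCESS_TOKEN"},
    json={"custom_vocabularies": [{"phrases": ["Rev AI", "myocarditis"]}]},
)
response.raise_for_status()

vocabulary_id = response.json()["id"]  # e.g. "cv5ZqltdU0lFA4"
print(vocabulary_id)
```

Custom vocabularies are processed asynchronously, so allow the vocabulary to finish processing before starting a stream that references it.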
### Profanity filter

Optionally, you can filter profanity by adding the `filter_profanity=true` query parameter to your streaming request. This setting is disabled by default.

We filter approximately 600 profanities, which covers most use cases. If a transcribed word matches a word on this list, all characters of the word will be replaced by asterisks except for the first and last character.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&filter_profanity=true
```

Here is an example partial hypothesis response with filtered profanity:

```json
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 1.43,
  "elements": [
    { "type": "text", "value": "a*****e" }
  ]
}
```

### Disfluencies

You can prevent disfluencies from showing up in the resulting transcript by adding the `remove_disfluencies=true` query parameter to your streaming request. This setting is optional and defaults to `false`. Currently, disfluencies are defined as 'ums' and 'uhs'.

When `remove_disfluencies` is set to `true`, disfluencies will not appear in either partial or final hypotheses.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&remove_disfluencies=true
```

Here is an example partial hypothesis response with disfluencies:

```json
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    { "type": "text", "value": "life" },
    { "type": "text", "value": "uh" },
    { "type": "text", "value": "finds" },
    { "type": "text", "value": "a" },
    { "type": "text", "value": "way" }
  ]
}
```

Here is the partial hypothesis response for the same audio without disfluencies:

```json
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 3.24,
  "elements": [
    { "type": "text", "value": "life" },
    { "type": "text", "value": "finds" },
    { "type": "text", "value": "a" },
    { "type": "text", "value": "way" }
  ]
}
```

### Delete after seconds

You can specify how many seconds after completion the job should be deleted by adding the `delete_after_seconds` query parameter to your streaming request. The number of seconds provided must range from `0` seconds to `2592000` seconds (30 days).

It may take up to 2 minutes after the scheduled time for the job to be deleted unless `delete_after_seconds = 0`. When `delete_after_seconds = 0`, the audio and transcript are immediately deleted and are never stored.

The `delete_after_seconds` parameter is optional but, when provided, it overrides the auto delete preference set on your Rev AI account.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&delete_after_seconds=SECONDS
```

### Detailed partials

You can choose to receive timestamps and confidence scores in the partial hypotheses by adding the query parameter `detailed_partials=true`. This setting is disabled by default.

The `detailed_partials` parameter enables usage of values that are otherwise only available in the final hypotheses. For example:

- Timestamps in partials can be used to display transcribed words earlier.
- Confidence scores offer the option to write custom logic for displaying the transcribed words (see the sketch after the examples below).

When using `detailed_partials`, you can expect a slight degradation of 1% in word error rate for the final hypothesis.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&detailed_partials=true
```

Here is an example partial hypothesis response without additional details:

```json
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    { "type": "text", "value": "life" },
    { "type": "text", "value": "begins" }
  ]
}
```

Here is an example partial hypothesis response for the same audio with additional details:

```json
{
  "type": "partial",
  "ts": 0.0,
  "end_ts": 2.18,
  "elements": [
    { "type": "text", "value": "life", "ts": 0.0, "end_ts": 1.5, "confidence": 0.83 },
    { "type": "text", "value": "begins", "ts": 1.5, "end_ts": 1.98, "confidence": 0.7 }
  ]
}
```
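One way to use these confidence scores is to hold low-confidence words back from the display until a later partial (or the final hypothesis) firms them up. Here is a minimal sketch of that logic in Python; the message shape follows the JSON examples above, and the `0.75` threshold is an arbitrary illustration.

```python
# Sketch: pick the words of a hypothesis that are worth displaying.
# Partial elements are kept only above a confidence threshold; final
# hypotheses keep every word.
import json

CONFIDENCE_THRESHOLD = 0.75  # arbitrary illustrative value

def displayable_words(message: str) -> list[str]:
    hypothesis = json.loads(message)
    words = [e for e in hypothesis["elements"] if e["type"] == "text"]
    if hypothesis["type"] == "partial":
        words = [e for e in words if e.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD]
    return [e["value"] for e in words]
```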
### Start timestamp

You can provide a starting timestamp to offset all hypothesis timings by adding `start_ts` as a query parameter to the request. If provided, all output hypotheses will have their `ts` and `end_ts` offset by the number of seconds provided in the `start_ts` parameter. The `start_ts` parameter must be a positive double value.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&start_ts=60.5
```

Here is an example response without a specified start timestamp:

```json
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    { "type": "text", "value": "One", "ts": 1.04, "end_ts": 1.55, "confidence": 1.0 },
    { "type": "punct", "value": " " },
    { "type": "text", "value": "two", "ts": 1.84, "end_ts": 2.15, "confidence": 1.0 },
    { "type": "punct", "value": "." }
  ]
}
```

Here is an example response with the start timestamp set to `60.5`:

```json
{
  "type": "final",
  "ts": 61.51,
  "end_ts": 63.7,
  "elements": [
    { "type": "text", "value": "One", "ts": 61.54, "end_ts": 62.05, "confidence": 1.0 },
    { "type": "punct", "value": " " },
    { "type": "text", "value": "two", "ts": 62.34, "end_ts": 62.65, "confidence": 1.0 },
    { "type": "punct", "value": "." }
  ]
}
```

### Maximum segment duration seconds

You can provide a maximum time you are willing to wait between final hypotheses by adding the `max_segment_duration_seconds` query parameter to your streaming request. The number of seconds provided must range from `5` seconds to `30` seconds.

This parameter changes the amount of context available to our engine when creating final hypotheses and therefore has a minor effect on word error rate; higher values correlate with fewer transcription errors.

The `max_segment_duration_seconds` limit is not exact. Because words can fall on the boundary of final hypotheses, you should expect final hypotheses to be up to 0.5 seconds longer than your specified value. For example, if you specify `max_segment_duration_seconds=5`, final hypotheses may be up to 5.5 seconds in length.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_segment_duration_seconds=15
```

The output format doesn't change as a result of this parameter.

### Transcriber

You can define a specific [transcription model](/api/streaming/transcribers) to use for your stream with the `transcriber` option. The output format does not change, regardless of which transcriber is selected.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2
```

### Speaker switch detection

This option is only available for streams using the `transcriber=machine_v2` option.

You can choose to enable speaker switch detection by adding `enable_speaker_switch=true` to your request. Doing so will add a new field named `speaker_id` to final hypotheses. Whenever the system detects that the active speaker in the audio has changed, the `speaker_id` value is incremented to indicate the change.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&enable_speaker_switch=true
```

Here is an example response with speaker switch detection enabled:

```json
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "speaker_id": 1000,
  "elements": [
    { "type": "text", "value": "One", "ts": 1.04, "end_ts": 1.55, "confidence": 1.0 },
    { "type": "punct", "value": " " },
    { "type": "text", "value": "two", "ts": 1.84, "end_ts": 2.15, "confidence": 1.0 },
    { "type": "punct", "value": "." }
  ]
}
```
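A client can use the incrementing `speaker_id` to render speaker-labelled captions. Here is a minimal sketch in Python; the message shape follows the example above, and the label format is just an illustration.

```python
# Sketch: turn a final hypothesis into a speaker-labelled caption line.
# Partials carry no speaker_id and are ignored here.
import json

def caption_line(message: str) -> str | None:
    hypothesis = json.loads(message)
    if hypothesis["type"] != "final":
        return None
    # Join text and punctuation elements back into a sentence.
    text = "".join(e["value"] for e in hypothesis["elements"])
    return f"Speaker {hypothesis['speaker_id']}: {text}"
```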
### Skip post-processing

*Only available for English and Spanish languages*

You can choose to skip post-processing operations, such as inverse text normalization (ITN), casing, and punctuation, by adding `skip_postprocessing=true` to your request. Doing so will result in a small decrease in latency; however, your final hypotheses will no longer contain capitalization, punctuation, or inverse text normalization (for example, `five hundred` will not be normalized to `500`).

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&skip_postprocessing=true
```

Here is an example response with `skip_postprocessing` enabled:

```json
{
  "type": "final",
  "ts": 1.01,
  "end_ts": 3.2,
  "elements": [
    { "type": "text", "value": "five", "ts": 1.04, "end_ts": 1.55, "confidence": 1.0 },
    { "type": "punct", "value": " " },
    { "type": "text", "value": "hundred", "ts": 1.84, "end_ts": 2.15, "confidence": 1.0 },
    { "type": "punct", "value": " " },
    { "type": "text", "value": "dollars", "ts": 2.23, "end_ts": 2.87, "confidence": 1.0 }
  ]
}
```

### Priority

*Only available for English and Spanish languages*

*Only available for the `machine_v2` transcriber*

You can configure what our engine should prioritize for your stream. The available choices are `speed` and `accuracy`.

- `speed` is the default; the engine prioritizes creating partial and final hypotheses more frequently to reduce latency.
- `accuracy` causes the engine to produce results less frequently but greatly increases the accuracy of results.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&transcriber=machine_v2&priority=accuracy
```
### Maximum wait time for connection

You can provide a maximum time you are willing to wait for a WebSocket connection to become available by adding the `max_connection_wait_seconds` query parameter to your streaming request. The number of seconds provided must range from `60` seconds to `600` seconds.

To handle the rare occasion that the API is unable to allocate a speech worker immediately, set this value to something greater than 60 seconds. This helps requests get prioritized and assigned to a speech worker in the order in which they were received, and lets the service scale up to meet the total number of requested connections. If this timeout is reached, the client will receive a WebSocket close message with [error code `4013`](/api/streaming#error-codes). This indicates that the API is unable to service the connection within the specified wait time and that the client should try again later.

An increased connection latency does not mean an increased partial latency. Once connected, partial latencies will behave as normal. The client can buffer audio data during the connection wait period and send it as a burst of audio data messages after the connection to a worker is established. Audio will be processed faster than real-time, and partials will eventually catch up to the client in real time. The catch-up period depends on the amount of audio buffered, which is typically a function of the connection wait duration.

#### Example

```bash
# WebSocket protocol
wss://api.rev.ai/speechtotext/v1/stream?access_token=REVAI_ACCESS_TOKEN&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1&max_connection_wait_seconds=300
```

## Request stages

### RTMP streams

#### Initial connection

All requests begin as an HTTP `POST` request. An RTMP streaming session is requested with the user's access token as a Bearer authentication token. The request body is a JSON document with streaming job options.

```
Client --> Rev AI

POST /speechtotext/v1/live_stream/rtmp HTTP/1.1
Host: api.rev.ai
Accept: */*
Authorization: Bearer REVAI_ACCESS_TOKEN
Content-Type: application/json

{
  "metadata": "test",
  "filter_profanity": "true",
  "detailed_partials": "true",
  "remove_disfluencies": "true",
  "language": "fr",
  "transcriber": "machine",
  "max_segment_duration_seconds": 20,
  "custom_vocabulary_id": "cv5ZqltdU0lFA4",
  "enable_speaker_switch": false,
  "skip_postprocessing": false,
  "priority": "accuracy"
}
```

If authorization is successful, the client receives a JSON response which looks like the example response below:

```json
{
  "ingestion_url": "rtmps://rtmp.rev.ai/streaming/v1",
  "stream_name": "wt=JFynCno&md=test&fp=True&rd=True&dp=True&cw=60",
  "read_url": "wss://api.rev.ai/speechtotext/v1/read_stream?read_token=ZZ-132"
}
```

#### WebSocket reader connection

A WebSocket connection must be opened to the `read_url` URL in order to receive streaming results. Once this connection is opened, results from your RTMP audio stream will be generated in the same [response format](/api/streaming/responses) as that for WebSocket streams.

The WebSocket connection should be opened before the RTMP connection is opened in order to ensure no audio data is skipped.

#### RTMP audio submission

The RTMP audio can now be streamed to the `ingestion_url` URL endpoint provided in the initial response, using the provided `stream_name` as the stream name for that session.

For example, to stream RTMP audio using `ffmpeg`, use the following command, replacing the `FILE_PATH` placeholder with the path to your audio file and the `INGESTION_URL` and `STREAM_NAME` placeholders with the values from the initial JSON response.

```bash
ffmpeg -re -i FILE_PATH -c copy -f flv INGESTION_URL/STREAM_NAME
```

When using `ffmpeg`, escape the `&` characters within the `STREAM_NAME` value by placing a backslash (`\`) before each of them.
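Putting these stages together, here is a minimal Python sketch of requesting an RTMP session and opening the reader connection. It uses the endpoint, body options, and response fields shown above, and assumes the third-party `requests` and `websockets` packages are installed.

```python
# Sketch: request an RTMP streaming session, then open the read_url
# reader before starting the RTMP stream so no audio is skipped.
import asyncio
import requests
import websockets

session = requests.post(
    "https://api.rev.ai/speechtotext/v1/live_stream/rtmp",
    headers={"Authorization": "Bearer REVAI_ACCESS_TOKEN"},
    json={"metadata": "test", "detailed_partials": "true"},
).json()

# Point your encoder (for example ffmpeg, as above) at this target:
print("RTMP target:", f"{session['ingestion_url']}/{session['stream_name']}")

async def read_results() -> None:
    async with websockets.connect(session["read_url"]) as reader:
        async for message in reader:  # hypotheses in the standard format
            print(message)

asyncio.run(read_results())
```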
### WebSocket protocol

#### Initial connection

All requests begin as an HTTP `GET` request. A WebSocket request is declared by including the header values `Upgrade: websocket` and `Connection: Upgrade`.

```
Client --> Rev AI

GET /speechtotext/v1/stream HTTP/1.1
Host: api.rev.ai
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: Chxzu/uTUCmjkFH9d/8NTg==
Sec-WebSocket-Version: 13
Origin: http://api.rev.ai
```

If authorization is successful, the request is upgraded to a WebSocket connection.

```
Client <-- Rev AI

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: z0pcAwXZZRVlMcca8lmHCPzvrKU=
```

After the connection has been upgraded, the server will return a `"connected"` message. You must wait for this connected message before sending binary audio data. The response includes an `id`, which is the corresponding job identifier, as shown in the example below:

```json
{
  "type": "connected",
  "id": "s1d24ax2fd21"
}
```

If Rev AI currently does not have the capacity to handle the request, a WebSocket close message is returned with close code `4013`.

A `HTTP/1.1 400 Bad Request` response indicates that the request is not a WebSocket upgrade request.

#### Audio submission

WebSocket messages sent to Rev AI must be of one of these two WebSocket message types:

| Message type | Message requirements | Notes |
| --- | --- | --- |
| Binary | Audio data is transmitted as binary data and should be sent in chunks of 250ms or more. Streams sending audio chunks that are less than 250ms in size may experience increased transcription latency. | The format of the audio must match that specified in the [`content_type` parameter](#content-type). |
| Text | The client should send an End-of-Stream (`"EOS"`) text message to signal the end of audio data and gracefully close the WebSocket connection. On an `EOS` message, Rev AI will return a final hypothesis along with a WebSocket close message. | Currently, only this one text message type is supported. [WebSocket close type](https://developer.mozilla.org/en-US/docs/Web/API/WebSocket/close) messages are explicitly not supported as a message type and will abruptly close the socket connection with a `1007 Invalid Payload` error. Clients will not receive their final hypothesis in this case. Any other text messages, including incorrectly capitalized messages such as `"eos"` and `"Eos"`, are invalid and will also close the socket connection with a `1007 Invalid Payload` error. |

See [examples of the sequence of messages](/api/streaming/example-session) between a client and the Streaming Speech-to-Text API.
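Here is a minimal end-to-end sketch of this protocol in Python, using the third-party `websockets` package. It assumes raw 16 kHz mono S16LE audio read from a local file named `audio.raw` (a placeholder); a live client would instead forward audio from its capture source in chunks of 250ms or more.

```python
# Sketch: connect, wait for the "connected" message, stream binary audio
# in 250 ms chunks, send "EOS", then read hypotheses until the close.
import asyncio
import json
import websockets

URL = (
    "wss://api.rev.ai/speechtotext/v1/stream"
    "?access_token=REVAI_ACCESS_TOKEN"
    "&content_type=audio/x-raw;layout=interleaved;rate=16000;format=S16LE;channels=1"
)
# 250 ms of 16 kHz mono S16LE audio: 16000 samples/s * 2 bytes / 4.
CHUNK_BYTES = 16000 * 2 // 4

async def stream_file(path: str) -> None:
    async with websockets.connect(URL) as ws:
        connected = json.loads(await ws.recv())
        assert connected["type"] == "connected", connected

        with open(path, "rb") as audio:
            while chunk := audio.read(CHUNK_BYTES):
                await ws.send(chunk)   # binary message: raw audio bytes
        await ws.send("EOS")           # text message: graceful end of stream

        async for message in ws:       # partial and final hypotheses
            print(message)

asyncio.run(stream_file("audio.raw"))
```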