All transcript responses from the Streaming Speech-to-Text API are text messages and are returned as serialized JSON. The transcript response has two states: partial hypothesis and final hypothesis.
The JSON will contain a type property which indicates what kind of response the message is. Valid values for this type property are "connected", "partial", and "final". The "connected" type is only returned once during the initial handshake when opening a WebSocket connection. All other responses should be of the type "partial" or "final".
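As a rough illustration of branching on the type property, the following Python sketch parses a text message and reports its kind. The function name `classify_message` and the sample payloads are hypothetical, and the type strings "partial" and "final" are assumed from the two hypothesis states described above.

```python
import json

# Type values this sketch recognizes; "partial" and "final" are assumed
# from the two hypothesis states described in the text.
KNOWN_TYPES = ("connected", "partial", "final")


def classify_message(raw: str) -> str:
    """Parse one text message and return its kind, or "unknown"."""
    message = json.loads(raw)
    kind = message.get("type")
    return kind if kind in KNOWN_TYPES else "unknown"


# Hypothetical sample payloads, not captured from a real session.
handshake = '{"type": "connected"}'
partial = '{"type": "partial", "ts": 0.0, "end_ts": 1.2}'

print(classify_message(handshake))
print(classify_message(partial))
```

Treating unrecognized values as "unknown" rather than raising keeps a client tolerant of message kinds it does not handle.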
Here is a brief description of the response object and its properties:
| Property | Type | Description |
|---|---|---|
| ts | double | The start time of the hypothesis in seconds |
| end_ts | double | The end time of the hypothesis in seconds |
| elements | array of Elements | Only present if |
While clients are streaming audio data, Rev AI processes and returns partial hypotheses. Partial hypotheses are the AI's best guess of what was said up to that moment in time.
Multiple partial hypotheses can be returned for the same audio segment. Successive partial hypotheses may contain different individual words as more audio is processed (see example).
Once the AI is confident in the transcript, a final hypothesis will be delivered. When Rev AI returns a final hypothesis, the output for that section of audio will no longer change.
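One way a client might act on this distinction is to overwrite its pending text whenever a partial hypothesis arrives and to commit a segment only when a final hypothesis arrives. The sketch below is a minimal illustration under that assumption; the class `TranscriptState`, its fields, and the sample messages are hypothetical, and the type strings "partial" and "final" are assumed rather than taken from a captured session.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TranscriptState:
    """Tracks committed final hypotheses and the latest partial one."""
    finals: list = field(default_factory=list)  # finals are never revised
    pending: Optional[dict] = None              # replaced by each new partial

    def apply(self, raw: str) -> None:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "partial":
            # A newer partial supersedes the previous guess for this audio.
            self.pending = message
        elif kind == "final":
            # The segment is settled: commit it and clear the pending guess.
            self.finals.append(message)
            self.pending = None
        # Other message kinds (such as "connected") are ignored here.


# Hypothetical message sequence for one audio segment.
state = TranscriptState()
for raw in (
    '{"type": "partial", "ts": 0.0, "end_ts": 0.8}',
    '{"type": "partial", "ts": 0.0, "end_ts": 1.4}',
    '{"type": "final", "ts": 0.0, "end_ts": 1.5}',
):
    state.apply(raw)

print(len(state.finals), state.pending)
```

Keeping finals in an append-only list mirrors the guarantee that output for a finalized section of audio no longer changes.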
These final hypotheses contain all the information of a partial hypothesis, but the elements will contain finer-grained details such as confidence scores. The timestamp will be measured in absolute time (relative to the start of the audio input).
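To show how the per-element details might be used, the sketch below averages the confidence scores in a final hypothesis. The element layout is an assumption made for illustration: each element is taken to be a JSON object with an optional numeric "confidence" field, and the "value" field and the sample message are hypothetical. Check the actual element schema before relying on these names.

```python
from typing import Optional


def mean_confidence(final_message: dict) -> Optional[float]:
    """Average the "confidence" values found in a message's elements.

    Assumes each element is a dict that may carry a numeric "confidence"
    field; elements without one are skipped.
    """
    scores = [
        element["confidence"]
        for element in final_message.get("elements", [])
        if isinstance(element, dict) and "confidence" in element
    ]
    if not scores:
        return None  # nothing to average
    return sum(scores) / len(scores)


# Hypothetical final hypothesis; field names inside "elements" are assumed.
sample = {
    "type": "final",
    "ts": 0.0,
    "end_ts": 1.5,
    "elements": [
        {"value": "hello", "confidence": 0.9},
        {"value": " "},
        {"value": "world", "confidence": 0.7},
    ],
}

print(mean_confidence(sample))
```

Returning None for a message with no scored elements lets callers distinguish "no data" from a genuinely low score.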
The final transcript for a completed streaming session can also be obtained via the Get Transcript endpoint of the Asynchronous Speech-to-Text API when using the JSON response schema. The availability of this transcript is subject to the normal deletion control rules.
See examples of the sequence of messages between a client and the Streaming Speech-to-Text API.