Best Practices

This section provides guidelines for getting the best possible speech-to-text results from the API.

Formats

Rev AI uses FFmpeg and therefore technically supports every format that FFmpeg supports. However, we recommend sending audio streams as raw audio, FLAC, or WAV, because other formats can add slight latency and produce inconsistent results. PCM16 (raw 16-bit PCM audio with specific parameters) is ideal and gives the lowest latency of any format.
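To illustrate what "raw" PCM16 audio is, the following Python sketch uses only the standard library `wave` module to wrap raw 16-bit frames in a WAV container and to strip a WAV header back off, leaving interleaved, signed 16-bit little-endian samples. The function names are hypothetical, and the `audio/x-raw;...` content-type string is an assumption based on common raw-audio conventions; confirm the exact parameters in the streaming API documentation before relying on it.

```python
import io
import wave


def pcm16_to_wav(frames: bytes, sample_rate: int, channels: int) -> bytes:
    """Wrap raw PCM16 frames in a WAV container (hypothetical helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)  # keep the original rate; no resampling
        wav.writeframes(frames)
    return buf.getvalue()


def wav_to_pcm16(wav_bytes: bytes) -> tuple[bytes, int, int]:
    """Return (raw_frames, sample_rate, channels) from a 16-bit PCM WAV file.

    `wave` itself rejects non-PCM (compressed) WAV files with wave.Error.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        frames = wav.readframes(wav.getnframes())  # header stripped, S16LE payload
        return frames, wav.getframerate(), wav.getnchannels()


def raw_content_type(sample_rate: int, channels: int) -> str:
    # ASSUMPTION: content-type layout for raw S16LE audio; verify in the API docs.
    return (
        "audio/x-raw;layout=interleaved;"
        f"rate={sample_rate};format=S16LE;channels={channels}"
    )
```

Round-tripping a few frames through `pcm16_to_wav` and `wav_to_pcm16` returns the same bytes, sample rate, and channel count, which is a quick way to confirm that a file really holds uncompressed 16-bit audio.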

Sampling rate

If you control the audio source, record at a sample rate of 16 kHz or higher. Audio sampled as low as 8 kHz can also be transcribed. Do not up-sample or down-sample audio; submit it at its original sample rate and in its original format.
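A minimal sketch of checking a recording against this guidance, assuming the audio is a WAV file: it reads the native rate from the header and labels it without resampling anything. The thresholds come from this section; the function names and labels are hypothetical.

```python
import io
import wave

MIN_TRANSCRIBABLE_HZ = 8_000   # lowest rate this section says can be transcribed
RECOMMENDED_HZ = 16_000        # recommended recording rate


def native_sample_rate(wav_bytes: bytes) -> int:
    """Read the sample rate stored in a WAV header (the audio is not modified)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def classify_sample_rate(rate_hz: int) -> str:
    """Label a native sample rate against the guidance above."""
    if rate_hz >= RECOMMENDED_HZ:
        return "recommended"
    if rate_hz >= MIN_TRANSCRIBABLE_HZ:
        return "supported"      # submit as-is; do not up-sample to 16 kHz
    return "below minimum"
```

Note that an 8 kHz telephone recording is labeled "supported" rather than being converted: up-sampling cannot restore frequencies that were never captured, which is why the original rate should be kept.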

Mono-compatible audio

Ensure stereo audio streams do not contain phase cancellation between channels. Phase cancellation occurs when stereo channels have inverted polarity, causing audio to cancel out when mixed to mono during processing. This results in silence or severely reduced volume, producing incomplete or empty transcripts.

Warning: Test your audio in mono before streaming. If the content becomes inaudible or significantly quieter when both channels are combined, correct the phase relationship in your recording setup before transcribing.
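One way to run this mono test in code is sketched below, assuming stereo PCM16 little-endian input. It mixes the two channels to mono with a simple average and compares the level (RMS) of the mix with the average level of the individual channels. The helper names and the 0.5 threshold are hypothetical choices for illustration, not values defined by the API.

```python
import math
import sys
from array import array


def rms(samples) -> float:
    """Root-mean-square level of a sequence of numeric samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mono_fold_ratio(interleaved: bytes) -> float:
    """Ratio of the mono-mix level to the average per-channel level.

    `interleaved` is stereo PCM16 little-endian audio (L, R, L, R, ...).
    Near 1.0: channels reinforce each other; near 0.0: they cancel out.
    """
    pcm = array("h", interleaved)      # signed 16-bit samples
    if sys.byteorder == "big":
        pcm.byteswap()                 # input is little-endian; fix on big-endian hosts
    left, right = pcm[0::2], pcm[1::2]
    mono = [(a + b) / 2 for a, b in zip(left, right)]  # simple downmix
    channel_level = (rms(left) + rms(right)) / 2
    if channel_level == 0:
        return 1.0                     # silent input: nothing to cancel
    return rms(mono) / channel_level


def is_mono_compatible(interleaved: bytes, threshold: float = 0.5) -> bool:
    # Hypothetical cut-off: flag audio that loses more than half its level in mono.
    return mono_fold_ratio(interleaved) >= threshold
```

Identical channels give a ratio of 1.0 and fully inverted channels give 0.0. Unrelated material of equal level in each channel typically lands around 0.7, so a ratio well below that points to a polarity problem.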

Preprocessing audio

Don't pre-process audio. Processing can distort the audio and reduce transcript accuracy. Our speech engine is robust and designed to handle a wide variety of audio recordings.

Uncommon words

To improve recognition of uncommon words, such as proper names and specialized technical terms, submit a list of these words as a custom vocabulary along with your request. Read the Custom Vocabulary API documentation for more details.
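As a rough sketch of preparing such a list, the Python below trims whitespace, drops blanks and duplicates, and places the phrases in a request-options dictionary. The `custom_vocabularies` / `phrases` field names are an assumption here, and the helper name is hypothetical; check the Custom Vocabulary API documentation for the exact request shape and any limits on phrase count or length.

```python
import json


def custom_vocabulary_options(phrases: list[str]) -> dict:
    """Build the custom-vocabulary part of a transcription request (assumed shape)."""
    # Trim whitespace, drop blanks and duplicates, keep first-seen order.
    unique = dict.fromkeys(p.strip() for p in phrases if p.strip())
    # ASSUMPTION: field names; confirm against the Custom Vocabulary API documentation.
    return {"custom_vocabularies": [{"phrases": list(unique)}]}


if __name__ == "__main__":
    options = custom_vocabulary_options(["Rev AI", "FFmpeg", " FFmpeg ", ""])
    print(json.dumps(options))
    # prints {"custom_vocabularies": [{"phrases": ["Rev AI", "FFmpeg"]}]}
```

Deduplicating before submission keeps the list short and makes it easier to review which names and terms the request actually covers.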