Integrate Human Transcription into ASR Applications

By Ajita Mishra, Product Manager and Vikram Vaswani, Developer Advocate - Apr 01, 2022

Introduction

At Rev, our customers can access two types of transcription services, depending on their requirements:

AI-based transcription: performed using automated speech recognition (ASR). It is fast and less expensive, but its accuracy is affected by factors such as background noise and speaker accents.
Human transcription: performed by humans and up to 99% accurate. However, it is also more expensive and takes longer.

Thus far, AI-based transcription services have been delivered through the Rev AI APIs and human transcription services through the Rev.com website. However, with recent changes, it is now possible for developers to also access human transcription services through the Rev AI APIs. This tutorial explains this new feature and shows you how to get started with it.

Assumptions

This tutorial assumes that:

You have a Rev AI account and access token. If not, sign up for a free account and generate an access token.
You have an audio file to transcribe. If not, use this example audio file from Rev AI.

Overview

This new capability lets developers of downstream applications provide transcripts that are more accurate than any ASR system can currently provide, while retaining the option to obtain ASR-based transcripts as before. It does not require significant additional integration time or effort.

In addition to requesting full human- or ASR-transcribed results, this new feature also enables developers to selectively mix and match both options and create hybrid models to flexibly meet end-user needs. For example, developers can request ASR-based transcription first and then selectively "upgrade" segments (of one minute or longer) of the ASR transcript to human quality.

Notably, timestamps of the human-transcribed segments will be aligned to those of the ASR transcription result, making it easy to merge the results. All the standard job parameters (profanity filtering, punctuation, diarization), as well as custom vocabularies submitted through the API, will be honored by the human transcriber.
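Because the timestamps line up, a merge can be as simple as a replacement by time range. The following is a minimal sketch of that idea, not Rev AI's own merge logic. It assumes both transcripts have already been flattened into lists of hypothetical (start time in seconds, word) tuples, which is not the native shape of the API response.

```python
from typing import List, Tuple

# Hypothetical flattened shape: (start time in seconds, word text).
Word = Tuple[float, str]


def merge_transcripts(
    asr_words: List[Word],
    human_words: List[Word],
    segments: List[Tuple[float, float]],
) -> List[Word]:
    """Replace ASR words inside the upgraded segments with human words."""

    def in_upgraded_segment(ts: float) -> bool:
        return any(start <= ts < end for start, end in segments)

    # Keep ASR words outside the upgraded ranges and human words inside them.
    kept_asr = [w for w in asr_words if not in_upgraded_segment(w[0])]
    kept_human = [w for w in human_words if in_upgraded_segment(w[0])]
    # Sorting by start time interleaves the two sources chronologically.
    return sorted(kept_asr + kept_human, key=lambda w: w[0])


asr = [(0.0, "hello"), (61.0, "wreck"), (62.0, "a"), (63.0, "nice"), (130.0, "bye")]
human = [(61.0, "recognize"), (63.0, "speech")]
merged = merge_transcripts(asr, human, segments=[(60.0, 120.0)])
print(" ".join(text for _, text in merged))
```

The sketch relies only on the shared timeline: any ASR word that starts inside an upgraded segment is dropped in favor of the human words for that range.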

Benefits

Enabling developers to access both types of transcription services through a unified API offers a number of benefits:

Simplified development: This new functionality enables developers to order both human and AI transcription through the same API. It's a single API endpoint, a single set of credentials, familiar payloads, no new documentation to read, and easier debugging. Essentially, if a developer has an existing Rev AI integration, that integration is already prepared to support human transcription.
Support for different use cases: End-users have different requirements around transcription quality, price, and turnaround time. By having the ability to fulfill both human and ASR-based transcription requests through the same API, developers need to spend less time and effort adapting and optimizing their applications for different use cases.

Use cases

Typically, this functionality is useful in scenarios where full human-quality transcription is required but a quick-turnaround ASR version provides base data to inform and optimize the parameters for the human transcription request. For example:

A television network or other content owner may need to find all coverage of, say, a famous athlete over the last 20 years. ASR transcription could be used to index and identify clips and times within those clips where the athlete was mentioned, and then human transcription could be used to create high-accuracy records of what was said by or about that athlete.
Listeners of an earnings call may be interested in answers about "pricing" or "competition". ASR transcription could be used to prepare a rough draft of the discussion, followed by topic extraction to identify the timestamps at which these topics were mentioned, and then precise human transcription to understand what was said during those chosen time segments.
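Both use cases follow the same pattern: locate mentions in the ASR draft, then turn them into time windows for human transcription. The sketch below illustrates one way to do that under stated assumptions. The word list is a hypothetical flattened form of an ASR transcript, the keyword match is a plain string comparison rather than real topic extraction, and clipping windows to the media length is omitted. Windows are at least one minute long and are merged when they overlap, in line with the segment rules described later in this article.

```python
from typing import Dict, List, Tuple

MIN_SEGMENT_SECONDS = 60.0  # segments must be at least one minute long


def segments_for_keyword(
    words: List[Tuple[float, str]],
    keyword: str,
    padding: float = 30.0,
) -> List[Dict[str, str]]:
    """Build non-overlapping time windows around each keyword mention."""
    hits = [ts for ts, text in words if text.lower().strip(".,?!") == keyword.lower()]
    window_length = max(2 * padding, MIN_SEGMENT_SECONDS)
    windows: List[List[float]] = []
    for ts in sorted(hits):
        start = max(0.0, ts - padding)
        end = start + window_length
        if windows and start <= windows[-1][1]:
            # Overlaps (or touches) the previous window: extend it instead.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    # Format timestamps like the request examples in this article.
    return [{"start": f"{s:.2f}", "end": f"{e:.2f}"} for s, e in windows]


draft = [(12.0, "Pricing"), (45.0, "pricing."), (300.0, "competition"), (400.0, "pricing")]
print(segments_for_keyword(draft, "pricing"))
```

The resulting list has the same shape as the segments_to_transcribe value used in the request examples in this article.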

Get started

How does it work in practice? As previously mentioned, if you're already familiar with our Asynchronous Speech-to-Text API, there are just a few tweaks you need to make to your application code.

To convert an ASR transcription job to a human transcription job, add the "transcriber": "human" parameter to your request. All other parameters remain the same. In the example below, replace the <REVAI_ACCESS_TOKEN> placeholder with your Rev AI access token:

curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
--header 'Authorization: Bearer <REVAI_ACCESS_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"source_config": {"url": "https://www.rev.ai/FTC_Sample_1.mp3"},
"skip_diarization": false,
"skip_punctuation": false,
"transcriber": "human"
}'
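
For applications that submit jobs from code rather than from the command line, the same request can be assembled with Python's standard library. This is a sketch only: the helper name build_job_request is hypothetical, and the request is constructed but not sent, so it can be inspected first.

```python
import json
import urllib.request

API_URL = "https://api.rev.ai/speechtotext/v1/jobs"


def build_job_request(access_token: str, media_url: str, human: bool = False) -> urllib.request.Request:
    """Build (but do not send) a job submission request."""
    payload = {
        "source_config": {"url": media_url},
        "skip_diarization": False,
        "skip_punctuation": False,
    }
    if human:
        # The only change needed to switch from ASR to human transcription.
        payload["transcriber"] = "human"
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_job_request("<REVAI_ACCESS_TOKEN>", "https://www.rev.ai/FTC_Sample_1.mp3", human=True)
# To submit the job, pass the request to urllib.request.urlopen(request).
print(request.data.decode("utf-8"))
```

Keeping the human flag as a single argument reflects the point made above: an existing ASR integration differs from a human transcription request by one field.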

To request human transcription for specific segments of the media, add the segments_to_transcribe parameter to your request with starting and ending timestamps for each segment. Here's another example:

curl --location --request POST 'https://api.rev.ai/speechtotext/v1/jobs' \
--header 'Authorization: Bearer <REVAI_ACCESS_TOKEN>' \
--header 'Content-Type: application/json' \
--data-raw '{
"source_config": {"url": "https://www.rev.ai/FTC_Sample_1.mp3"},
"skip_diarization": false,
"skip_punctuation": false,
"segments_to_transcribe": [
{"start": "1.00", "end": "340.00"},
{"start": "640.00", "end": "1010.50"}
],
"transcriber": "human"
}'

A number of parameters specific to human transcription have been added to the API specification, including parameters to raise the priority of the job, perform verbatim transcription, and so on. These parameters are ignored for machine-transcribed jobs. Refer to the API documentation for more details.

A few important points to note about human transcription jobs:

Human transcription is priced differently from AI-based transcription, with additional charges for priority and verbatim transcription. Learn more about pricing.
Transcript files will not be available through the "My Files" page on the Rev website and can’t be used with the Rev editor.
For segment transcription, each segment must be at least one minute in length and segments cannot overlap.
Estimated turnaround time is 12-24 hours.
Custom vocabularies are limited to 20 terms.
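
Some of these constraints can be checked on the client before a job is submitted. The following sketch is one possible pre-flight check, not part of the Rev AI API: the function name check_human_job is hypothetical, and the custom vocabulary is assumed to be a flat list of terms.

```python
from typing import Dict, List

MIN_SEGMENT_SECONDS = 60.0
MAX_VOCABULARY_TERMS = 20


def check_human_job(segments: List[Dict[str, str]], vocabulary: List[str]) -> List[str]:
    """Return a list of human-readable problems; empty means none were found."""
    problems = []
    spans = sorted((float(s["start"]), float(s["end"])) for s in segments)
    for start, end in spans:
        if end - start < MIN_SEGMENT_SECONDS:
            problems.append(f"segment {start}-{end} is shorter than one minute")
    # After sorting, a segment overlaps if it starts before its predecessor ends.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            problems.append(f"segment starting at {next_start} overlaps the previous one")
    if len(vocabulary) > MAX_VOCABULARY_TERMS:
        problems.append(f"{len(vocabulary)} vocabulary terms exceed the limit of {MAX_VOCABULARY_TERMS}")
    return problems


example_segments = [{"start": "1.00", "end": "340.00"}, {"start": "640.00", "end": "1010.50"}]
print(check_human_job(example_segments, ["Rev AI"]))
```

Returning a list of problems instead of raising on the first one lets an application report every issue to the user in a single pass.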

Next steps

This new functionality enables developers to provide end-users with additional flexibility and optimize their applications for differing requirements around quality, price, and turnaround time.

Learn more about this feature by visiting the following links:

Documentation: Asynchronous Speech-To-Text API job submission
Code samples: Asynchronous Speech-To-Text API
Documentation: Asynchronous Speech-To-Text API best practices
Documentation: Custom Vocabulary API