Need to convert audio files to text programmatically? Amazon Transcribe is a managed speech-to-text service that handles the machine learning complexity for you. This guide walks through a complete Ruby integration: from IAM setup and S3 uploads to polling for results, parsing the output JSON, handling errors, and keeping costs under control.
Prerequisites
You need an AWS account with billing enabled. Amazon Transcribe requires audio files to live in S3, so you will work with two AWS services: S3 for storage and Transcribe for the actual speech-to-text conversion.
If you are new to packaging and distributing Ruby libraries, our guide to building your first Ruby gem covers the fundamentals. Install both gems:
gem install aws-sdk-transcribeservice aws-sdk-s3
Or add them to your Gemfile:
gem 'aws-sdk-transcribeservice'
gem 'aws-sdk-s3'
IAM Policy Setup
Before writing any Ruby code, your IAM user (or role) needs the right permissions. Create a policy with these minimal permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"transcribe:StartTranscriptionJob",
"transcribe:GetTranscriptionJob",
"transcribe:DeleteTranscriptionJob",
"transcribe:ListTranscriptionJobs"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::your-audio-bucket",
"arn:aws:s3:::your-audio-bucket/*"
]
}
]
}
Attach this policy to your IAM user via the AWS Console or CLI. The S3 permissions are required because Transcribe reads audio directly from your bucket.
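If you prefer the CLI, creating and attaching the policy might look like the sketch below. The policy file name, policy name, user name, and account ID are all placeholders you would substitute with your own values:

```shell
# Create a managed policy from the JSON document above,
# assuming it was saved locally as transcribe-policy.json
aws iam create-policy \
  --policy-name TranscribeRubyAccess \
  --policy-document file://transcribe-policy.json

# Attach it to your IAM user (substitute your account ID and user name)
aws iam attach-user-policy \
  --user-name your-iam-user \
  --policy-arn arn:aws:iam::123456789012:policy/TranscribeRubyAccess
```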
Configure AWS Credentials
Set your credentials via environment variables:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1
Or create ~/.aws/credentials:
[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
region = us-east-1
The SDK picks these up automatically. In production, prefer IAM roles attached to your EC2 instance or ECS task rather than long-lived access keys.
Upload Audio to S3
Transcribe cannot process local files directly. You need to upload them to S3 first. Here is a helper that uploads a local file and returns the S3 URI:
require 'aws-sdk-s3'
def upload_audio(file_path, bucket:, region: 'us-east-1')
s3 = Aws::S3::Client.new(region: region)
key = "audio/#{File.basename(file_path)}"
File.open(file_path, 'rb') do |file|
s3.put_object(bucket: bucket, key: key, body: file)
end
"s3://#{bucket}/#{key}"
end
s3_uri = upload_audio('/tmp/interview.mp3', bucket: 'my-audio-bucket')
puts s3_uri # => s3://my-audio-bucket/audio/interview.mp3
Supported audio formats include MP3, MP4, WAV, FLAC, OGG, AMR, and WebM. WAV and FLAC tend to produce better results because they are lossless, but MP3 works fine for most use cases.
Start a Transcription Job
Here is a minimal working example that kicks off a transcription:
require 'aws-sdk-transcribeservice'
client = Aws::TranscribeService::Client.new(region: 'us-east-1')
job_name = "transcription-#{Time.now.to_i}"
audio_uri = 's3://your-bucket/audio.mp3'
client.start_transcription_job(
transcription_job_name: job_name,
language_code: 'en-US',
media_format: 'mp3',
media: { media_file_uri: audio_uri }
)
puts "Job started: #{job_name}"
Job names must be unique within your account and region. Using a timestamp or SecureRandom.hex avoids collisions. The job runs asynchronously, so start_transcription_job returns immediately and you need to poll for completion.
Poll for Results
The job typically takes 20-50% of the audio duration to complete. A 10-minute recording might finish in 2-5 minutes. Poll with exponential backoff to avoid hammering the API:
def wait_for_transcription(client, job_name, timeout: 600)
start_time = Time.now
delay = 5
loop do
response = client.get_transcription_job(transcription_job_name: job_name)
status = response.transcription_job.transcription_job_status
case status
when 'COMPLETED'
return response.transcription_job
when 'FAILED'
raise "Transcription failed: #{response.transcription_job.failure_reason}"
end
if Time.now - start_time > timeout
raise "Timeout waiting for transcription after #{timeout}s"
end
sleep delay
delay = [delay * 1.5, 30].min
end
end
job = wait_for_transcription(client, job_name)
The exponential backoff starts at 5 seconds and caps at 30 seconds. This keeps you well under the API's request rate limits for GetTranscriptionJob while still detecting completion quickly.
Parse the Transcription Output
When a job completes, Transcribe writes the result JSON to a service-managed location and exposes it through a short-lived presigned URL (unless you specified your own output bucket). The JSON structure contains more than just the raw text. It includes word-level timestamps and confidence scores:
require 'net/http'
require 'json'
def fetch_transcript(job)
uri = URI(job.transcript.transcript_file_uri)
response = Net::HTTP.get(uri)
JSON.parse(response)
end
result = fetch_transcript(job)
full_text = result['results']['transcripts'].first['transcript']
puts full_text
For more granular access, iterate over individual words with their timestamps and confidence:
def extract_words_with_timestamps(result)
items = result['results']['items']
items.filter_map do |item|
next unless item['type'] == 'pronunciation'
{
word: item['alternatives'].first['content'],
confidence: item['alternatives'].first['confidence'].to_f,
start_time: item['start_time'].to_f,
end_time: item['end_time'].to_f
}
end
end
words = extract_words_with_timestamps(result)
words.each do |w|
puts "#{w[:start_time]}s - #{w[:word]} (#{(w[:confidence] * 100).round(1)}%)"
end
This is useful for building subtitle files, highlighting low-confidence words for manual review, or syncing text with audio playback. For an alternative approach using Google Cloud, see our guide on speech recognition with Ruby.
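As one example of the subtitle use case, here is a minimal sketch that turns the timestamped word hashes produced above into SRT-formatted output. The grouping rule (close a cue once it spans three seconds) is an arbitrary choice for illustration, not part of the Transcribe output:

```ruby
# Convert seconds to the SRT timestamp format HH:MM:SS,mmm
def srt_timestamp(seconds)
  millis = (seconds * 1000).round
  format('%02d:%02d:%02d,%03d',
         millis / 3_600_000, (millis / 60_000) % 60,
         (millis / 1000) % 60, millis % 1000)
end

# Group timestamped words into cues of roughly cue_length seconds each
def build_srt(words, cue_length: 3.0)
  cues = []
  current = []
  words.each do |w|
    current << w
    if w[:end_time] - current.first[:start_time] >= cue_length
      cues << current
      current = []
    end
  end
  cues << current unless current.empty?

  cues.each_with_index.map do |cue, i|
    start  = srt_timestamp(cue.first[:start_time])
    finish = srt_timestamp(cue.last[:end_time])
    text   = cue.map { |w| w[:word] }.join(' ')
    "#{i + 1}\n#{start} --> #{finish}\n#{text}\n"
  end.join("\n")
end
```

Feeding it the `words` array from extract_words_with_timestamps yields text you can write straight to a .srt file.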
Complete Reusable Class
Here is a production-ready class that ties everything together:
require 'aws-sdk-transcribeservice'
require 'aws-sdk-s3'
require 'net/http'
require 'json'
require 'securerandom'
class AudioTranscriber
def initialize(region: 'us-east-1')
@region = region
@client = Aws::TranscribeService::Client.new(region: region)
@s3 = Aws::S3::Client.new(region: region)
end
def transcribe(s3_uri, language: 'en-US', output_bucket: nil)
job_name = "job-#{SecureRandom.hex(8)}"
format = detect_format(s3_uri)
params = {
transcription_job_name: job_name,
language_code: language,
media_format: format,
media: { media_file_uri: s3_uri }
}
params[:output_bucket_name] = output_bucket if output_bucket
@client.start_transcription_job(params)
job = wait_for_completion(job_name)
fetch_result(job)
end
def upload_and_transcribe(file_path, bucket:, language: 'en-US')
key = "audio/#{SecureRandom.hex(4)}-#{File.basename(file_path)}"
File.open(file_path, 'rb') do |file|
@s3.put_object(bucket: bucket, key: key, body: file)
end
transcribe("s3://#{bucket}/#{key}", language: language)
end
private
def detect_format(uri)
ext = File.extname(uri).delete('.').downcase
%w[mp3 mp4 wav flac ogg amr webm].include?(ext) ? ext : 'mp3'
end
def wait_for_completion(job_name, timeout: 600)
deadline = Time.now + timeout
delay = 5
loop do
resp = @client.get_transcription_job(transcription_job_name: job_name)
job = resp.transcription_job
return job if job.transcription_job_status == 'COMPLETED'
raise "Failed: #{job.failure_reason}" if job.transcription_job_status == 'FAILED'
raise "Timeout after #{timeout}s" if Time.now > deadline
sleep delay
delay = [delay * 1.5, 30].min
end
end
def fetch_result(job)
uri = URI(job.transcript.transcript_file_uri)
data = JSON.parse(Net::HTTP.get(uri))
{
text: data['results']['transcripts'].first['transcript'],
items: data['results']['items'],
raw: data
}
end
end
transcriber = AudioTranscriber.new
result = transcriber.upload_and_transcribe(
'/tmp/meeting.mp3',
bucket: 'my-audio-bucket'
)
puts result[:text]
The upload_and_transcribe method handles the full pipeline: upload to S3, start the job, poll, and return parsed results. The returned hash gives you the full text, individual items with timestamps, and the raw JSON if you need it.
Error Handling
Several things can go wrong: invalid audio formats, permissions issues, rate limits, or network failures. Handle them explicitly:
def safe_transcribe(s3_uri, retries: 2)
transcriber = AudioTranscriber.new
transcriber.transcribe(s3_uri)
rescue Aws::TranscribeService::Errors::BadRequestException => e
puts "Invalid request: #{e.message}"
nil
rescue Aws::TranscribeService::Errors::LimitExceededException => e
if retries > 0
sleep 10
safe_transcribe(s3_uri, retries: retries - 1)
else
puts "Rate limited after retries: #{e.message}"
nil
end
rescue Aws::TranscribeService::Errors::ConflictException => e
puts "Job name conflict: #{e.message}"
nil
rescue Aws::S3::Errors::AccessDenied => e
puts "S3 access denied. Check IAM permissions: #{e.message}"
nil
rescue StandardError => e
puts "Transcription error: #{e.message}"
nil
end
The most common failure is BadRequestException, caused by unsupported audio formats or an S3 URI that Transcribe cannot access. Double-check that your IAM policy grants Transcribe read access to the bucket.
Supported Languages
Amazon Transcribe supports over 100 languages and dialects. Pass the correct language code when starting a job:
'en-US' # English (US)
'en-GB' # English (UK)
'es-ES' # Spanish
'fr-FR' # French
'de-DE' # German
'pt-BR' # Portuguese (Brazil)
'ja-JP' # Japanese
'zh-CN' # Chinese (Mandarin)
If you are unsure of the language, you can enable automatic language identification by replacing language_code with identify_language: true. Transcribe will detect the dominant language in the audio.
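With language identification enabled, the request parameters might look like this sketch. The optional language_options list narrows detection to a known candidate set; the job name and URI are placeholders:

```ruby
# Build start_transcription_job parameters that let Transcribe detect
# the language instead of requiring a fixed language_code.
def language_detection_params(job_name, audio_uri, format: 'mp3')
  {
    transcription_job_name: job_name,
    media_format: format,
    media: { media_file_uri: audio_uri },
    identify_language: true,
    # Optional: constrain detection to a set of expected languages
    language_options: %w[en-US es-ES fr-FR]
  }
end

# client.start_transcription_job(language_detection_params(job_name, audio_uri))
```

Note that identify_language and language_code are mutually exclusive: pass one or the other, never both.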
Cost Considerations
Amazon Transcribe charges per second of audio processed. As of early 2025, standard batch transcription costs $0.024 per minute ($1.44 per hour) for the first 250,000 minutes per month, dropping to $0.015 per minute after that. Here are practical ways to control costs:
Set an output bucket. Store results in your own S3 bucket so you can re-read them without re-running the job:
client.start_transcription_job(
transcription_job_name: job_name,
language_code: 'en-US',
media_format: 'mp3',
media: { media_file_uri: audio_uri },
output_bucket_name: 'my-transcripts-bucket'
)
Use mono audio at 16kHz. Stereo audio does not improve transcription accuracy for single-speaker recordings. Downsampling with FFmpeg before upload saves storage and processing time:
ffmpeg -i input.wav -ac 1 -ar 16000 output.wav
Clean up completed jobs. Transcribe retains job metadata (not your audio) for a limited time even after completion. Delete old jobs to keep your account tidy:
client.delete_transcription_job(transcription_job_name: job_name)
Cache transcription results. If you process the same audio files repeatedly (e.g., during development), store the result JSON locally or in a database and skip the API call on subsequent runs.
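A simple disk cache keyed by S3 URI is often enough during development. This sketch wraps any transcription call in a block; the helper name and cache directory are illustrative choices, not part of the AWS SDK:

```ruby
require 'json'
require 'digest'

# Cache transcription results on disk, keyed by a hash of the S3 URI,
# so repeated runs skip the Transcribe API entirely.
def cached_transcription(s3_uri, cache_dir: '.transcripts')
  Dir.mkdir(cache_dir) unless Dir.exist?(cache_dir)
  cache_file = File.join(cache_dir, "#{Digest::SHA256.hexdigest(s3_uri)}.json")
  return JSON.parse(File.read(cache_file)) if File.exist?(cache_file)

  result = yield(s3_uri)                    # only called on a cache miss
  File.write(cache_file, JSON.generate(result))
  result
end

# Usage with the AudioTranscriber class above:
# result = cached_transcription(s3_uri) { |uri| AudioTranscriber.new.transcribe(uri) }
```

One caveat: after the JSON round trip, a cache hit returns string keys rather than the symbol keys the class produces, so normalize keys if your downstream code cares.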
Speaker Diarization
For meetings or interviews with multiple speakers, enable speaker identification:
client.start_transcription_job(
transcription_job_name: job_name,
language_code: 'en-US',
media_format: 'mp3',
media: { media_file_uri: audio_uri },
settings: {
show_speaker_labels: true,
max_speaker_labels: 4
}
)
The results JSON will include a speaker_labels section with segments that map time ranges to speaker identifiers (spk_0, spk_1, etc.). Set max_speaker_labels to the expected number of speakers for best accuracy.
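Extracting who spoke when from that section can be sketched as below. The hash shape assumed here (results → speaker_labels → segments, each with start_time, end_time, and speaker_label) follows the documented output format, but verify it against your own results JSON:

```ruby
# Summarize speaker turns from a parsed transcription result hash.
def speaker_segments(result)
  segments = result.dig('results', 'speaker_labels', 'segments') || []
  segments.map do |seg|
    {
      speaker: seg['speaker_label'],
      start_time: seg['start_time'].to_f,
      end_time: seg['end_time'].to_f
    }
  end
end

# speaker_segments(result).each do |s|
#   puts "#{s[:speaker]}: #{s[:start_time]}s - #{s[:end_time]}s"
# end
```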
Next Steps
You now have a working Ruby pipeline for audio transcription. From here, consider adding:
- Custom vocabularies for domain-specific terms (medical, legal, technical jargon) that Transcribe might otherwise miss. Combining transcription with machine learning in Ruby opens up further analysis possibilities.
- Content redaction to automatically mask PII like names, addresses, and credit card numbers
- Real-time streaming via WebSocket for live audio transcription
- Vocabulary filters to automatically mask or remove profanity
Check the AWS Transcribe documentation for the full API reference and feature details.