Ruby Audio Transcription with Amazon Transcribe

Need to convert audio files to text programmatically? Amazon Transcribe is a managed speech-to-text service that handles the machine learning complexity for you. This guide walks through a complete Ruby integration: from IAM setup and S3 uploads to polling for results, parsing the output JSON, handling errors, and keeping costs under control.

Prerequisites

You need an AWS account with billing enabled. Amazon Transcribe requires audio files to live in S3, so you will work with two AWS services: S3 for storage and Transcribe for the actual speech-to-text conversion.

You will also need the AWS SDK gems for S3 and Transcribe. (If you are new to packaging and distributing Ruby libraries, our guide to building your first Ruby gem covers the fundamentals.) Install both gems:

gem install aws-sdk-transcribeservice aws-sdk-s3

Or add them to your Gemfile:

gem 'aws-sdk-transcribeservice'
gem 'aws-sdk-s3'

IAM Policy Setup

Before writing any Ruby code, your IAM user (or role) needs the right permissions. Create a policy with these minimal permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "transcribe:StartTranscriptionJob",
        "transcribe:GetTranscriptionJob",
        "transcribe:DeleteTranscriptionJob",
        "transcribe:ListTranscriptionJobs"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-audio-bucket",
        "arn:aws:s3:::your-audio-bucket/*"
      ]
    }
  ]
}

Attach this policy to your IAM user via the AWS Console or CLI. The S3 permissions are required because Transcribe reads audio directly from your bucket.

Configure AWS Credentials

Set your credentials via environment variables:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_REGION=us-east-1

Or create ~/.aws/credentials:

[default]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_key
region = us-east-1

The SDK picks these up automatically. In production, prefer IAM roles attached to your EC2 instance or ECS task rather than long-lived access keys.
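Since a missing or blank variable only surfaces later as an authentication error, it can help to fail fast at startup. A minimal sketch (the missing_aws_env helper is illustrative, not part of the SDK):

```ruby
# Illustrative helper: returns the names of required AWS environment
# variables that are unset or blank, so you can fail fast with a clear error.
REQUIRED_AWS_VARS = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_REGION].freeze

def missing_aws_env(env = ENV)
  REQUIRED_AWS_VARS.select { |var| env[var].to_s.strip.empty? }
end

missing = missing_aws_env
warn "Missing AWS config: #{missing.join(', ')}" unless missing.empty?
```

If you rely on ~/.aws/credentials or an IAM role instead of environment variables, skip this check; the SDK's default credential chain will find them.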

Upload Audio to S3

Transcribe cannot process local files directly. You need to upload them to S3 first. Here is a helper that uploads a local file and returns the S3 URI:

require 'aws-sdk-s3'

def upload_audio(file_path, bucket:, region: 'us-east-1')
  s3 = Aws::S3::Client.new(region: region)
  key = "audio/#{File.basename(file_path)}"

  File.open(file_path, 'rb') do |file|
    s3.put_object(bucket: bucket, key: key, body: file)
  end

  "s3://#{bucket}/#{key}"
end

s3_uri = upload_audio('/tmp/interview.mp3', bucket: 'my-audio-bucket')
puts s3_uri # => s3://my-audio-bucket/audio/interview.mp3

Supported audio formats include MP3, MP4, WAV, FLAC, OGG, AMR, and WebM. WAV and FLAC tend to produce better results because they are lossless, but MP3 works fine for most use cases.

Start a Transcription Job

Here is a minimal working example that kicks off a transcription:

require 'aws-sdk-transcribeservice'

client = Aws::TranscribeService::Client.new(region: 'us-east-1')

job_name = "transcription-#{Time.now.to_i}"
audio_uri = 's3://your-bucket/audio.mp3'

client.start_transcription_job(
  transcription_job_name: job_name,
  language_code: 'en-US',
  media_format: 'mp3',
  media: { media_file_uri: audio_uri }
)

puts "Job started: #{job_name}"

Job names must be unique within your account and region. Using a timestamp or SecureRandom.hex avoids collisions. The job runs asynchronously, so start_transcription_job returns immediately and you need to poll for completion.

Poll for Results

The job typically takes 20-50% of the audio duration to complete. A 10-minute recording might finish in 2-5 minutes. Poll with exponential backoff to avoid hammering the API:

def wait_for_transcription(client, job_name, timeout: 600)
  start_time = Time.now
  delay = 5

  loop do
    response = client.get_transcription_job(transcription_job_name: job_name)
    status = response.transcription_job.transcription_job_status

    case status
    when 'COMPLETED'
      return response.transcription_job
    when 'FAILED'
      raise "Transcription failed: #{response.transcription_job.failure_reason}"
    end

    if Time.now - start_time > timeout
      raise "Timeout waiting for transcription after #{timeout}s"
    end

    sleep delay
    delay = [delay * 1.5, 30].min
  end
end

job = wait_for_transcription(client, job_name)

The backoff starts at 5 seconds, grows by 50% per attempt, and caps at 30 seconds. This keeps you well under Transcribe's throttling limits for GetTranscriptionJob while still detecting completion promptly.

Parse the Transcription Output

When a job completes, Transcribe exposes the result as a JSON file at a pre-signed URL that expires after a short time (specify your own output bucket if you need durable access). The JSON structure contains more than just the raw text. It includes word-level timestamps and confidence scores:

require 'net/http'
require 'json'

def fetch_transcript(job)
  uri = URI(job.transcript.transcript_file_uri)
  response = Net::HTTP.get(uri)
  JSON.parse(response)
end

result = fetch_transcript(job)

full_text = result['results']['transcripts'].first['transcript']
puts full_text

For more granular access, iterate over individual words with their timestamps and confidence:

def extract_words_with_timestamps(result)
  items = result['results']['items']

  items.filter_map do |item|
    next unless item['type'] == 'pronunciation'

    {
      word: item['alternatives'].first['content'],
      confidence: item['alternatives'].first['confidence'].to_f,
      start_time: item['start_time'].to_f,
      end_time: item['end_time'].to_f
    }
  end
end

words = extract_words_with_timestamps(result)
words.each do |w|
  puts "#{w[:start_time]}s - #{w[:word]} (#{(w[:confidence] * 100).round(1)}%)"
end

This is useful for building subtitle files, highlighting low-confidence words for manual review, or syncing text with audio playback. For an alternative approach using Google Cloud, see our guide on speech recognition with Ruby.
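As a sketch of the subtitle use case, here is one way to turn the word list into SRT cues. Everything here (the srt_timestamp and words_to_srt helpers, the five-words-per-cue grouping) is illustrative, not part of any API:

```ruby
# Illustrative: format seconds as an SRT timestamp (HH:MM:SS,mmm).
def srt_timestamp(seconds)
  millis = (seconds * 1000).round
  format('%02d:%02d:%02d,%03d',
         millis / 3_600_000, (millis / 60_000) % 60,
         (millis / 1000) % 60, millis % 1000)
end

# Illustrative: group words into fixed-size cues and emit numbered SRT blocks.
def words_to_srt(words, per_cue: 5)
  words.each_slice(per_cue).with_index(1).map do |group, index|
    start_ts = srt_timestamp(group.first[:start_time])
    end_ts = srt_timestamp(group.last[:end_time])
    text = group.map { |w| w[:word] }.join(' ')
    "#{index}\n#{start_ts} --> #{end_ts}\n#{text}\n"
  end.join("\n")
end
```

Feeding it the words array from extract_words_with_timestamps yields a string you can write straight to a .srt file. A production version would break cues on pauses or punctuation rather than a fixed word count.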

Complete Reusable Class

Here is a production-ready class that ties everything together:

require 'aws-sdk-transcribeservice'
require 'aws-sdk-s3'
require 'net/http'
require 'json'
require 'securerandom'

class AudioTranscriber
  def initialize(region: 'us-east-1')
    @region = region
    @client = Aws::TranscribeService::Client.new(region: region)
    @s3 = Aws::S3::Client.new(region: region)
  end

  def transcribe(s3_uri, language: 'en-US', output_bucket: nil)
    job_name = "job-#{SecureRandom.hex(8)}"
    format = detect_format(s3_uri)

    params = {
      transcription_job_name: job_name,
      language_code: language,
      media_format: format,
      media: { media_file_uri: s3_uri }
    }
    params[:output_bucket_name] = output_bucket if output_bucket

    @client.start_transcription_job(params)
    job = wait_for_completion(job_name)
    fetch_result(job)
  end

  def upload_and_transcribe(file_path, bucket:, language: 'en-US')
    key = "audio/#{SecureRandom.hex(4)}-#{File.basename(file_path)}"

    File.open(file_path, 'rb') do |file|
      @s3.put_object(bucket: bucket, key: key, body: file)
    end

    transcribe("s3://#{bucket}/#{key}", language: language)
  end

  private

  def detect_format(uri)
    ext = File.extname(uri).delete('.').downcase
    %w[mp3 mp4 wav flac ogg amr webm].include?(ext) ? ext : 'mp3'
  end

  def wait_for_completion(job_name, timeout: 600)
    deadline = Time.now + timeout
    delay = 5

    loop do
      resp = @client.get_transcription_job(transcription_job_name: job_name)
      job = resp.transcription_job

      return job if job.transcription_job_status == 'COMPLETED'
      raise "Failed: #{job.failure_reason}" if job.transcription_job_status == 'FAILED'
      raise "Timeout after #{timeout}s" if Time.now > deadline

      sleep delay
      delay = [delay * 1.5, 30].min
    end
  end

  def fetch_result(job)
    uri = URI(job.transcript.transcript_file_uri)
    data = JSON.parse(Net::HTTP.get(uri))

    {
      text: data['results']['transcripts'].first['transcript'],
      items: data['results']['items'],
      raw: data
    }
  end
end

transcriber = AudioTranscriber.new

result = transcriber.upload_and_transcribe(
  '/tmp/meeting.mp3',
  bucket: 'my-audio-bucket'
)
puts result[:text]

The upload_and_transcribe method handles the full pipeline: upload to S3, start the job, poll, and return parsed results. The returned hash gives you the full text, individual items with timestamps, and the raw JSON if you need it. One caveat: if you pass output_bucket, the transcript URI points to an object in your own (typically private) bucket, so a plain Net::HTTP.get will be denied; fetch it with the S3 client's get_object instead.

Error Handling

Several things can go wrong: invalid audio formats, permissions issues, rate limits, or network failures. Handle them explicitly:

def safe_transcribe(s3_uri, retries: 2)
  transcriber = AudioTranscriber.new
  transcriber.transcribe(s3_uri)
rescue Aws::TranscribeService::Errors::BadRequestException => e
  puts "Invalid request: #{e.message}"
  nil
rescue Aws::TranscribeService::Errors::LimitExceededException => e
  if retries > 0
    sleep 10
    safe_transcribe(s3_uri, retries: retries - 1)
  else
    puts "Rate limited after retries: #{e.message}"
    nil
  end
rescue Aws::TranscribeService::Errors::ConflictException => e
  puts "Job name conflict: #{e.message}"
  nil
rescue Aws::S3::Errors::AccessDenied => e
  puts "S3 access denied. Check IAM permissions: #{e.message}"
  nil
rescue StandardError => e
  puts "Transcription error: #{e.message}"
  nil
end

The most common failure is BadRequestException caused by unsupported audio formats or an S3 URI that Transcribe cannot access. Double-check that your IAM policy grants Transcribe read access to the bucket.
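A cheap pre-flight check can catch malformed URIs before they ever reach the API. The valid_s3_uri? helper below is illustrative; its regex approximates S3 bucket naming rules (3-63 characters of lowercase letters, digits, dots, and hyphens) rather than implementing them exhaustively:

```ruby
# Illustrative pre-flight check: does this string look like a usable s3:// URI
# with both a bucket and a key? Approximates S3 bucket naming rules.
def valid_s3_uri?(uri)
  uri.to_s.match?(%r{\As3://[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]/.+\z})
end
```

Rejecting bad input locally gives a clearer error message than a generic BadRequestException and saves a round trip.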

Supported Languages

Amazon Transcribe supports over 100 languages and dialects. Pass the correct language code when starting a job:

'en-US'  # English (US)
'en-GB'  # English (UK)
'es-ES'  # Spanish
'fr-FR'  # French
'de-DE'  # German
'pt-BR'  # Portuguese (Brazil)
'ja-JP'  # Japanese
'zh-CN'  # Chinese (Mandarin)

If you are unsure of the language, you can enable automatic language identification by replacing language_code with identify_language: true. Transcribe will detect the dominant language in the audio.
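As a sketch, you could wrap the two modes in a small parameter builder. The transcription_params helper is illustrative; language_code and identify_language are the real API parameters, which are mutually exclusive:

```ruby
# Illustrative helper: build StartTranscriptionJob parameters with either a
# fixed language code or automatic language identification.
def transcription_params(job_name, audio_uri, language: nil)
  params = {
    transcription_job_name: job_name,
    media_format: 'mp3',
    media: { media_file_uri: audio_uri }
  }
  if language
    params[:language_code] = language
  else
    params[:identify_language] = true # let Transcribe detect the dominant language
  end
  params
end

# client.start_transcription_job(transcription_params('job-1', 's3://bucket/audio.mp3'))
```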

Cost Considerations

Amazon Transcribe charges per second of audio processed. As of early 2025, standard batch transcription costs $0.024 per minute ($1.44 per hour) for the first 250,000 minutes per month, dropping to $0.015 per minute after that. Here are practical ways to control costs:

Set an output bucket. Store results in your own S3 bucket so you can re-read them without re-running the job:

client.start_transcription_job(
  transcription_job_name: job_name,
  language_code: 'en-US',
  media_format: 'mp3',
  media: { media_file_uri: audio_uri },
  output_bucket_name: 'my-transcripts-bucket'
)

Use mono audio at 16 kHz. Stereo audio does not improve transcription accuracy for single-speaker recordings. Downsampling with FFmpeg before upload saves storage and processing time:

ffmpeg -i input.wav -ac 1 -ar 16000 output.wav
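If downsampling belongs inside your Ruby pipeline, you can shell out to FFmpeg before uploading. This sketch assumes ffmpeg is installed and on PATH; the downsample_args and downsample! helper names are illustrative:

```ruby
# Illustrative: build the FFmpeg argument list for mono / 16 kHz conversion.
# -y overwrites the output file if it already exists.
def downsample_args(input, output)
  ['ffmpeg', '-y', '-i', input, '-ac', '1', '-ar', '16000', output]
end

# Using the argv form of system avoids shell-escaping issues with file names.
def downsample!(input, output)
  system(*downsample_args(input, output)) or raise "ffmpeg failed for #{input}"
end
```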

Clean up completed jobs. Transcribe stores job metadata (not audio) indefinitely. Delete old jobs to keep your account tidy:

client.delete_transcription_job(transcription_job_name: job_name)

Cache transcription results. If you process the same audio files repeatedly (e.g., during development), store the result JSON locally or in a database and skip the API call on subsequent runs.
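A minimal local cache keyed by the audio file's SHA-256 digest might look like this (the TranscriptCache class and its cache directory are illustrative, not part of any library):

```ruby
require 'digest'
require 'json'
require 'fileutils'

# Illustrative file-based cache: results are stored as JSON on disk, keyed
# by the SHA-256 digest of the audio file's contents, so renaming a file
# does not invalidate its cached transcript.
class TranscriptCache
  def initialize(dir = '.transcript_cache')
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def fetch(audio_path)
    key = Digest::SHA256.file(audio_path).hexdigest
    path = File.join(@dir, "#{key}.json")
    # Note: the JSON round trip means cached results come back with string keys.
    return JSON.parse(File.read(path)) if File.exist?(path)

    result = yield # run the real transcription only on a cache miss
    File.write(path, JSON.generate(result))
    result
  end
end

# cache = TranscriptCache.new
# cache.fetch('/tmp/meeting.mp3') { transcriber.upload_and_transcribe('/tmp/meeting.mp3', bucket: 'my-audio-bucket') }
```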

Speaker Diarization

For meetings or interviews with multiple speakers, enable speaker identification:

client.start_transcription_job(
  transcription_job_name: job_name,
  language_code: 'en-US',
  media_format: 'mp3',
  media: { media_file_uri: audio_uri },
  settings: {
    show_speaker_labels: true,
    max_speaker_labels: 4
  }
)

The results JSON will include a speaker_labels section with segments that map time ranges to speaker identifiers (spk_0, spk_1, etc.). Set max_speaker_labels to the expected number of speakers for best accuracy.
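To illustrate, here is a sketch that reduces those segments to an ordered list of speaker turns. The sample hash mirrors the shape of the speaker_labels section in abbreviated form; the speaker_turns helper is illustrative:

```ruby
# Illustrative: map diarization segments to an ordered list of speaker turns.
def speaker_turns(speaker_labels)
  speaker_labels['segments'].map do |segment|
    {
      speaker: segment['speaker_label'],
      start_time: segment['start_time'].to_f,
      end_time: segment['end_time'].to_f
    }
  end
end

# Abbreviated sample of the speaker_labels structure:
sample = {
  'speakers' => 2,
  'segments' => [
    { 'speaker_label' => 'spk_0', 'start_time' => '0.0', 'end_time' => '4.2' },
    { 'speaker_label' => 'spk_1', 'start_time' => '4.5', 'end_time' => '9.1' }
  ]
}

speaker_turns(sample).each do |turn|
  puts "#{turn[:speaker]}: #{turn[:start_time]}s-#{turn[:end_time]}s"
end
```

To attribute actual words to speakers, join these time ranges against the word-level items from extract_words_with_timestamps.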

Next Steps

You now have a working Ruby pipeline for audio transcription. From here, consider adding:

  • Custom vocabularies for domain-specific terms (medical, legal, technical jargon) that Transcribe might otherwise miss; pairing transcription with machine learning in Ruby opens up further analysis possibilities
  • Content redaction to automatically mask PII like names, addresses, and credit card numbers
  • Real-time streaming via WebSocket for live audio transcription
  • Vocabulary filters to automatically mask or remove profanity

Check the AWS Transcribe documentation for the full API reference and feature details.