Want to add speech recognition to your Ruby app? Google Cloud Speech API makes it straightforward. Let me show you how to set it up, handle edge cases, and get production-quality transcriptions working.
Why Ruby for Speech Recognition?
Ruby's clean syntax and rich gem ecosystem make it a solid choice for integrating with speech APIs. You won't be wrestling with verbose boilerplate code. Instead, you get readable implementations that are easy to maintain. If you are new to Ruby's ML ecosystem, our practical guide to integrating machine learning with Ruby provides a solid foundation. The google-cloud-speech gem wraps Google's gRPC API with an idiomatic Ruby interface, so you spend time on your application logic rather than serialization details.
Setting Up Google Cloud Speech API
Before writing any code, you need to configure your Google Cloud project:
- Create a project in Google Cloud Console
- Enable the Speech-to-Text API from the API library
- Create a service account and download the JSON credentials file
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your credentials file:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
```

Add the gem to your Gemfile:

```ruby
gem 'google-cloud-speech', '~> 1.4'
```

Run bundle install and you're ready to code.
Basic Transcription Example
Here's a working example that transcribes an audio file:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
audio_file_path = "path/to/audio/file.wav"
audio = { uri: audio_file_path }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: "en-US"
}
response = speech.recognize config: config, audio: audio
results = response.results
alternatives = results.first.alternatives
alternatives.each do |alternative|
puts "Transcription: #{alternative.transcript}"
puts "Confidence: #{(alternative.confidence * 100).round(1)}%"
endLet's break down what's happening:
Google::Cloud::Speech.speechcreates a client instance using your credentials from the environment variableaudiopoints to your audio file. This can be a local path or ags://URI for files stored in Google Cloud Storageconfigspecifies the audio format. LINEAR16 is standard uncompressed WAVresponse.resultsreturns transcription alternatives ranked by confidence score
Transcribing Local Files with Binary Content
When your audio file lives on disk rather than in Cloud Storage, pass the raw bytes directly:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
file_path = "recording.wav"
audio_content = File.binread(file_path)
audio = { content: audio_content }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: "en-US",
enable_automatic_punctuation: true
}
response = speech.recognize config: config, audio: audio
response.results.each do |result|
puts result.alternatives.first.transcript
endThe enable_automatic_punctuation option tells Google to insert periods, commas, and question marks. It makes raw transcriptions much more readable without any post-processing on your end.
Audio Format Considerations
The API is picky about audio formats. Here's what works best:
| Format | Encoding Symbol | Best For |
|---|---|---|
| WAV (PCM) | :LINEAR16 | Highest accuracy, uncompressed |
| FLAC | :FLAC | Good compression without quality loss |
| OGG Opus | :OGG_OPUS | Web and streaming audio |
| MP3 | :MP3 | Widely available but may reduce accuracy |
| AMR | :AMR | Phone call recordings |
Sample rate matters too. Most speech recordings work well at 16,000 Hz. Phone audio typically uses 8,000 Hz. Always match your config to your actual audio -- a mismatch will silently degrade transcription quality.
If you are unsure about the format of a WAV file, you can inspect it with the wavefile gem:
require "wavefile"
reader = WaveFile::Reader.new("recording.wav")
format = reader.native_format
puts "Channels: #{format.channels}"
puts "Sample rate: #{format.sample_rate}"
puts "Bits/sample: #{format.bits_per_sample}"
reader.closeUse those values directly in your Speech API config to avoid format mismatches.
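Continuing from the reader above, you can feed those values straight into the config. A minimal sketch, assuming the file is 16-bit PCM so :LINEAR16 applies (audio_channel_count matters for stereo recordings):

```ruby
# Derive the Speech API config from the file's actual properties
# instead of hard-coding values that may not match.
config = {
  encoding: :LINEAR16,                   # assumes 16-bit PCM WAV
  sample_rate_hertz: format.sample_rate,
  audio_channel_count: format.channels,  # required when the file is stereo
  language_code: "en-US"
}
```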
Handling Long Audio Files
The synchronous recognize method has a 60-second limit. For longer recordings, use long_running_recognize. This submits the job and returns an operation you can wait on or poll for results:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
# Long audio must be in Cloud Storage
audio = { uri: "gs://your-bucket/meeting-recording.flac" }
config = {
encoding: :FLAC,
sample_rate_hertz: 44_100,
language_code: "en-US",
enable_automatic_punctuation: true,
enable_word_time_offsets: true
}
operation = speech.long_running_recognize config: config, audio: audio
puts "Processing... this may take a while."
operation.wait_until_done!
if operation.error?
puts "Error: #{operation.error.message}"
else
operation.response.results.each do |result|
alt = result.alternatives.first
puts alt.transcript
# Word-level timestamps
alt.words.each do |word_info|
start_time = word_info.start_time.seconds + word_info.start_time.nanos / 1_000_000_000.0
puts " '#{word_info.word}' at #{start_time.round(2)}s"
end
end
endTwo things to note here. First, enable_word_time_offsets gives you the start and end time for every word -- useful for generating subtitles or syncing text with video. Second, long audio files must be stored in Google Cloud Storage. The API won't accept inline binary content for files over 60 seconds.
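To make the subtitle idea concrete, here's a rough sketch that turns those word offsets into SRT-style cues. The five-words-per-cue chunking and the srt_timestamp helper are my own choices, not part of the gem:

```ruby
# Convert a protobuf Duration (seconds + nanos) to an SRT timestamp: HH:MM:SS,mmm
def srt_timestamp(duration)
  total = duration.seconds + duration.nanos / 1_000_000_000.0
  millis = ((total % 1) * 1000).round
  format("%02d:%02d:%02d,%03d", total / 3600, (total % 3600) / 60, total % 60, millis)
end

cue_number = 0
operation.response.results.each do |result|
  result.alternatives.first.words.each_slice(5) do |cue_words|
    cue_number += 1
    puts cue_number
    puts "#{srt_timestamp(cue_words.first.start_time)} --> #{srt_timestamp(cue_words.last.end_time)}"
    puts cue_words.map(&:word).join(" ")
    puts
  end
end
```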
Error Handling
Production code needs to handle API failures gracefully. Network issues, invalid audio, and quota limits all happen:
require "google/cloud/speech"
def transcribe(file_path, language: "en-US")
speech = Google::Cloud::Speech.speech
audio = { content: File.binread(file_path) }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: language,
enable_automatic_punctuation: true
}
response = speech.recognize config: config, audio: audio
if response.results.empty?
{ success: false, error: "No speech detected in audio" }
else
transcript = response.results.map { |r| r.alternatives.first.transcript }.join(" ")
confidence = response.results.map { |r| r.alternatives.first.confidence }.min
{ success: true, transcript: transcript, confidence: confidence }
end
rescue Google::Cloud::InvalidArgumentError => e
{ success: false, error: "Invalid audio format: #{e.message}" }
rescue Google::Cloud::ResourceExhaustedError
{ success: false, error: "API quota exceeded. Try again later." }
rescue Google::Cloud::Error => e
{ success: false, error: "API error: #{e.message}" }
end
result = transcribe("interview.wav")
if result[:success]
puts result[:transcript]
puts "Confidence: #{(result[:confidence] * 100).round(1)}%"
else
puts "Failed: #{result[:error]}"
endThe key errors to watch for are InvalidArgumentError (wrong encoding or sample rate), ResourceExhaustedError (you hit your quota), and generic Google::Cloud::Error for network and server issues.
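Note that transcribe swallows the quota error and returns a hash, so retries belong around the raw API call. Here's a sketch of exponential backoff, where with_backoff is a hypothetical helper rather than anything the gem provides:

```ruby
require "google/cloud/speech"

# Hypothetical helper: retry the block with exponential backoff
# whenever the API reports an exhausted quota.
def with_backoff(max_attempts: 4, base_delay: 2)
  attempts = 0
  begin
    yield
  rescue Google::Cloud::ResourceExhaustedError
    attempts += 1
    raise if attempts >= max_attempts
    sleep base_delay**attempts  # waits 2s, 4s, 8s between attempts
    retry
  end
end

speech = Google::Cloud::Speech.speech
audio = { content: File.binread("interview.wav") }
config = { encoding: :LINEAR16, sample_rate_hertz: 16_000, language_code: "en-US" }

response = with_backoff { speech.recognize config: config, audio: audio }
```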
Multi-Language and Speaker Detection
If your audio contains multiple languages or speakers, the API can help:
```ruby
config = {
  encoding: :LINEAR16,
  sample_rate_hertz: 16_000,
  language_code: "en-US",
  alternative_language_codes: ["es-ES", "fr-FR"],
  diarization_config: {
    enable_speaker_diarization: true,
    min_speaker_count: 2,
    max_speaker_count: 4
  }
}
```

Speaker diarization labels each word with a speaker tag, so you can reconstruct who said what. This is valuable for meeting transcriptions or interview processing.
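With diarization enabled, the words in the response carry speaker_tag values, and the last result aggregates all the words from the audio. A rough sketch of reconstructing speaker turns from a recognize response made with the config above:

```ruby
# The final result contains every word with its speaker_tag attached.
words = response.results.last.alternatives.first.words

# Group consecutive words spoken by the same speaker into turns.
words.chunk_while { |a, b| a.speaker_tag == b.speaker_tag }.each do |turn|
  puts "Speaker #{turn.first.speaker_tag}: #{turn.map(&:word).join(' ')}"
end
```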
Handling Real-World Challenges
Speech recognition isn't perfect. Here are common issues and practical workarounds:
Accents and dialects -- Use the appropriate language code. Google supports regional variants like en-GB, en-AU, and en-IN. You can also pass speech_contexts with phrases the API should bias toward, which helps with domain-specific vocabulary:
```ruby
config = {
  encoding: :LINEAR16,
  sample_rate_hertz: 16_000,
  language_code: "en-US",
  speech_contexts: [{ phrases: ["Ruby on Rails", "ActiveRecord", "Sidekiq", "Puma"] }]
}
```

This is especially useful for technical content where the default model might misinterpret jargon.
Background noise -- Clean audio produces better results. Consider preprocessing with SoX before sending to the API:
```bash
# Build a noise profile first (assumes the first 0.5s contains only background noise)
sox noisy.wav -n trim 0 0.5 noiseprof noise-profile
# Then apply noise reduction using that profile
sox noisy.wav clean.wav noisered noise-profile 0.21
```

Empty results -- If the API returns no results, check three things: the audio encoding matches your config, the sample rate is correct, and the audio actually contains speech in the specified language.
Privacy and Security
Voice data is sensitive. A few things to keep in mind:
- Audio is transmitted to Google's servers for processing. Use HTTPS (the gem handles this by default)
- Enable data logging controls in your Cloud project settings
- Consider on-premise solutions if you handle HIPAA-protected or similarly regulated data
- Delete audio files after processing if you don't need them (one way is sketched after this list)
- Rotate your service account credentials periodically
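For the deletion point, one way to guarantee cleanup (a sketch reusing the transcribe helper from the error-handling section) is an ensure block:

```ruby
# Transcribe, then remove the recording even if transcription raises.
def transcribe_and_delete(file_path)
  transcribe(file_path)
ensure
  File.delete(file_path) if File.exist?(file_path)
end
```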
What Can You Build?
Once you have transcription working, the possibilities open up:
- Voice commands -- Build a CLI tool that accepts spoken input for hands-free operation
- Transcription service -- Process podcast episodes or lecture recordings into searchable text. For an AWS-based alternative, see automating audio transcription with Ruby and Amazon Transcribe
- Call analytics -- Analyze customer service calls for sentiment and keyword extraction. You could pair this with a Bayesian text classifier built in Ruby for automatic categorization
- Accessibility -- Generate real-time captions for deaf or hard-of-hearing users
- Meeting notes -- Combine speaker diarization with transcription to produce structured meeting summaries
Wrapping Up
Speech recognition in Ruby is practical and production-ready. The google-cloud-speech gem handles authentication, audio encoding, and gRPC communication. You get to focus on what matters: building features your users want. Start with the synchronous recognize method for short audio, graduate to long_running_recognize for longer files, and add error handling before shipping to production.