Want to add speech recognition to your Ruby app? Google Cloud Speech API makes it straightforward. Let me show you how to set it up, handle edge cases, and get production-quality transcriptions working.
Why Ruby for Speech Recognition?
Ruby's clean syntax and rich gem ecosystem make it a solid choice for integrating with speech APIs. You won't be wrestling with verbose boilerplate code. Instead, you get readable implementations that are easy to maintain. If you are new to Ruby's ML ecosystem, our practical guide to integrating machine learning with Ruby provides a solid foundation. The google-cloud-speech gem wraps Google's gRPC API with an idiomatic Ruby interface, so you spend time on your application logic rather than serialization details.
Setting Up Google Cloud Speech API
Before writing any code, you need to configure your Google Cloud project:
- Create a project in Google Cloud Console
- Enable the Speech-to-Text API from the API library
- Create a service account and download the JSON credentials file
- Set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your credentials file:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/credentials.json"
```

Add the gem to your Gemfile:

```ruby
gem 'google-cloud-speech', '~> 1.4'
```

Run bundle install and you're ready to code.
Basic Transcription Example
Here's a working example that transcribes an audio file:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
audio_file_path = "path/to/audio/file.wav"
audio = { uri: audio_file_path }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: "en-US"
}
response = speech.recognize config: config, audio: audio
results = response.results
alternatives = results.first.alternatives
alternatives.each do |alternative|
puts "Transcription: #{alternative.transcript}"
puts "Confidence: #{(alternative.confidence * 100).round(1)}%"
endLet's break down what's happening:
Google::Cloud::Speech.speechcreates a client instance using your credentials from the environment variableaudiopoints to your audio file. This can be a local path or ags://URI for files stored in Google Cloud Storageconfigspecifies the audio format. LINEAR16 is standard uncompressed WAVresponse.resultsreturns transcription alternatives ranked by confidence score
Transcribing Local Files with Binary Content
When your audio file lives on disk rather than in Cloud Storage, pass the raw bytes directly:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
file_path = "recording.wav"
audio_content = File.binread(file_path)
audio = { content: audio_content }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: "en-US",
enable_automatic_punctuation: true
}
response = speech.recognize config: config, audio: audio
response.results.each do |result|
puts result.alternatives.first.transcript
endThe enable_automatic_punctuation option tells Google to insert periods, commas, and question marks. It makes raw transcriptions much more readable without any post-processing on your end.
Audio Format Considerations
The API is picky about audio formats. Here's what works best:
| Format | Encoding Symbol | Best For |
|---|---|---|
| WAV (PCM) | :LINEAR16 | Highest accuracy, uncompressed |
| FLAC | :FLAC | Good compression without quality loss |
| OGG Opus | :OGG_OPUS | Web and streaming audio |
| MP3 | :MP3 | Widely available but may reduce accuracy |
| AMR | :AMR | Phone call recordings |
Sample rate matters too. Most speech recordings work well at 16,000 Hz. Phone audio typically uses 8,000 Hz. Always match your config to your actual audio -- a mismatch will silently degrade transcription quality.
If you are unsure about the format of a WAV file, you can inspect it with the wavefile gem:
require "wavefile"
reader = WaveFile::Reader.new("recording.wav")
format = reader.native_format
puts "Channels: #{format.channels}"
puts "Sample rate: #{format.sample_rate}"
puts "Bits/sample: #{format.bits_per_sample}"
reader.closeUse those values directly in your Speech API config to avoid format mismatches.
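Continuing from the reader above, you can feed those values straight into the config. A minimal sketch, assuming the file is 16-bit PCM so :LINEAR16 applies (audio_channel_count matters for stereo recordings):

```ruby
# Derive the Speech API config from the file's actual properties
# instead of hard-coding values that may not match.
config = {
  encoding: :LINEAR16,                   # assumes 16-bit PCM WAV
  sample_rate_hertz: format.sample_rate,
  audio_channel_count: format.channels,  # required when the file is stereo
  language_code: "en-US"
}
```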
Handling Long Audio Files
The synchronous recognize method has a 60-second limit. For longer recordings, use long_running_recognize. This submits the job and returns an operation you can wait on or poll for results:
require "google/cloud/speech"
speech = Google::Cloud::Speech.speech
# Long audio must be in Cloud Storage
audio = { uri: "gs://your-bucket/meeting-recording.flac" }
config = {
encoding: :FLAC,
sample_rate_hertz: 44_100,
language_code: "en-US",
enable_automatic_punctuation: true,
enable_word_time_offsets: true
}
operation = speech.long_running_recognize config: config, audio: audio
puts "Processing... this may take a while."
operation.wait_until_done!
if operation.error?
puts "Error: #{operation.error.message}"
else
operation.response.results.each do |result|
alt = result.alternatives.first
puts alt.transcript
# Word-level timestamps
alt.words.each do |word_info|
start_time = word_info.start_time.seconds + word_info.start_time.nanos / 1_000_000_000.0
puts " '#{word_info.word}' at #{start_time.round(2)}s"
end
end
endTwo things to note here. First, enable_word_time_offsets gives you the start and end time for every word -- useful for generating subtitles or syncing text with video. Second, long audio files must be stored in Google Cloud Storage. The API won't accept inline binary content for files over 60 seconds.
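To make the subtitle idea concrete, here's a rough sketch that turns those word offsets into SRT-style cues. The five-words-per-cue chunking and the srt_timestamp helper are my own choices, not part of the gem:

```ruby
# Convert a protobuf Duration (seconds + nanos) to an SRT timestamp: HH:MM:SS,mmm
def srt_timestamp(duration)
  total = duration.seconds + duration.nanos / 1_000_000_000.0
  millis = ((total % 1) * 1000).round
  format("%02d:%02d:%02d,%03d", total / 3600, (total % 3600) / 60, total % 60, millis)
end

cue_number = 0
operation.response.results.each do |result|
  result.alternatives.first.words.each_slice(5) do |cue_words|
    cue_number += 1
    puts cue_number
    puts "#{srt_timestamp(cue_words.first.start_time)} --> #{srt_timestamp(cue_words.last.end_time)}"
    puts cue_words.map(&:word).join(" ")
    puts
  end
end
```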
Error Handling
Production code needs to handle API failures gracefully. Network issues, invalid audio, and quota limits all happen:
require "google/cloud/speech"
def transcribe(file_path, language: "en-US")
speech = Google::Cloud::Speech.speech
audio = { content: File.binread(file_path) }
config = {
encoding: :LINEAR16,
sample_rate_hertz: 16_000,
language_code: language,
enable_automatic_punctuation: true
}
response = speech.recognize config: config, audio: audio
if response.results.empty?
{ success: false, error: "No speech detected in audio" }
else
transcript = response.results.map { |r| r.alternatives.first.transcript }.join(" ")
confidence = response.results.map { |r| r.alternatives.first.confidence }.min
{ success: true, transcript: transcript, confidence: confidence }
end
rescue Google::Cloud::InvalidArgumentError => e
{ success: false, error: "Invalid audio format: #{e.message}" }
rescue Google::Cloud::ResourceExhaustedError
{ success: false, error: "API quota exceeded. Try again later." }
rescue Google::Cloud::Error => e
{ success: false, error: "API error: #{e.message}" }
end
result = transcribe("interview.wav")
if result[:success]
puts result[:transcript]
puts "Confidence: #{(result[:confidence] * 100).round(1)}%"
else
puts "Failed: #{result[:error]}"
endThe key errors to watch for are InvalidArgumentError (wrong encoding or sample rate), ResourceExhaustedError (you hit your quota), and generic Google::Cloud::Error for network and server issues.
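Note that transcribe swallows the quota error and returns a hash, so retries belong around the raw API call. Here's a sketch of exponential backoff, where with_backoff is a hypothetical helper rather than anything the gem provides:

```ruby
require "google/cloud/speech"

# Hypothetical helper: retry the block with exponential backoff
# whenever the API reports an exhausted quota.
def with_backoff(max_attempts: 4, base_delay: 2)
  attempts = 0
  begin
    yield
  rescue Google::Cloud::ResourceExhaustedError
    attempts += 1
    raise if attempts >= max_attempts
    sleep base_delay**attempts  # waits 2s, 4s, 8s between attempts
    retry
  end
end

speech = Google::Cloud::Speech.speech
audio = { content: File.binread("interview.wav") }
config = { encoding: :LINEAR16, sample_rate_hertz: 16_000, language_code: "en-US" }

response = with_backoff { speech.recognize config: config, audio: audio }
```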
Multi-Language and Speaker Detection
If your audio contains multiple languages or speakers, the API can help:
```ruby
config = {
  encoding: :LINEAR16,
  sample_rate_hertz: 16_000,
  language_code: "en-US",
  alternative_language_codes: ["es-ES", "fr-FR"],
  diarization_config: {
    enable_speaker_diarization: true,
    min_speaker_count: 2,
    max_speaker_count: 4
  }
}
```

Speaker diarization labels each word with a speaker tag, so you can reconstruct who said what. This is valuable for meeting transcriptions or interview processing.
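With diarization enabled, the words in the response carry speaker_tag values, and the last result aggregates all the words from the audio. A rough sketch of reconstructing speaker turns from a recognize response made with the config above:

```ruby
# The final result contains every word with its speaker_tag attached.
words = response.results.last.alternatives.first.words

# Group consecutive words spoken by the same speaker into turns.
words.chunk_while { |a, b| a.speaker_tag == b.speaker_tag }.each do |turn|
  puts "Speaker #{turn.first.speaker_tag}: #{turn.map(&:word).join(' ')}"
end
```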
Handling Real-World Challenges
Speech recognition isn't perfect. Here are common issues and practical workarounds:
Accents and dialects -- Use the appropriate language code. Google supports regional variants like en-GB, en-AU, and en-IN. You can also pass speech_contexts with phrases the API should bias toward, which helps with domain-specific vocabulary:
```ruby
config = {
  encoding: :LINEAR16,
  sample_rate_hertz: 16_000,
  language_code: "en-US",
  speech_contexts: [{ phrases: ["Ruby on Rails", "ActiveRecord", "Sidekiq", "Puma"] }]
}
```

This is especially useful for technical content where the default model might misinterpret jargon.
Background noise -- Clean audio produces better results. Consider preprocessing with SoX before sending to the API:
```bash
# Build a noise profile first (assumes the first 0.5s contains only background noise)
sox noisy.wav -n trim 0 0.5 noiseprof noise-profile
# Then apply noise reduction using that profile
sox noisy.wav clean.wav noisered noise-profile 0.21
```

Empty results -- If the API returns no results, check three things: the audio encoding matches your config, the sample rate is correct, and the audio actually contains speech in the specified language.
Privacy and Security
Voice data is sensitive. A few things to keep in mind:
- Audio is transmitted to Google's servers for processing. Use HTTPS (the gem handles this by default)
- Enable data logging controls in your Cloud project settings
- Consider on-premise solutions if you handle HIPAA-protected or similarly regulated data
- Delete audio files after processing if you don't need them (one way is sketched after this list)
- Rotate your service account credentials periodically
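For the deletion point, one way to guarantee cleanup (a sketch reusing the transcribe helper from the error-handling section) is an ensure block:

```ruby
# Transcribe, then remove the recording even if transcription raises.
def transcribe_and_delete(file_path)
  transcribe(file_path)
ensure
  File.delete(file_path) if File.exist?(file_path)
end
```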
What Can You Build?
Once you have transcription working, the possibilities open up:
- Voice commands -- Build a CLI tool that accepts spoken input for hands-free operation
- Transcription service -- Process podcast episodes or lecture recordings into searchable text. For an AWS-based alternative, see automating audio transcription with Ruby and Amazon Transcribe
- Call analytics -- Analyze customer service calls for sentiment and keyword extraction. You could pair this with a Bayesian text classifier built in Ruby for automatic categorization
- Accessibility -- Generate real-time captions for deaf or hard-of-hearing users
- Meeting notes -- Combine speaker diarization with transcription to produce structured meeting summaries
Wrapping Up
Speech recognition in Ruby is practical and production-ready. The google-cloud-speech gem handles authentication, audio encoding, and gRPC communication. You get to focus on what matters: building features your users want. Start with the synchronous recognize method for short audio, graduate to long_running_recognize for longer files, and add error handling before shipping to production.