Sidekiq Background Jobs That Don't Fail: A Production Guide

Moving slow operations to background jobs seems simple. Create a job class, call perform_async, done. Then production happens.

Jobs fail silently. Duplicate charges hit customers. The retry queue grows to 50,000 jobs. Deployments cause data corruption. I've seen all of these.

Sidekiq is battle-tested infrastructure—it processes billions of jobs daily across thousands of companies. But writing reliable jobs requires understanding how Sidekiq actually works and designing around its guarantees (and lack thereof).

The One Thing You Must Understand

Sidekiq guarantees your job will run at least once. Not exactly once. At least once.

This means your job might run twice, three times, or more. Redis can crash after your job completes but before Sidekiq acknowledges completion. Network blips happen. Deploys restart workers mid-job.

Sidekiq Architecture - How jobs flow through Redis

If your job charges a credit card, it might charge twice. If it sends an email, duplicate emails. If it creates a database record, duplicate records.

The solution: design every job to be idempotent. Running it once or five times should produce the same result.

Idempotency Patterns That Work

Idempotency Pattern - Preventing duplicate job execution

Pattern 1: Check Before Acting

The simplest approach—check if the work was already done:

class SendWelcomeEmailJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)

    # Guard: already sent?
    return if user.welcome_email_sent_at.present?

    UserMailer.welcome(user).deliver_now

    user.update!(welcome_email_sent_at: Time.current)
  end
end

This works but has a race condition. Two workers could both pass the guard before either updates the flag.

Pattern 2: Database Constraints

Use unique constraints to make duplicates impossible:

class ProcessPaymentJob
  include Sidekiq::Job

  def perform(order_id, idempotency_key)
    order = Order.find(order_id)

    # Create payment record with unique constraint on idempotency_key
    payment = Payment.create!(
      order: order,
      idempotency_key: idempotency_key,  # unique index
      amount: order.total,
      status: 'pending'
    )

    result = PaymentGateway.charge(
      amount: order.total,
      idempotency_key: idempotency_key  # Stripe also supports idempotency keys
    )

    payment.update!(
      status: result.success? ? 'completed' : 'failed',
      gateway_id: result.id
    )
  rescue ActiveRecord::RecordNotUnique
    # Already processed, safe to ignore
    Rails.logger.info("Payment already processed: #{idempotency_key}")
  end
end

# Enqueue with unique key
ProcessPaymentJob.perform_async(order.id, "order-#{order.id}-payment-#{SecureRandom.hex(8)}")

The database constraint guarantees only one payment processes, regardless of how many times the job runs.

Pattern 3: Idempotent State Machine

For multi-step processes, use explicit state transitions:

class FulfillOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)

    case order.status
    when 'paid'
      reserve_inventory(order)
      order.update!(status: 'inventory_reserved')
      FulfillOrderJob.perform_async(order_id)  # Re-enqueue for next step

    when 'inventory_reserved'
      create_shipment(order)
      order.update!(status: 'shipped')

    when 'shipped'
      # Already complete, nothing to do
      Rails.logger.info("Order #{order_id} already shipped")

    else
      Rails.logger.warn("Order #{order_id} in unexpected state: #{order.status}")
    end
  end

  private

  def reserve_inventory(order)
    # Idempotent: check if already reserved
    return if InventoryReservation.exists?(order_id: order.id)

    order.line_items.each do |item|
      InventoryReservation.create!(
        order_id: order.id,
        product_id: item.product_id,
        quantity: item.quantity
      )
    end
  end

  def create_shipment(order)
    return if order.shipment.present?

    Shipment.create!(order: order, status: 'pending')
  end
end

Each state transition is idempotent. The job can restart at any point and resume correctly.

Argument Best Practices

Pass IDs, Not Objects

Never serialize Ruby objects into job arguments:

# BAD: Object state gets stale
class BadJob
  include Sidekiq::Job

  def perform(user)  # user is serialized at enqueue time
    user.charge_subscription  # Uses stale data!
  end
end

# GOOD: Fetch fresh data
class GoodJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find(user_id)  # Fresh from database
    user.charge_subscription
  end
end

Sidekiq serializes arguments to JSON. By the time the job runs (seconds, minutes, or hours later), that user object's data is stale.

Use Simple Types Only

Job arguments must be JSON-serializable:

# GOOD: Primitives
MyJob.perform_async(123, "hello", true, 45.67)

# GOOD: Simple arrays and hashes
MyJob.perform_async([1, 2, 3], { "key" => "value" })

# BAD: Symbols (become strings)
MyJob.perform_async(:status)  # Received as "status"

# BAD: Time objects
MyJob.perform_async(Time.current)  # Serialization issues

# GOOD: ISO8601 strings for times
MyJob.perform_async(Time.current.iso8601)

Handle Missing Records

Records can be deleted between enqueue and execution:

class ProcessUserJob
  include Sidekiq::Job

  def perform(user_id)
    user = User.find_by(id: user_id)

    unless user
      Rails.logger.warn("User #{user_id} not found, skipping")
      return
    end

    # Process user...
  end
end

Or use find and let ActiveRecord::RecordNotFound trigger a retry (if that's appropriate for your use case).

Retry Configuration

Default Behavior

Sidekiq retries failed jobs 25 times over about 21 days using exponential backoff:

Retry 1:  ~32 seconds
Retry 5:  ~10 minutes
Retry 10: ~2.5 hours
Retry 15: ~1 day
Retry 20: ~6 days
Retry 25: Dead queue (after ~21 days)

Customizing Retries

class CriticalJob
  include Sidekiq::Job

  # Retry more aggressively for critical jobs
  sidekiq_options retry: 10  # Max 10 retries

  def perform(id)
    # ...
  end
end

class NonCriticalJob
  include Sidekiq::Job

  # Don't retry at all
  sidekiq_options retry: false

  def perform(id)
    # ...
  end
end

class CustomBackoffJob
  include Sidekiq::Job

  sidekiq_options retry: 5

  # Custom retry delay
  sidekiq_retry_in do |count, exception|
    case exception
    when RateLimitError
      60 * 60  # Wait 1 hour for rate limits
    when NetworkError
      10 * (count + 1)  # Linear backoff for network issues
    else
      :exponential_backoff  # Default behavior
    end
  end

  def perform(id)
    # ...
  end
end

Handling Specific Errors

class ExternalApiJob
  include Sidekiq::Job

  sidekiq_options retry: 5

  # Don't retry certain errors
  sidekiq_retries_exhausted do |job, exception|
    Rails.logger.error("Job #{job['jid']} failed permanently: #{exception.message}")
    NotifyOpsTeam.alert("Background job failed", job: job, error: exception)
  end

  def perform(record_id)
    record = Record.find(record_id)

    begin
      ExternalApi.sync(record)
    rescue ExternalApi::NotFoundError
      # Record doesn't exist in external system, don't retry
      record.update!(sync_status: 'not_found')
      # Don't raise, job completes successfully
    rescue ExternalApi::RateLimitError => e
      # Retry with backoff
      raise e
    rescue ExternalApi::InvalidDataError => e
      # Our fault, don't retry, fix the code
      record.update!(sync_status: 'invalid_data', sync_error: e.message)
      # Don't raise
    end
  end
end

Queue Organization

Queue Priority - How Sidekiq processes queues by weight

Separate Queues by Priority

# config/sidekiq.yml
:queues:
  - [critical, 10]    # Weight 10 - processed first
  - [default, 5]      # Weight 5
  - [low, 1]          # Weight 1 - processed last

# Job classes
class PaymentJob
  include Sidekiq::Job
  sidekiq_options queue: :critical
end

class ReportJob
  include Sidekiq::Job
  sidekiq_options queue: :low
end

Dedicated Queues for Slow Jobs

Long-running jobs can block other jobs in the same queue:

# config/sidekiq.yml - separate process for heavy jobs
:queues:
  - heavy

# sidekiq_heavy.yml
:concurrency: 2  # Fewer threads for memory-heavy jobs
:queues:
  - heavy

class VideoProcessingJob
  include Sidekiq::Job
  sidekiq_options queue: :heavy
end

Run separate Sidekiq processes:

# Main process
bundle exec sidekiq -C config/sidekiq.yml

# Heavy jobs process
bundle exec sidekiq -C config/sidekiq_heavy.yml

Deployment Safety

The 25-Second Problem

During deployment, Sidekiq gets 25 seconds (configurable) to finish current jobs before being terminated. Unfinished jobs return to the queue and run again.

Sidekiq Retry Flow - How failed jobs are retried

If your job takes 30 seconds and gets killed at second 20, it will restart from the beginning. If it's not idempotent, you have a problem.

Solutions:

  1. Keep jobs short: Break long jobs into smaller pieces
  2. Use checkpointing: Save progress and resume
  3. Increase timeout: sidekiq_options timeout: 60

Checkpointing Long Jobs

class BulkImportJob
  include Sidekiq::Job
  sidekiq_options retry: 3

  BATCH_SIZE = 100

  def perform(import_id, offset = 0)
    import = Import.find(import_id)
    records = import.pending_records.offset(offset).limit(BATCH_SIZE)

    return if records.empty?  # Done

    records.each do |record|
      process_record(record)
    end

    # Re-enqueue for next batch
    BulkImportJob.perform_async(import_id, offset + BATCH_SIZE)
  end

  private

  def process_record(record)
    # Process single record (idempotent)
    return if record.processed?

    # ... processing logic ...

    record.update!(processed: true)
  end
end

If the job gets killed mid-batch, it restarts from the same offset. Already-processed records are skipped.

Monitoring in Production

Essential Metrics

Track these in your monitoring system:

# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.on(:startup) do
    # Export metrics every 30 seconds
    Thread.new do
      loop do
        stats = Sidekiq::Stats.new

        StatsD.gauge('sidekiq.processed', stats.processed)
        StatsD.gauge('sidekiq.failed', stats.failed)
        StatsD.gauge('sidekiq.enqueued', stats.enqueued)
        StatsD.gauge('sidekiq.retry_size', stats.retry_size)
        StatsD.gauge('sidekiq.dead_size', stats.dead_size)

        Sidekiq::Queue.all.each do |queue|
          StatsD.gauge("sidekiq.queue.#{queue.name}.size", queue.size)
          StatsD.gauge("sidekiq.queue.#{queue.name}.latency", queue.latency)
        end

        sleep 30
      end
    end
  end
end

Alerting Thresholds

# Example alerting rules
alerts:
  - name: sidekiq_queue_backing_up
    condition: sidekiq.queue.default.latency > 300  # 5 minutes
    severity: warning

  - name: sidekiq_retry_queue_growing
    condition: sidekiq.retry_size > 1000
    severity: critical

  - name: sidekiq_dead_jobs
    condition: delta(sidekiq.dead_size, 1h) > 10
    severity: warning

The Sidekiq Web UI

Mount it in your routes (protected by authentication):

# config/routes.rb
require 'sidekiq/web'

authenticate :user, ->(user) { user.admin? } do
  mount Sidekiq::Web => '/sidekiq'
end

Testing Sidekiq Jobs

Inline Mode for Tests

# spec/rails_helper.rb
RSpec.configure do |config|
  config.before(:each) do
    Sidekiq::Testing.fake!  # Jobs go to fake queue
  end
end

# spec/jobs/send_email_job_spec.rb
RSpec.describe SendEmailJob do
  it 'enqueues the job' do
    expect {
      SendEmailJob.perform_async(user.id)
    }.to change(SendEmailJob.jobs, :size).by(1)
  end

  it 'sends the email' do
    Sidekiq::Testing.inline! do  # Execute immediately
      expect {
        SendEmailJob.perform_async(user.id)
      }.to change { ActionMailer::Base.deliveries.count }.by(1)
    end
  end
end

Testing Idempotency

RSpec.describe ProcessPaymentJob do
  it 'is idempotent' do
    order = create(:order, total: 100)
    idempotency_key = "test-key-#{SecureRandom.hex}"

    # Run twice
    2.times do
      ProcessPaymentJob.new.perform(order.id, idempotency_key)
    end

    # Should only create one payment
    expect(Payment.where(order: order).count).to eq(1)
  end
end

Performance Tuning

Concurrency Settings

# config/sidekiq.yml
:concurrency: 10  # 10 threads per process

Guidelines:

  • CPU-bound jobs: Match CPU cores
  • I/O-bound jobs (API calls, DB): Higher concurrency (15-25)
  • Memory-heavy jobs: Lower concurrency (2-5)

Connection Pooling

# config/database.yml
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 }.to_i + ENV.fetch("SIDEKIQ_CONCURRENCY") { 10 }.to_i %>

Each Sidekiq thread needs a database connection. Size your pool accordingly.

Batch Operations

Instead of many small jobs:

# SLOW: One job per user
users.each { |u| ProcessUserJob.perform_async(u.id) }

# FASTER: Batch processing
users.each_slice(100) do |batch|
  ProcessUserBatchJob.perform_async(batch.map(&:id))
end

The Bottom Line

Reliable Sidekiq jobs come down to three principles:

  1. Assume jobs run multiple times: Design for idempotency from the start
  2. Keep jobs focused: One job, one responsibility, fast execution
  3. Monitor everything: You can't fix what you can't see

Sidekiq handles the hard distributed systems problems—retries, persistence, concurrency. Your job is to write code that works correctly regardless of how many times it runs.

Start simple. Add complexity only when you have evidence you need it. A well-designed idempotent job beats a complex distributed transaction every time.