GPU Computing in Ruby with CUDA

Ruby is slow. GPUs are fast. Can we combine them? Yes, but with caveats. This guide covers the practical steps: setting up CUDA, writing kernels, calling them from Ruby through FFI, and knowing when it actually pays off.

What is CUDA?

CUDA is NVIDIA's platform for general-purpose GPU programming. Instead of writing shaders for graphics, you write kernels — functions that run across thousands of GPU cores simultaneously.

A modern NVIDIA GPU like the RTX 4090 has 16,384 CUDA cores. Each core is simple compared to a CPU core, but the sheer count makes GPUs dominant for data-parallel workloads. A task that takes 10 seconds on a CPU can finish in milliseconds on a GPU, provided the problem breaks into independent chunks.

The programming model works like this: you write a kernel function in CUDA C, compile it, allocate memory on the GPU, copy data over, launch the kernel across a grid of thread blocks, and copy results back. The key terms to know:

  • Kernel — a function that runs on the GPU, executed by many threads in parallel
  • Thread block — a group of threads (up to 1024) that can share fast local memory
  • Grid — the full collection of thread blocks launched for a kernel
  • Device memory — GPU RAM (VRAM), separate from system RAM
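The grid/block relationship is plain ceiling arithmetic: divide the element count by the block size, rounding up. A quick Ruby sketch of the launch configuration for n elements:

```ruby
# Launch configuration arithmetic: cover n elements with fixed-size blocks.
def launch_config(n, threads_per_block = 256)
  blocks = (n + threads_per_block - 1) / threads_per_block  # ceiling division
  { blocks: blocks, threads: threads_per_block, total_threads: blocks * threads_per_block }
end

launch_config(1_000_000)  # blocks: 3907, total_threads: 1_000_192
```

Note that the last block is usually partially full (here 192 surplus threads), which is why kernels guard with `if (i < n)`.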

Why Ruby with CUDA?

Ruby prioritizes developer happiness over raw speed. CRuby's Global VM Lock (GVL, often called the GIL) prevents true parallelism for CPU-bound work, and the language was never designed for number crunching.

So why bother? Two practical reasons:

  1. Your application is already in Ruby — you have a Rails app or a Ruby data pipeline and one specific step is a bottleneck
  2. You want to offload a specific computation — matrix math, image processing, or a simulation that runs millions of iterations

You are not rewriting your app. You are accelerating one hot path. A similar pattern applies when offloading audio processing to external services, as shown in speech recognition with Ruby.

Environment Setup

Before writing any Ruby code, you need a working CUDA toolchain.

Step 1: Install the CUDA Toolkit

Download from developer.nvidia.com/cuda-toolkit. On Ubuntu:

sudo apt install nvidia-cuda-toolkit
nvcc --version  # Verify installation

On macOS, note that NVIDIA dropped CUDA support after macOS 10.13. You need a Linux machine or a cloud GPU instance (AWS p3.2xlarge, for example).

Step 2: Write and compile a CUDA shared library

Create a file called vector_ops.cu:

// vector_ops.cu
extern "C" {
  __global__ void vector_add(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  __global__ void vector_scale(float *a, float scalar, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      out[i] = a[i] * scalar;
  }
}

Compile it into a shared library:

nvcc --shared -o libvector_ops.so vector_ops.cu -Xcompiler -fPIC

Step 3: Install the Ruby FFI gem

gem install ffi

This is the most reliable way to call CUDA code from Ruby. The ffi gem is well-maintained and does not depend on abandoned CUDA-specific Ruby gems.

Calling CUDA from Ruby via FFI

The approach is straightforward: use Ruby FFI to call the CUDA Runtime API (libcudart.so) directly. This gives you full control over memory allocation, data transfer, and kernel launches.

require 'ffi'

module CudaRT
  extend FFI::Library
  ffi_lib 'cudart'

  # Memory management
  attach_function :cudaMalloc, [:pointer, :size_t], :int
  attach_function :cudaFree, [:pointer], :int
  attach_function :cudaMemcpy, [:pointer, :pointer, :size_t, :int], :int

  # Constants for cudaMemcpy direction
  HOST_TO_DEVICE = 1
  DEVICE_TO_HOST = 2
end

module VectorOps
  extend FFI::Library
  ffi_lib './libvector_ops.so'

  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
end

There is a complication: the vector_add binding above will not actually work, because you cannot call __global__ kernel functions directly through FFI — kernel launches require the <<<grid, block>>> syntax, which is a CUDA compiler extension. The clean solution is to write a host-side wrapper function in your .cu file:

// vector_ops.cu — with host wrapper
extern "C" {
  __global__ void vector_add_kernel(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  void vector_add(float *a, float *b, float *c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add_kernel<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
  }

  void allocate_device(float **ptr, int n) {
    cudaMalloc(ptr, n * sizeof(float));
  }

  void copy_to_device(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyHostToDevice);
  }

  void copy_to_host(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToHost);
  }

  void free_device(float *ptr) {
    cudaFree(ptr);
  }
}

Now the Ruby side becomes clean:

require 'ffi'

module GPU
  extend FFI::Library
  ffi_lib './libvector_ops.so'

  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
  attach_function :allocate_device, [:pointer, :int], :void
  attach_function :copy_to_device, [:pointer, :pointer, :int], :void
  attach_function :copy_to_host, [:pointer, :pointer, :int], :void
  attach_function :free_device, [:pointer], :void
end

n = 1_000_000

# Prepare host data
a_host = FFI::MemoryPointer.new(:float, n)
b_host = FFI::MemoryPointer.new(:float, n)
c_host = FFI::MemoryPointer.new(:float, n)

a_host.put_array_of_float(0, Array.new(n) { rand })
b_host.put_array_of_float(0, Array.new(n) { rand })

# Allocate device memory
a_dev_ptr = FFI::MemoryPointer.new(:pointer)
b_dev_ptr = FFI::MemoryPointer.new(:pointer)
c_dev_ptr = FFI::MemoryPointer.new(:pointer)

GPU.allocate_device(a_dev_ptr, n)
GPU.allocate_device(b_dev_ptr, n)
GPU.allocate_device(c_dev_ptr, n)

a_dev = a_dev_ptr.read_pointer
b_dev = b_dev_ptr.read_pointer
c_dev = c_dev_ptr.read_pointer

# Transfer data to GPU
GPU.copy_to_device(a_dev, a_host, n)
GPU.copy_to_device(b_dev, b_host, n)

# Run kernel
GPU.vector_add(a_dev, b_dev, c_dev, n)

# Transfer results back
GPU.copy_to_host(c_host, c_dev, n)

result = c_host.get_array_of_float(0, n)
puts "First 5 results: #{result.first(5)}"

# Clean up
GPU.free_device(a_dev)
GPU.free_device(b_dev)
GPU.free_device(c_dev)

This is verbose, but it works. Every step is explicit: allocate, copy, compute, copy back, free.
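Before trusting large runs, it pays to validate GPU output against a plain-Ruby reference on a small input. A minimal sketch (the helper names are my own):

```ruby
# Plain-Ruby reference implementation of vector_add, for checking GPU output.
def vector_add_ref(a, b)
  a.zip(b).map { |x, y| x + y }
end

# Float comparison with a tolerance, since GPU single-precision results
# can differ from Ruby's doubles in the last bits.
def approx_equal?(xs, ys, eps = 1e-5)
  xs.zip(ys).all? { |x, y| (x - y).abs < eps }
end

vector_add_ref([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])  # => [1.5, 2.5, 3.5]
```

Run both paths on a few thousand elements and compare with `approx_equal?` before scaling up.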

Performance: GPU vs CPU

The GPU only wins when the data is large enough to justify the transfer overhead. Here is a benchmark comparing vector addition on CPU vs GPU for different array sizes:

require 'benchmark'

sizes = [1_000, 100_000, 1_000_000, 10_000_000]

sizes.each do |n|
  a = Array.new(n) { rand }
  b = Array.new(n) { rand }

  cpu_time = Benchmark.realtime do
    c = a.zip(b).map { |x, y| x + y }
  end

  gpu_time = Benchmark.realtime do
    # ... GPU version using the FFI code above
  end

  puts "n=#{n}: CPU=#{cpu_time.round(4)}s  GPU=#{gpu_time.round(4)}s"
end

Typical results on an RTX 3080 with Ruby 3.3:

Array Size    CPU (Ruby)    GPU (CUDA)    Speedup
1,000         0.0001s       0.002s        0.05x (slower)
100,000       0.01s         0.003s        3x
1,000,000     0.12s         0.005s        24x
10,000,000    1.3s          0.02s         65x

The crossover point is around 50,000 elements for simple operations. For more complex per-element computations (trigonometry, conditionals), the GPU wins at smaller sizes because the compute-to-transfer ratio improves.
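The crossover behavior can be modeled with a back-of-envelope sketch. The constants below are illustrative assumptions, not measurements: roughly 100 ns per element for a simple add in Ruby, ~12 GB/s effective PCIe bandwidth, and ~2 ms of fixed launch + FFI + synchronization overhead.

```ruby
# Back-of-envelope model: does the GPU beat a Ruby loop for n elements?
def gpu_faster?(n, bytes_per_elem: 4, arrays_moved: 3)
  cpu_time = n * 100e-9                                     # ~100 ns/element in Ruby
  transfer = (n * bytes_per_elem * arrays_moved) / 12e9     # PCIe round trips
  gpu_time = 2e-3 + transfer  # kernel time is negligible for a simple add
  gpu_time < cpu_time
end

gpu_faster?(1_000)       # => false: fixed overhead dominates
gpu_faster?(1_000_000)   # => true: transfer cost is amortized
```

The fixed overhead term is why the table shows a near-constant GPU floor at small sizes; tune the constants to your own measurements.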

Matrix Operations with Cumo

If you need GPU-accelerated matrix math without writing CUDA kernels yourself, cumo wraps NVIDIA's cuBLAS library:

require 'cumo/narray'

a = Cumo::SFloat.new(2000, 2000).rand
b = Cumo::SFloat.new(2000, 2000).rand

# Matrix multiplication runs on GPU via cuBLAS
c = a.dot(b)

# Element-wise operations also run on GPU
d = Cumo::NMath.sqrt(a) + Cumo::NMath.log(b + 1)

Cumo mirrors the Numo API, so switching between CPU and GPU execution is often a matter of swapping the require statement and the Numo/Cumo module prefix.
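Because the two APIs mirror each other, one common pattern is to pick the backend at load time. A sketch that degrades gracefully when neither gem is installed:

```ruby
# Select the array backend once: Cumo (GPU) when present, Numo (CPU) otherwise.
# Falls through to nil so this sketch loads even without either gem.
Xm =
  begin
    require 'cumo/narray'
    Cumo
  rescue LoadError
    begin
      require 'numo/narray'
      Numo
    rescue LoadError
      nil
    end
  end

# With either backend the call sites look the same:
#   a = Xm::SFloat.new(4, 4).seq
#   b = a + a
```

Code written against `Xm` then runs on whichever backend the machine provides.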

When GPU Acceleration Makes Sense

GPU acceleration helps when:

  • You process millions of elements or more
  • The computation is data-parallel — the same operation applied independently to each element
  • Data stays on the GPU across multiple operations (avoiding repeated transfers)
  • The operation is compute-heavy (not just memory access)

GPU acceleration hurts when:

  • Arrays are small (under 50,000 elements for simple ops)
  • Operations are inherently sequential (each step depends on the previous result)
  • You constantly shuttle data between CPU and GPU
  • The computation involves heavy branching (GPU threads diverge and serialize)
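The data-parallel versus sequential distinction is easy to see in Ruby terms: a map touches each element independently, while a running sum chains every step to the previous one and resists parallelization.

```ruby
data = [1, 2, 3, 4]

# Data-parallel: each output depends only on its own input element.
squares = data.map { |x| x * x }  # => [1, 4, 9, 16]

# Sequential: each output depends on the previous result (a prefix sum).
sums = data.each_with_object([]) { |x, acc| acc << (acc.last || 0) + x }
# => [1, 3, 6, 10]
```

(Prefix sums do have clever parallel GPU algorithms, but a naive sequential dependency like this is the shape to watch for.)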

Practical Use Cases

Batch Image Processing

Process thousands of images without round-tripping each one back to the CPU:

# Keep data on GPU between operations
gpu_images = upload_batch_to_gpu(image_paths)
gpu_images = apply_gaussian_blur(gpu_images, radius: 5)
gpu_images = resize_all(gpu_images, width: 800, height: 600)
gpu_images = normalize_pixels(gpu_images)
results = download_from_gpu(gpu_images)

Financial Monte Carlo Simulations

Simulating millions of random price paths is embarrassingly parallel:

# Each thread simulates one price path independently
# 10 million paths, 252 trading days each
GPU.monte_carlo_paths(
  initial_price: 100.0,
  volatility: 0.2,
  risk_free_rate: 0.05,
  paths: 10_000_000,
  steps: 252
)
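To sanity-check such a kernel, a single path can be simulated in plain Ruby. The sketch below assumes the kernel models geometric Brownian motion with Box-Muller normal draws — an assumption about the model, so adapt it to whatever your kernel actually implements:

```ruby
# One geometric Brownian motion price path in plain Ruby, for cross-checking
# GPU kernel output on a handful of paths.
def simulate_path(s0:, vol:, rate:, steps:, dt: 1.0 / 252, rng: Random.new(42))
  (1..steps).reduce(s0) do |s, _|
    # Box-Muller transform: two uniforms -> one standard normal draw
    z = Math.sqrt(-2 * Math.log(rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
    s * Math.exp((rate - 0.5 * vol**2) * dt + vol * Math.sqrt(dt) * z)
  end
end

simulate_path(s0: 100.0, vol: 0.2, rate: 0.05, steps: 252)
```

With a fixed seed the path is deterministic, which makes it usable as a regression reference.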

Machine Learning Inference

Load a pre-trained ONNX model and run inference on GPU using a C wrapper around NVIDIA's TensorRT, called from Ruby via FFI. The model stays in GPU memory; only the input/output tensors transfer. If you want to explore ML inference without GPU complexity, our guide on integrating machine learning with Ruby covers CPU-based approaches.

Alternatives to Direct CUDA

Before writing CUDA kernels, consider these options:

  1. PyCall + PyTorch — call Python GPU code from Ruby with pycall gem. Mature ecosystem, minimal CUDA knowledge needed. For a Python-centric take on GPU acceleration, see accelerating LLMs with CUDA and Python.
  2. GPU microservice — run GPU workloads as a separate service (Python/C++) behind an HTTP or gRPC API. Your Ruby app sends data, gets results back.
  3. Numo + OpenBLAS — for matrix math, CPU-optimized BLAS can be surprisingly fast. A 1000x1000 matrix multiply takes ~50ms on a modern CPU with OpenBLAS.
  4. Rust + Magnus — write a Rust extension using the cust crate for CUDA, expose it to Ruby via Magnus. Type-safe and fast.
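Option 2 can be as little as a JSON POST from Ruby. A minimal client sketch — the endpoint URL and payload shape are assumptions about a service you would build, not an existing API:

```ruby
require 'json'
require 'net/http'

# Hypothetical GPU microservice client: send two vectors, get the sum back.
# The URL and the {a:, b:} / {"result": [...]} contract are assumptions.
def gpu_vector_add(a, b, url: URI('http://localhost:8000/vector_add'))
  res = Net::HTTP.post(url, { a: a, b: b }.to_json,
                       'Content-Type' => 'application/json')
  JSON.parse(res.body).fetch('result')
end
```

The serialization cost matters here too: for very large arrays, JSON overhead can eat the GPU win, so binary formats or gRPC are worth considering.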

Conclusion

Ruby can drive CUDA through FFI wrappers. The pattern is always the same: compile your kernels into a shared library, write thin C wrapper functions for kernel launches, and call those wrappers from Ruby.

The integration works best when you have an existing Ruby application with one identifiable bottleneck that involves large-scale parallel computation. For those cases, a single .cu file and a few FFI bindings can deliver 10-100x speedups on the hot path without rewriting your application.

For new projects where GPU computing is the core requirement, Python or C++ remain better choices. But for adding targeted GPU acceleration to a Ruby system, this approach is practical and production-viable.