Ruby is slow. GPUs are fast. Can we combine them? Yes, but with caveats. This guide covers the practical steps: setting up CUDA, writing kernels, calling them from Ruby through FFI, and knowing when it actually pays off.
What is CUDA?
CUDA is NVIDIA's platform for general-purpose GPU programming. Instead of writing shaders for graphics, you write kernels — functions that run across thousands of GPU cores simultaneously.
A modern NVIDIA GPU like the RTX 4090 has 16,384 CUDA cores. Each core is simple compared to a CPU core, but the sheer count makes GPUs dominant for data-parallel workloads. A task that takes 10 seconds on a CPU can finish in milliseconds on a GPU, provided the problem breaks into independent chunks.
The programming model works like this: you write a kernel function in CUDA C, compile it, allocate memory on the GPU, copy data over, launch the kernel across a grid of thread blocks, and copy results back. The key terms to know:
- Kernel — a function that runs on the GPU, executed by many threads in parallel
- Thread block — a group of threads (up to 1024) that can share fast local memory
- Grid — the full collection of thread blocks launched for a kernel
- Device memory — GPU RAM (VRAM), separate from system RAM
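To make the indexing concrete, here is the launch arithmetic in plain Ruby — a sketch of how a flat array of n elements maps onto blocks and threads, using the same formula the CUDA kernels later in this guide use:

```ruby
# How a flat array of n elements maps onto a grid of thread blocks.
# Each GPU thread handles one element; this mirrors the index math
# (threadIdx.x + blockIdx.x * blockDim.x) used in the kernels below.
n       = 1_000_000
threads = 256                          # threads per block (hardware max: 1024)
blocks  = (n + threads - 1) / threads  # ceiling division => 3907 blocks

# The global index of, say, thread 5 in block 2:
i = 5 + 2 * threads # => 517
puts "#{blocks} blocks x #{threads} threads = #{blocks * threads} threads for #{n} elements"
```

Note that 3907 × 256 = 1,000,192 threads get launched for 1,000,000 elements, which is why kernels guard with a bounds check like `if (i < n)`: the last block is usually only partially full.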
Why Ruby with CUDA?
Ruby prioritizes developer happiness over raw speed. CRuby's Global VM Lock (GVL, commonly called the GIL) prevents true parallelism for CPU-bound work, and the language was never designed for number crunching.
So why bother? Two practical reasons:
- Your application is already in Ruby — you have a Rails app or a Ruby data pipeline and one specific step is a bottleneck
- You want to offload a specific computation — matrix math, image processing, or a simulation that runs millions of iterations
You are not rewriting your app. You are accelerating one hot path. A similar pattern applies when offloading audio processing to external services, as shown in speech recognition with Ruby.
Environment Setup
Before writing any Ruby code, you need a working CUDA toolchain.
Step 1: Install the CUDA Toolkit
Download from developer.nvidia.com/cuda-toolkit. On Ubuntu:
```bash
sudo apt install nvidia-cuda-toolkit
nvcc --version  # verify the installation
```

On macOS, note that NVIDIA dropped CUDA support after macOS 10.13. You need a Linux machine or a cloud GPU instance (AWS p3.2xlarge, for example).
Step 2: Write and compile a CUDA shared library
Create a file called vector_ops.cu:
```cuda
// vector_ops.cu
extern "C" {
  __global__ void vector_add(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  __global__ void vector_scale(float *a, float scalar, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      out[i] = a[i] * scalar;
  }
}
```

Compile it into a shared library:
```bash
nvcc --shared -o libvector_ops.so vector_ops.cu -Xcompiler -fPIC
```

Step 3: Install the Ruby FFI gem
```bash
gem install ffi
```

This is the most reliable way to call CUDA code from Ruby. The ffi gem is well-maintained and does not depend on abandoned CUDA-specific Ruby gems.
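Before touching CUDA, it is worth confirming that FFI itself works. A minimal sanity check against libc (no GPU involved):

```ruby
require 'ffi'

# Sanity check: bind a plain libc function through FFI.
module LibC
  extend FFI::Library
  ffi_lib FFI::Library::LIBC
  attach_function :getpid, [], :int
end

puts "FFI works, pid = #{LibC.getpid}"
```

If this prints your process ID, the binding machinery is in place and any later failures are CUDA-side, not Ruby-side.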
Calling CUDA from Ruby via FFI
The approach is straightforward: use Ruby FFI to call the CUDA Runtime API (libcudart.so) directly. This gives you full control over memory allocation, data transfer, and kernel launches.
```ruby
require 'ffi'

module CudaRT
  extend FFI::Library
  ffi_lib 'cudart'

  # Memory management
  attach_function :cudaMalloc, [:pointer, :size_t], :int
  attach_function :cudaFree, [:pointer], :int
  attach_function :cudaMemcpy, [:pointer, :pointer, :size_t, :int], :int

  # Constants for cudaMemcpy direction
  HOST_TO_DEVICE = 1
  DEVICE_TO_HOST = 2
end

module VectorOps
  extend FFI::Library
  ffi_lib './libvector_ops.so'
  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
end
```

There is a complication: you cannot call `__global__` kernel functions directly through FFI because kernel launches require the `<<<grid, block>>>` syntax, which is a CUDA compiler extension. The clean solution is to write a wrapper function in your .cu file:
```cuda
// vector_ops.cu — with host wrapper
extern "C" {
  __global__ void vector_add_kernel(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  // Host-side wrapper: computes the launch geometry and launches the
  // kernel, so Ruby only ever sees a plain C function signature.
  void vector_add(float *a, float *b, float *c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add_kernel<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
  }

  void allocate_device(float **ptr, int n) {
    cudaMalloc(ptr, n * sizeof(float));
  }

  void copy_to_device(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyHostToDevice);
  }

  void copy_to_host(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToHost);
  }

  void free_device(float *ptr) {
    cudaFree(ptr);
  }
}
```

Now the Ruby side becomes clean:
```ruby
require 'ffi'

module GPU
  extend FFI::Library
  ffi_lib './libvector_ops.so'

  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
  attach_function :allocate_device, [:pointer, :int], :void
  attach_function :copy_to_device, [:pointer, :pointer, :int], :void
  attach_function :copy_to_host, [:pointer, :pointer, :int], :void
  attach_function :free_device, [:pointer], :void
end

n = 1_000_000

# Prepare host data
a_host = FFI::MemoryPointer.new(:float, n)
b_host = FFI::MemoryPointer.new(:float, n)
c_host = FFI::MemoryPointer.new(:float, n)
a_host.put_array_of_float(0, Array.new(n) { rand })
b_host.put_array_of_float(0, Array.new(n) { rand })

# Allocate device memory
a_dev_ptr = FFI::MemoryPointer.new(:pointer)
b_dev_ptr = FFI::MemoryPointer.new(:pointer)
c_dev_ptr = FFI::MemoryPointer.new(:pointer)
GPU.allocate_device(a_dev_ptr, n)
GPU.allocate_device(b_dev_ptr, n)
GPU.allocate_device(c_dev_ptr, n)
a_dev = a_dev_ptr.read_pointer
b_dev = b_dev_ptr.read_pointer
c_dev = c_dev_ptr.read_pointer

# Transfer data to GPU
GPU.copy_to_device(a_dev, a_host, n)
GPU.copy_to_device(b_dev, b_host, n)

# Run kernel
GPU.vector_add(a_dev, b_dev, c_dev, n)

# Transfer results back
GPU.copy_to_host(c_host, c_dev, n)
result = c_host.get_array_of_float(0, n)
puts "First 5 results: #{result.first(5)}"

# Clean up
GPU.free_device(a_dev)
GPU.free_device(b_dev)
GPU.free_device(c_dev)
```

This is verbose, but it works. Every step is explicit: allocate, copy, compute, copy back, free.
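The boilerplate can be factored out. Here is one way to do it — a hypothetical `with_device_buffers` helper (not part of the library above, just a sketch) that allocates device buffers, yields the raw device pointers, and frees them even if the block raises:

```ruby
module GPU
  # Hypothetical convenience helper: allocate one device buffer per
  # requested element count, yield the device pointers, always free them.
  def self.with_device_buffers(*sizes)
    holders = sizes.map { FFI::MemoryPointer.new(:pointer) }
    sizes.each_with_index { |size, i| allocate_device(holders[i], size) }
    devs = holders.map(&:read_pointer)
    yield(*devs)
  ensure
    devs&.each { |dev| free_device(dev) }
  end
end

# The explicit sequence above collapses to:
GPU.with_device_buffers(n, n, n) do |a_dev, b_dev, c_dev|
  GPU.copy_to_device(a_dev, a_host, n)
  GPU.copy_to_device(b_dev, b_host, n)
  GPU.vector_add(a_dev, b_dev, c_dev, n)
  GPU.copy_to_host(c_host, c_dev, n)
end
```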
Performance: GPU vs CPU
The GPU only wins when the data is large enough to justify the transfer overhead. Here is a benchmark comparing vector addition on CPU vs GPU for different array sizes:
```ruby
require 'benchmark'

sizes = [1_000, 100_000, 1_000_000, 10_000_000]

sizes.each do |n|
  a = Array.new(n) { rand }
  b = Array.new(n) { rand }

  cpu_time = Benchmark.realtime do
    c = a.zip(b).map { |x, y| x + y }
  end

  gpu_time = Benchmark.realtime do
    # ... GPU version using the FFI code above
  end

  puts "n=#{n}: CPU=#{cpu_time.round(4)}s GPU=#{gpu_time.round(4)}s"
end
```

Typical results on an RTX 3080 with Ruby 3.3:
| Array Size | CPU (Ruby) | GPU (CUDA) | Speedup |
|---|---|---|---|
| 1,000 | 0.0001s | 0.002s | 0.05x (slower) |
| 100,000 | 0.01s | 0.003s | 3x |
| 1,000,000 | 0.12s | 0.005s | 24x |
| 10,000,000 | 1.3s | 0.02s | 65x |
The crossover point is around 50,000 elements for simple operations. For more complex per-element computations (trigonometry, conditionals), the GPU wins at smaller sizes because the compute-to-transfer ratio improves.
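A back-of-the-envelope calculation shows why. Assuming roughly 12 GB/s of effective PCIe 3.0 x16 bandwidth (an illustrative figure; real throughput varies by hardware), moving the data for the one-million-element vector add costs on the order of a millisecond:

```ruby
# Rough transfer cost for the n = 1,000,000 case, assuming ~12 GB/s
# effective PCIe bandwidth (illustrative, hardware-dependent).
n         = 1_000_000
bytes     = n * 4            # float32 = 4 bytes per element
bandwidth = 12.0 * 1024**3   # ~12 GiB/s in bytes per second
transfers = 3                # a and b to the device, c back to the host
seconds   = transfers * bytes / bandwidth
puts format('~%.2f ms spent on transfers alone', seconds * 1000) # ~0.93 ms
```

By this estimate, transfers account for roughly a fifth of the measured 5 ms at one million elements; at small sizes, fixed kernel-launch and driver overhead dominates instead.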
Matrix Operations with Cumo
If you need GPU-accelerated matrix math without writing CUDA kernels yourself, the cumo gem provides a CUDA-backed counterpart to Numo::NArray, using NVIDIA's cuBLAS library for linear algebra:
```ruby
require 'cumo/narray'

a = Cumo::SFloat.new(2000, 2000).rand
b = Cumo::SFloat.new(2000, 2000).rand

# Matrix multiplication runs on GPU via cuBLAS
c = a.dot(b)

# Element-wise operations also run on GPU
d = Cumo::NMath.sqrt(a) + Cumo::NMath.log(b + 1)
```

Cumo mirrors the Numo API, so switching between CPU and GPU execution often comes down to one require statement and a namespace swap.
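A common pattern is to hide the backend behind a single constant, so the rest of the code never mentions Numo or Cumo directly. A sketch, assuming both the numo-narray and cumo gems are installed:

```ruby
# Toggle CPU/GPU backends behind one constant.
if ENV['USE_GPU'] == '1'
  require 'cumo/narray'
  Xumo = Cumo
else
  require 'numo/narray'
  Xumo = Numo
end

a = Xumo::SFloat.new(1000, 1000).rand
c = a.dot(a) # cuBLAS on GPU when Xumo == Cumo, CPU BLAS otherwise
```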
When GPU Acceleration Makes Sense
GPU acceleration helps when:
- You process millions of elements or more
- The computation is data-parallel — the same operation applied independently to each element
- Data stays on the GPU across multiple operations (avoiding repeated transfers)
- The operation is compute-heavy (not just memory access)
GPU acceleration hurts when:
- Arrays are small (under 50,000 elements for simple ops)
- Operations are inherently sequential (each step depends on the previous result)
- You constantly shuttle data between CPU and GPU
- The computation involves heavy branching (GPU threads diverge and serialize)
Practical Use Cases
Batch Image Processing
Process thousands of images without round-tripping each one back to the CPU:
```ruby
# Keep data on GPU between operations (hypothetical pipeline API)
gpu_images = upload_batch_to_gpu(image_paths)
gpu_images = apply_gaussian_blur(gpu_images, radius: 5)
gpu_images = resize_all(gpu_images, width: 800, height: 600)
gpu_images = normalize_pixels(gpu_images)
results    = download_from_gpu(gpu_images)
```

Financial Monte Carlo Simulations
Simulating millions of random price paths is embarrassingly parallel:
```ruby
# Each thread simulates one price path independently
# 10 million paths, 252 trading days each
GPU.monte_carlo_paths(
  initial_price: 100.0,
  volatility: 0.2,
  risk_free_rate: 0.05,
  paths: 10_000_000,
  steps: 252
)
```
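For intuition, here is the per-thread work behind a call like the one above, sketched in plain Ruby: a single geometric Brownian motion price path. The helper name and parameters are illustrative, not part of any library:

```ruby
# What one GPU thread computes: a geometric Brownian motion price
# path of `steps` daily moves, S' = S * exp((r - vol^2/2)dt + vol*sqrt(dt)*Z).
def simulate_path(s0, volatility, rate, steps, dt = 1.0 / 252)
  drift = (rate - 0.5 * volatility**2) * dt
  (1..steps).inject(s0) do |price, _|
    # Box-Muller transform: two uniforms -> one standard normal draw
    u1, u2 = 1.0 - rand, rand # 1.0 - rand keeps u1 > 0 for the log
    z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math::PI * u2)
    price * Math.exp(drift + volatility * Math.sqrt(dt) * z)
  end
end

puts simulate_path(100.0, 0.2, 0.05, 252) # one simulated final price
```

Each path depends only on its own random draws, which is what makes the workload embarrassingly parallel.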
Machine Learning Inference
Load a pre-trained ONNX model and run inference on GPU using a C wrapper around NVIDIA's TensorRT, called from Ruby via FFI. The model stays in GPU memory; only the input/output tensors transfer. If you want to explore ML inference without GPU complexity, our guide on integrating machine learning with Ruby covers CPU-based approaches.
Alternatives to Direct CUDA
Before writing CUDA kernels, consider these options:
- PyCall + PyTorch — call Python GPU code from Ruby with the `pycall` gem (see the sketch after this list). Mature ecosystem, minimal CUDA knowledge needed. For a Python-centric take on GPU acceleration, see accelerating LLMs with CUDA and Python.
- GPU microservice — run GPU workloads as a separate service (Python/C++) behind an HTTP or gRPC API. Your Ruby app sends data, gets results back.
- Numo + OpenBLAS — for matrix math, CPU-optimized BLAS can be surprisingly fast. A 1000x1000 matrix multiply takes ~50ms on a modern CPU with OpenBLAS.
- Rust + Magnus — write a Rust extension using the `cust` crate for CUDA, expose it to Ruby via Magnus. Type-safe and fast.
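A minimal PyCall sketch, assuming the pycall gem plus a Python environment with a CUDA-enabled PyTorch install:

```ruby
require 'pycall/import'
include PyCall::Import

# Import PyTorch through PyCall. Tensors created with device: 'cuda'
# live in GPU memory; operations on them execute on the GPU.
pyimport :torch

a = torch.rand([2048, 2048], device: 'cuda')
b = torch.rand([2048, 2048], device: 'cuda')
c = torch.matmul(a, b) # GPU matrix multiply
puts c.sum.item        # only one scalar crosses back to the host
```

You inherit PyTorch's memory management and kernel library for the price of a Python dependency, which is often a better trade than hand-written CUDA.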
Conclusion
Ruby can drive CUDA through FFI wrappers. The pattern is always the same: compile your kernels into a shared library, write thin C wrapper functions for kernel launches, and call those wrappers from Ruby.
The integration works best when you have an existing Ruby application with one identifiable bottleneck that involves large-scale parallel computation. For those cases, a single .cu file and a few FFI bindings can deliver 10-100x speedups on the hot path without rewriting your application.
For new projects where GPU computing is the core requirement, Python or C++ remain better choices. But for adding targeted GPU acceleration to a Ruby system, this approach is practical and production-viable.