GPU Computing in Ruby with CUDA

Ruby is slow. GPUs are fast. Can we combine them? Yes, but with caveats. This guide covers the practical steps: setting up CUDA, writing kernels, calling them from Ruby through FFI, and knowing when it actually pays off.

What is CUDA?

CUDA is NVIDIA's platform for general-purpose GPU programming. Instead of writing shaders for graphics, you write kernels — functions that run across thousands of GPU cores simultaneously.

A modern NVIDIA GPU like the RTX 4090 has 16,384 CUDA cores. Each core is simple compared to a CPU core, but the sheer count makes GPUs dominant for data-parallel workloads. A task that takes 10 seconds on a CPU can finish in milliseconds on a GPU, provided the problem breaks into independent chunks.

The programming model works like this: you write a kernel function in CUDA C, compile it, allocate memory on the GPU, copy data over, launch the kernel across a grid of thread blocks, and copy results back. The key terms to know:

  • Kernel — a function that runs on the GPU, executed by many threads in parallel
  • Thread block — a group of threads (up to 1024) that can share fast local memory
  • Grid — the full collection of thread blocks launched for a kernel
  • Device memory — GPU RAM (VRAM), separate from system RAM
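The grid/block relationship is plain ceiling arithmetic: divide the element count by the block size, rounding up. A quick Ruby sketch of the launch configuration for n elements:

```ruby
# Launch configuration arithmetic: cover n elements with fixed-size blocks.
def launch_config(n, threads_per_block = 256)
  blocks = (n + threads_per_block - 1) / threads_per_block  # ceiling division
  { blocks: blocks, threads: threads_per_block, total_threads: blocks * threads_per_block }
end

launch_config(1_000_000)  # blocks: 3907, total_threads: 1_000_192
```

Note that the last block is usually partially full (here 192 surplus threads), which is why kernels guard with `if (i < n)`.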

Why Ruby with CUDA?

Ruby prioritizes developer happiness over raw speed. CRuby's Global VM Lock (GVL, often called the GIL) prevents true parallelism for CPU-bound work, and the language was never designed for number crunching.

So why bother? Two practical reasons:

  1. Your application is already in Ruby — you have a Rails app or a Ruby data pipeline and one specific step is a bottleneck
  2. You want to offload a specific computation — matrix math, image processing, or a simulation that runs millions of iterations

You are not rewriting your app. You are accelerating one hot path. A similar pattern applies when offloading audio processing to external services, as shown in speech recognition with Ruby.

Environment Setup

Before writing any Ruby code, you need a working CUDA toolchain.

Step 1: Install the CUDA Toolkit

Download from developer.nvidia.com/cuda-toolkit. On Ubuntu:

sudo apt install nvidia-cuda-toolkit
nvcc --version  # Verify installation

On macOS, note that NVIDIA dropped CUDA support after macOS 10.13. You need a Linux machine or a cloud GPU instance (AWS p3.2xlarge, for example).

Step 2: Write and compile a CUDA shared library

Create a file called vector_ops.cu:

// vector_ops.cu
extern "C" {
  __global__ void vector_add(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  __global__ void vector_scale(float *a, float scalar, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      out[i] = a[i] * scalar;
  }
}

Compile it into a shared library:

nvcc --shared -o libvector_ops.so vector_ops.cu -Xcompiler -fPIC

Step 3: Install the Ruby FFI gem

gem install ffi

This is the most reliable way to call CUDA code from Ruby. The ffi gem is well-maintained and does not depend on abandoned CUDA-specific Ruby gems.

Calling CUDA from Ruby via FFI

The approach is straightforward: use Ruby FFI to call the CUDA Runtime API (libcudart.so) directly. This gives you full control over memory allocation, data transfer, and kernel launches.

require 'ffi'

module CudaRT
  extend FFI::Library
  ffi_lib 'cudart'

  # Memory management
  attach_function :cudaMalloc, [:pointer, :size_t], :int
  attach_function :cudaFree, [:pointer], :int
  attach_function :cudaMemcpy, [:pointer, :pointer, :size_t, :int], :int

  # Constants for cudaMemcpy direction
  HOST_TO_DEVICE = 1
  DEVICE_TO_HOST = 2
end

module VectorOps
  extend FFI::Library
  ffi_lib './libvector_ops.so'

  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
end

There is a complication: the vector_add binding above will not actually work, because you cannot call __global__ kernel functions directly through FFI — kernel launches require the <<<grid, block>>> syntax, which is a CUDA compiler extension. The clean solution is to write a host-side wrapper function in your .cu file:

// vector_ops.cu — with host wrapper
extern "C" {
  __global__ void vector_add_kernel(float *a, float *b, float *c, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
      c[i] = a[i] + b[i];
  }

  void vector_add(float *a, float *b, float *c, int n) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add_kernel<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();
  }

  void allocate_device(float **ptr, int n) {
    cudaMalloc(ptr, n * sizeof(float));
  }

  void copy_to_device(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyHostToDevice);
  }

  void copy_to_host(float *dst, float *src, int n) {
    cudaMemcpy(dst, src, n * sizeof(float), cudaMemcpyDeviceToHost);
  }

  void free_device(float *ptr) {
    cudaFree(ptr);
  }
}

Now the Ruby side becomes clean:

require 'ffi'

module GPU
  extend FFI::Library
  ffi_lib './libvector_ops.so'

  attach_function :vector_add, [:pointer, :pointer, :pointer, :int], :void
  attach_function :allocate_device, [:pointer, :int], :void
  attach_function :copy_to_device, [:pointer, :pointer, :int], :void
  attach_function :copy_to_host, [:pointer, :pointer, :int], :void
  attach_function :free_device, [:pointer], :void
end

n = 1_000_000

# Prepare host data
a_host = FFI::MemoryPointer.new(:float, n)
b_host = FFI::MemoryPointer.new(:float, n)
c_host = FFI::MemoryPointer.new(:float, n)

a_host.put_array_of_float(0, Array.new(n) { rand })
b_host.put_array_of_float(0, Array.new(n) { rand })

# Allocate device memory
a_dev_ptr = FFI::MemoryPointer.new(:pointer)
b_dev_ptr = FFI::MemoryPointer.new(:pointer)
c_dev_ptr = FFI::MemoryPointer.new(:pointer)

GPU.allocate_device(a_dev_ptr, n)
GPU.allocate_device(b_dev_ptr, n)
GPU.allocate_device(c_dev_ptr, n)

a_dev = a_dev_ptr.read_pointer
b_dev = b_dev_ptr.read_pointer
c_dev = c_dev_ptr.read_pointer

# Transfer data to GPU
GPU.copy_to_device(a_dev, a_host, n)
GPU.copy_to_device(b_dev, b_host, n)

# Run kernel
GPU.vector_add(a_dev, b_dev, c_dev, n)

# Transfer results back
GPU.copy_to_host(c_host, c_dev, n)

result = c_host.get_array_of_float(0, n)
puts "First 5 results: #{result.first(5)}"

# Clean up
GPU.free_device(a_dev)
GPU.free_device(b_dev)
GPU.free_device(c_dev)

This is verbose, but it works. Every step is explicit: allocate, copy, compute, copy back, free.
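Before trusting large runs, it pays to validate GPU output against a plain-Ruby reference on a small input. A minimal sketch (the helper names are my own):

```ruby
# Plain-Ruby reference implementation of vector_add, for checking GPU output.
def vector_add_ref(a, b)
  a.zip(b).map { |x, y| x + y }
end

# Float comparison with a tolerance, since GPU single-precision results
# can differ from Ruby's doubles in the last bits.
def approx_equal?(xs, ys, eps = 1e-5)
  xs.zip(ys).all? { |x, y| (x - y).abs < eps }
end

vector_add_ref([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])  # => [1.5, 2.5, 3.5]
```

Run both paths on a few thousand elements and compare with `approx_equal?` before scaling up.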

Performance: GPU vs CPU

The GPU only wins when the data is large enough to justify the transfer overhead. Here is a benchmark comparing vector addition on CPU vs GPU for different array sizes:

require 'benchmark'

sizes = [1_000, 100_000, 1_000_000, 10_000_000]

sizes.each do |n|
  a = Array.new(n) { rand }
  b = Array.new(n) { rand }

  cpu_time = Benchmark.realtime do
    c = a.zip(b).map { |x, y| x + y }
  end

  gpu_time = Benchmark.realtime do
    # ... GPU version using the FFI code above
  end

  puts "n=#{n}: CPU=#{cpu_time.round(4)}s  GPU=#{gpu_time.round(4)}s"
end

Typical results on an RTX 3080 with Ruby 3.3:

Array Size    CPU (Ruby)    GPU (CUDA)    Speedup
1,000         0.0001s       0.002s        0.05x (slower)
100,000       0.01s         0.003s        3x
1,000,000     0.12s         0.005s        24x
10,000,000    1.3s          0.02s         65x

The crossover point is around 50,000 elements for simple operations. For more complex per-element computations (trigonometry, conditionals), the GPU wins at smaller sizes because the compute-to-transfer ratio improves.
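The crossover behavior can be modeled with a back-of-envelope sketch. The constants below are illustrative assumptions, not measurements: roughly 100 ns per element for a simple add in Ruby, ~12 GB/s effective PCIe bandwidth, and ~2 ms of fixed launch + FFI + synchronization overhead.

```ruby
# Back-of-envelope model: does the GPU beat a Ruby loop for n elements?
def gpu_faster?(n, bytes_per_elem: 4, arrays_moved: 3)
  cpu_time = n * 100e-9                                     # ~100 ns/element in Ruby
  transfer = (n * bytes_per_elem * arrays_moved) / 12e9     # PCIe round trips
  gpu_time = 2e-3 + transfer  # kernel time is negligible for a simple add
  gpu_time < cpu_time
end

gpu_faster?(1_000)       # => false: fixed overhead dominates
gpu_faster?(1_000_000)   # => true: transfer cost is amortized
```

The fixed overhead term is why the table shows a near-constant GPU floor at small sizes; tune the constants to your own measurements.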

Matrix Operations with Cumo

If you need GPU-accelerated matrix math without writing CUDA kernels yourself, cumo wraps NVIDIA's cuBLAS library:

require 'cumo/narray'

a = Cumo::SFloat.new(2000, 2000).rand
b = Cumo::SFloat.new(2000, 2000).rand

# Matrix multiplication runs on GPU via cuBLAS
c = a.dot(b)

# Element-wise operations also run on GPU
d = Cumo::NMath.sqrt(a) + Cumo::NMath.log(b + 1)

Cumo mirrors the Numo API, so switching between CPU and GPU execution is often a matter of swapping the require statement and the Numo/Cumo module prefix.
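Because the two APIs mirror each other, one common pattern is to pick the backend at load time. A sketch that degrades gracefully when neither gem is installed:

```ruby
# Select the array backend once: Cumo (GPU) when present, Numo (CPU) otherwise.
# Falls through to nil so this sketch loads even without either gem.
Xm =
  begin
    require 'cumo/narray'
    Cumo
  rescue LoadError
    begin
      require 'numo/narray'
      Numo
    rescue LoadError
      nil
    end
  end

# With either backend the call sites look the same:
#   a = Xm::SFloat.new(4, 4).seq
#   b = a + a
```

Code written against `Xm` then runs on whichever backend the machine provides.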

When GPU Acceleration Makes Sense

GPU acceleration helps when:

  • You process millions of elements or more
  • The computation is data-parallel — the same operation applied independently to each element
  • Data stays on the GPU across multiple operations (avoiding repeated transfers)
  • The operation is compute-heavy (not just memory access)

GPU acceleration hurts when:

  • Arrays are small (under 50,000 elements for simple ops)
  • Operations are inherently sequential (each step depends on the previous result)
  • You constantly shuttle data between CPU and GPU
  • The computation involves heavy branching (GPU threads diverge and serialize)
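The data-parallel versus sequential distinction is easy to see in Ruby terms: a map touches each element independently, while a running sum chains every step to the previous one and resists parallelization.

```ruby
data = [1, 2, 3, 4]

# Data-parallel: each output depends only on its own input element.
squares = data.map { |x| x * x }  # => [1, 4, 9, 16]

# Sequential: each output depends on the previous result (a prefix sum).
sums = data.each_with_object([]) { |x, acc| acc << (acc.last || 0) + x }
# => [1, 3, 6, 10]
```

(Prefix sums do have clever parallel GPU algorithms, but a naive sequential dependency like this is the shape to watch for.)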

Practical Use Cases

Batch Image Processing

Process thousands of images without round-tripping each one back to the CPU:

# Keep data on GPU between operations
gpu_images = upload_batch_to_gpu(image_paths)
gpu_images = apply_gaussian_blur(gpu_images, radius: 5)
gpu_images = resize_all(gpu_images, width: 800, height: 600)
gpu_images = normalize_pixels(gpu_images)
results = download_from_gpu(gpu_images)

Financial Monte Carlo Simulations

Simulating millions of random price paths is embarrassingly parallel:

# Each thread simulates one price path independently
# 10 million paths, 252 trading days each
GPU.monte_carlo_paths(
  initial_price: 100.0,
  volatility: 0.2,
  risk_free_rate: 0.05,
  paths: 10_000_000,
  steps: 252
)
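To sanity-check such a kernel, a single path can be simulated in plain Ruby. The sketch below assumes the kernel models geometric Brownian motion with Box-Muller normal draws — an assumption about the model, so adapt it to whatever your kernel actually implements:

```ruby
# One geometric Brownian motion price path in plain Ruby, for cross-checking
# GPU kernel output on a handful of paths.
def simulate_path(s0:, vol:, rate:, steps:, dt: 1.0 / 252, rng: Random.new(42))
  (1..steps).reduce(s0) do |s, _|
    # Box-Muller transform: two uniforms -> one standard normal draw
    z = Math.sqrt(-2 * Math.log(rng.rand)) * Math.cos(2 * Math::PI * rng.rand)
    s * Math.exp((rate - 0.5 * vol**2) * dt + vol * Math.sqrt(dt) * z)
  end
end

simulate_path(s0: 100.0, vol: 0.2, rate: 0.05, steps: 252)
```

With a fixed seed the path is deterministic, which makes it usable as a regression reference.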

Machine Learning Inference

Load a pre-trained ONNX model and run inference on GPU using a C wrapper around NVIDIA's TensorRT, called from Ruby via FFI. The model stays in GPU memory; only the input/output tensors transfer. If you want to explore ML inference without GPU complexity, our guide on integrating machine learning with Ruby covers CPU-based approaches.

Alternatives to Direct CUDA

Before writing CUDA kernels, consider these options:

  1. PyCall + PyTorch — call Python GPU code from Ruby with pycall gem. Mature ecosystem, minimal CUDA knowledge needed. For a Python-centric take on GPU acceleration, see accelerating LLMs with CUDA and Python.
  2. GPU microservice — run GPU workloads as a separate service (Python/C++) behind an HTTP or gRPC API. Your Ruby app sends data, gets results back.
  3. Numo + OpenBLAS — for matrix math, CPU-optimized BLAS can be surprisingly fast. A 1000x1000 matrix multiply takes ~50ms on a modern CPU with OpenBLAS.
  4. Rust + Magnus — write a Rust extension using the cust crate for CUDA, expose it to Ruby via Magnus. Type-safe and fast.
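Option 2 can be as little as a JSON POST from Ruby. A minimal client sketch — the endpoint URL and payload shape are assumptions about a service you would build, not an existing API:

```ruby
require 'json'
require 'net/http'

# Hypothetical GPU microservice client: send two vectors, get the sum back.
# The URL and the {a:, b:} / {"result": [...]} contract are assumptions.
def gpu_vector_add(a, b, url: URI('http://localhost:8000/vector_add'))
  res = Net::HTTP.post(url, { a: a, b: b }.to_json,
                       'Content-Type' => 'application/json')
  JSON.parse(res.body).fetch('result')
end
```

The serialization cost matters here too: for very large arrays, JSON overhead can eat the GPU win, so binary formats or gRPC are worth considering.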

Conclusion

Ruby can drive CUDA through FFI wrappers. The pattern is always the same: compile your kernels into a shared library, write thin C wrapper functions for kernel launches, and call those wrappers from Ruby.

The integration works best when you have an existing Ruby application with one identifiable bottleneck that involves large-scale parallel computation. For those cases, a single .cu file and a few FFI bindings can deliver 10-100x speedups on the hot path without rewriting your application.

For new projects where GPU computing is the core requirement, Python or C++ remain better choices. But for adding targeted GPU acceleration to a Ruby system, this approach is practical and production-viable.