Ruby is slow. GPUs are fast. Can we combine them? Yes, but with caveats.
What is CUDA?
CUDA is NVIDIA's platform for GPU programming. It lets you run code on graphics cards. GPUs have thousands of cores. They process data in parallel.
CPUs handle complex tasks sequentially. GPUs handle simple tasks simultaneously. This difference matters for certain workloads.
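To make that concrete, here is the CPU version of the vector addition we will push to the GPU later: plain Ruby, one element at a time.

# Sequential on one core: fine for small n, slow for millions of elements.
a = Array.new(1024) { rand }
b = Array.new(1024) { rand }
c = a.zip(b).map { |x, y| x + y }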
Ruby and Performance
Ruby prioritizes developer happiness over speed. It was never designed for number crunching. The Global VM Lock (GVL, often called the GIL) prevents Ruby threads from executing Ruby code in parallel.
So why use Ruby with CUDA? Two reasons:
- Your app is already in Ruby
- You need to offload specific heavy computations
Setting Up
You need three things:
- An NVIDIA GPU
- CUDA Toolkit installed
- A Ruby CUDA binding
Install the toolkit from NVIDIA's site. Then add a gem:
gem install cuda

Note: The cuda gem has limited maintenance. Check for alternatives like cumo or FFI bindings to CUDA libraries.
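Before writing any Ruby, confirm the toolkit and driver actually see your card:

nvcc --version    # compiler that ships with the CUDA Toolkit
nvidia-smi        # driver version and GPU status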
Basic Example: Vector Addition
Here is a simple kernel that adds two arrays:
require 'cuda'
include Cuda

kernel_code = <<~CUDA
  extern "C"
  __global__ void vector_add(float *a, float *b, float *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
      c[index] = a[index] + b[index];
  }
CUDA

program = Cuda::Program.new(kernel_code)

n = 1024
a = Array.new(n) { rand }
b = Array.new(n) { rand }

# Allocate GPU memory; pack('F*') encodes 32-bit native floats, 4 bytes each
a_gpu = program.malloc_and_copy(a.pack('F*'))
b_gpu = program.malloc_and_copy(b.pack('F*'))
c_gpu = program.malloc(n * 4)

# Run the kernel; round the block count up so every element gets a thread
threads_per_block = 256
blocks = (n + threads_per_block - 1) / threads_per_block
program.launch(
  'vector_add',
  a_gpu, b_gpu, c_gpu, n,
  grid: [blocks, 1, 1],
  block: [threads_per_block, 1, 1]
)

# Get results back
result = "\x00" * (n * 4)
c_gpu.copy_to_host(result, n * 4)
c = result.unpack('F*')

# Clean up
program.free(a_gpu)
program.free(b_gpu)
program.free(c_gpu)

This is verbose. The data transfer overhead is significant for small arrays.
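A quick way to see that overhead is to time both paths. This is only a sketch: gpu_vector_add is a hypothetical helper wrapping the allocate/launch/copy-back steps above.

require 'benchmark'

n = 1024
a = Array.new(n) { rand }
b = Array.new(n) { rand }

cpu_s = Benchmark.realtime { a.zip(b).map { |x, y| x + y } }
# gpu_vector_add is assumed to bundle malloc, launch, and copy-back
gpu_s = Benchmark.realtime { gpu_vector_add(a, b) }

puts format('CPU: %.6fs, GPU incl. transfers: %.6fs', cpu_s, gpu_s)

At n = 1024, expect the CPU to win comfortably: the transfers and kernel launch overhead dominate the actual arithmetic.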
When GPU Acceleration Makes Sense
GPU acceleration helps when:
- You process millions of elements
- The computation is parallelizable
- Data transfer time is small relative to compute time
GPU acceleration hurts when:
- Arrays are small (under 10,000 elements)
- Operations are sequential
- You constantly move data between CPU and GPU
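A back-of-the-envelope check makes the small-array case concrete, assuming roughly 12 GB/s of effective PCIe bandwidth (an assumption; measure your own link):

n = 10_000
bytes = n * 4 * 3            # two float32 inputs plus one output
seconds = bytes / 12e9       # assumed effective PCIe bandwidth
puts format('~%.0f microseconds spent purely on transfers', seconds * 1e6)
# => ~10 microseconds, before the kernel does any work at all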
Realistic Use Cases for Ruby + CUDA
Image Processing
Batch process thousands of images. Keep data on GPU between operations.
# Pseudocode - the helpers depend on your CUDA bindings
images.each_slice(1000) do |batch|
  gpu_batch = upload_to_gpu(batch)    # one upload per batch
  apply_filter(gpu_batch)             # these stay on the GPU
  apply_resize(gpu_batch)
  apply_normalize(gpu_batch)
  download_from_gpu(gpu_batch)        # one download per batch
end

Matrix Operations
Large matrix multiplications benefit from GPU parallelism. Libraries like cumo wrap cuBLAS.
require 'cumo/narray'

a = Cumo::SFloat.new(1000, 1000).rand
b = Cumo::SFloat.new(1000, 1000).rand
c = a.dot(b) # Runs on GPU

Machine Learning Inference
Load a pre-trained model. Run inference on GPU. Return results to Ruby.
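One pragmatic route is PyCall plus PyTorch rather than a native Ruby binding. A minimal sketch, assuming the pycall gem, a Python environment with torch installed, and a TorchScript model at the hypothetical path model.pt:

require 'pycall/import'
include PyCall::Import

pyimport :torch

# Load a pre-trained TorchScript model and move it to the GPU.
model = torch.jit.load('model.pt').to('cuda')
model.train(false)  # inference mode; avoids Ruby's Kernel#eval name clash

# Dummy batch: 1 image, 3 channels, 224x224.
input = torch.rand(1, 3, 224, 224).to('cuda')
output = model.(input)

# .item returns a Python scalar, which PyCall converts to a Ruby Integer.
predicted_class = output.argmax(1).item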
The Honest Truth
Ruby is not the right choice for GPU-heavy applications. Python has better tooling. C++ has better performance.
Use Ruby + CUDA when:
- You have an existing Ruby system
- You need to accelerate one specific bottleneck
- The bottleneck involves large parallel computations
Do not use Ruby + CUDA when:
- Building a new GPU-focused application
- You need tight GPU integration throughout
- Performance is the primary concern
Alternatives to Consider
- Call Python from Ruby - Use PyCall to access PyTorch or TensorFlow
- Use a service - Run GPU code as a microservice
- FFI bindings - Call CUDA C libraries directly (see the sketch after this list)
- Numo + OpenBLAS - CPU-optimized but still fast for many workloads
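As a taste of the FFI route, here is a minimal sketch binding three CUDA runtime calls directly. It assumes the ffi gem and that libcudart is on your library path; real code needs error handling on every call.

require 'ffi'

module CudaRT
  extend FFI::Library
  ffi_lib 'cudart'

  # cudaError_t cudaMalloc(void **devPtr, size_t size)
  attach_function :cudaMalloc, [:pointer, :size_t], :int
  # cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, int kind)
  attach_function :cudaMemcpy, [:pointer, :pointer, :size_t, :int], :int
  # cudaError_t cudaFree(void *devPtr)
  attach_function :cudaFree, [:pointer], :int
end

HOST_TO_DEVICE = 1  # cudaMemcpyHostToDevice
n = 1024
bytes = n * 4

host = FFI::MemoryPointer.new(:float, n)
dev  = FFI::MemoryPointer.new(:pointer)

raise 'cudaMalloc failed' unless CudaRT.cudaMalloc(dev, bytes).zero?
CudaRT.cudaMemcpy(dev.read_pointer, host, bytes, HOST_TO_DEVICE)
# ...launch kernels through a compiled library here...
CudaRT.cudaFree(dev.read_pointer)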
Conclusion
Ruby can use CUDA. The integration works. But the ecosystem is limited.
For occasional GPU acceleration in existing Ruby apps, it is viable. For new projects with heavy GPU requirements, choose a different language.
Pick the right tool for the job. Sometimes that means stepping outside Ruby.