I moved our 7B model inference from CPU to GPU last month. The result: 15x faster. Here's exactly how to get the same gains.
The Numbers
Let's start with real measurements. I ran these on identical prompts, same model weights.
| Hardware | Tokens/sec | Memory |
|---|---|---|
| CPU (16-core, 64GB) | 12 | 28GB RAM |
| RTX 4090 (24GB) | 186 | 14GB VRAM |
| A100 (80GB) | 312 | 14GB VRAM |
That's not a typo. GPU inference is 15-26x faster. Memory usage drops by half with fp16.
Setup That Actually Works
Most CUDA issues come from version mismatches. Here's the correct order.
```bash
# Check your GPU first
nvidia-smi

# Install CUDA 11.8 (works with PyTorch 2.1+)
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
sudo sh cuda_11.8.0_520.61.05_linux.run

# PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify
python -c "import torch; print(torch.cuda.is_available())"
```

If that prints `True`, you're good.
Verification Script
Run this before anything else. It catches 90% of setup problems.
```python
import torch

def verify_cuda():
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if not torch.cuda.is_available():
        print("CUDA not detected. Check driver installation.")
        return False
    print(f"CUDA version: {torch.version.cuda}")
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}")
        print(f"  VRAM: {props.total_memory / 1024**3:.1f} GB")
        print(f"  Compute: {props.major}.{props.minor}")
    # Quick sanity check
    x = torch.randn(1000, 1000, device="cuda")
    y = torch.mm(x, x)
    torch.cuda.synchronize()
    print("Matrix multiplication: OK")
    return True

if __name__ == "__main__":
    verify_cuda()
```

Production Inference Class
This is the code I use in production. Copy it.
```python
import time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


class LLMInference:
    def __init__(self, model_name: str, fp16: bool = True):
        self.device = self._select_gpu()
        self.fp16 = fp16
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.model = self._load_model(model_name)
        self._warmup()

    def _select_gpu(self) -> torch.device:
        if not torch.cuda.is_available():
            return torch.device("cpu")
        # Pick the GPU with the most memory
        memories = [torch.cuda.get_device_properties(i).total_memory
                    for i in range(torch.cuda.device_count())]
        best = memories.index(max(memories))
        return torch.device(f"cuda:{best}")

    def _load_model(self, name: str):
        dtype = torch.float16 if self.fp16 else torch.float32
        model = AutoModelForCausalLM.from_pretrained(
            name,
            torch_dtype=dtype,
            device_map="auto",
            low_cpu_mem_usage=True
        )
        # PyTorch 2.0+ compilation
        if hasattr(torch, "compile"):
            model = torch.compile(model)
        model.eval()
        return model

    def _warmup(self):
        dummy = torch.tensor([[1, 2, 3]], device=self.device)
        with torch.no_grad():
            self.model(dummy)
        if self.device.type == "cuda":
            torch.cuda.synchronize()

    def generate(self, prompt: str, max_tokens: int = 100) -> dict:
        start = time.perf_counter()
        inputs = self.tokenizer(prompt, return_tensors="pt")
        input_ids = inputs.input_ids.to(self.device)
        input_len = input_ids.shape[1]
        with torch.no_grad(), torch.cuda.amp.autocast(enabled=self.fp16):
            outputs = self.model.generate(
                input_ids,
                max_new_tokens=max_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                use_cache=True,
                pad_token_id=self.tokenizer.pad_token_id
            )
        text = self.tokenizer.decode(outputs[0][input_len:], skip_special_tokens=True)
        elapsed = time.perf_counter() - start
        tokens = outputs.shape[1] - input_len
        return {
            "text": text,
            "tokens": tokens,
            "time_s": elapsed,
            "tokens_per_sec": tokens / elapsed,
            "vram_gb": torch.cuda.memory_allocated() / 1024**3 if torch.cuda.is_available() else 0.0
        }
```

Usage:
```python
llm = LLMInference("mistralai/Mistral-7B-v0.1")
result = llm.generate("Explain CUDA in one sentence.")
print(f"{result['tokens_per_sec']:.1f} tokens/sec")
```

Memory Optimization
Memory is your bottleneck. These techniques actually work.
Half Precision
FP16 cuts memory in half with minimal quality loss.
```python
model = model.half()  # That's it
```

For training, use AMP:
```python
scaler = torch.cuda.amp.GradScaler()

with torch.cuda.amp.autocast():
    loss = model(**batch).loss

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

Gradient Checkpointing
Trades compute for memory during training. Essential for large models.
```python
model.gradient_checkpointing_enable()
```

This reduces memory by 60% on a 7B model. Training slows by ~20%.
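To see where checkpointing fits in an actual fine-tuning loop, here is a minimal sketch combining it with the AMP pattern above. Treat it as an illustration only: `model`, `optimizer`, and `dataloader` are placeholders, not code from this article.

```python
import torch

# Sketch only: assumes `model` is a Hugging Face causal LM already on the GPU,
# and `optimizer` / `dataloader` are defined elsewhere (placeholders).
model.gradient_checkpointing_enable()   # recompute activations during backward
model.config.use_cache = False          # the KV cache is incompatible with checkpointing
model.train()

scaler = torch.cuda.amp.GradScaler()
for batch in dataloader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss
    scaler.scale(loss).backward()        # activations are recomputed here
    scaler.step(optimizer)
    scaler.update()
```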
Dynamic Batching
Process multiple prompts efficiently:
```python
def batch_generate(model, tokenizer, prompts, batch_size=8):
    # Decoder-only models need left padding so generated tokens follow each prompt
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True
        ).to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                pad_token_id=tokenizer.pad_token_id
            )
        for j, output in enumerate(outputs):
            text = tokenizer.decode(output[len(inputs.input_ids[j]):],
                                    skip_special_tokens=True)
            results.append(text)
    return results
```
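Usage mirrors the single-prompt path, reusing the `llm` instance from earlier (the prompts below are just placeholders):

```python
prompts = [
    "Explain CUDA in one sentence.",
    "What is a tensor core?",
    "Why is fp16 faster than fp32?",
]
texts = batch_generate(llm.model, llm.tokenizer, prompts, batch_size=2)
for text in texts:
    print(text)
```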
Benchmarking Script

Measure before you optimize.
```python
import torch
import time

def benchmark(model, tokenizer, prompt, seq_lengths=[50, 100, 200]):
    device = next(model.parameters()).device
    for length in seq_lengths:
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        start = time.perf_counter()
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=length, do_sample=False)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        tokens = outputs.shape[1] - inputs.input_ids.shape[1]
        vram = torch.cuda.memory_allocated() / 1024**3
        print(f"Length {length}: {tokens/elapsed:.1f} tok/s, {vram:.2f} GB VRAM")
```

Typical output on an RTX 4090 with Mistral-7B:
```
Length 50: 189.2 tok/s, 14.21 GB VRAM
Length 100: 185.7 tok/s, 14.23 GB VRAM
Length 200: 178.4 tok/s, 14.31 GB VRAM
```
Production Dockerfile
```dockerfile
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu118
RUN pip3 install transformers accelerate

COPY . /app
WORKDIR /app

ENV CUDA_VISIBLE_DEVICES=0
ENV PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
CMD ["python3", "server.py"]
```

Run with:

```bash
docker run --gpus all -p 8000:8000 your-image
```
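The CMD above expects a `server.py`, which this article doesn't include. As a rough sketch only (not the production server), here is a minimal standard-library HTTP wrapper around the `LLMInference` class; the `inference` module name is an assumption, so adjust the import to your layout.

```python
# server.py -- minimal single-threaded sketch, not the production server.
# Assumes the LLMInference class above is importable from inference.py (assumption).
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from inference import LLMInference

llm = LLMInference("mistralai/Mistral-7B-v0.1")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expects a JSON body like {"prompt": "...", "max_tokens": 100}
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        result = llm.generate(payload.get("prompt", ""),
                              max_tokens=int(payload.get("max_tokens", 100)))
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```

In production you'd put a proper application server and request batching in front of this; the sketch only shows where the inference class plugs in.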
Common Problems

OOM errors: Reduce batch size. Enable gradient checkpointing. Use fp16. (A retry sketch follows this list.)
Slow loading: Use device_map="auto" and low_cpu_mem_usage=True.
Version mismatch: CUDA toolkit version must match PyTorch's expected version. Check torch.version.cuda.
Inconsistent speeds: Always warm up the model. First inference is slow due to kernel compilation.
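For the OOM case, the usual runtime pattern is to catch the error and retry with a smaller batch. Here's a minimal sketch reusing `batch_generate` from the batching section; halving is just one reasonable backoff strategy:

```python
import torch

def generate_with_backoff(model, tokenizer, prompts, batch_size=8):
    # Halve the batch size on CUDA OOM until generation fits in VRAM.
    while batch_size >= 1:
        try:
            return batch_generate(model, tokenizer, prompts, batch_size=batch_size)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            print(f"OOM at batch_size={batch_size}, halving")
            batch_size //= 2
    raise RuntimeError("Out of memory even at batch_size=1")
```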
When to Skip CUDA
GPU isn't always better; a rough device-selection sketch follows the list below.
- Models under 1B parameters: CPU is often fine
- Infrequent requests: GPU idle time costs money
- Development: CPU debugging is easier
- VRAM too small: Constant swapping is slower than CPU
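One way to encode these rules of thumb in code: the 1B cutoff comes from the list above, the 2-bytes-per-parameter figure matches this article's fp16 numbers, and the 1.5x headroom factor is my assumption to cover activations and the KV cache.

```python
import torch

def pick_device(num_params: int, headroom: float = 1.5) -> torch.device:
    """Rough heuristic: sub-1B models run fine on CPU; skip the GPU if the model won't fit in VRAM."""
    if num_params < 1_000_000_000 or not torch.cuda.is_available():
        return torch.device("cpu")
    needed = num_params * 2 * headroom  # fp16 weights plus headroom, in bytes
    total = torch.cuda.get_device_properties(0).total_memory
    return torch.device("cuda:0") if needed <= total else torch.device("cpu")

# Example: a 7B model in fp16 needs roughly 7e9 * 2 * 1.5 ≈ 21 GB of VRAM
print(pick_device(7_000_000_000))
```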
Key Takeaways
The 15x speedup came from:
- FP16 inference (half memory, same speed)
- PyTorch 2.0 compilation (10-20% faster)
- KV caching enabled (essential)
- Proper warmup (consistent latency)
- Right batch sizes (maximize throughput)
Start with the inference class above. Measure your baseline. Optimize from there.
The code in this article runs in production. It handles millions of requests per day. Use it.