Grok-1 has 314 billion parameters. That number sounds impressive until you understand what it actually means for inference.
The model uses a Mixture of Experts (MoE) architecture. Only about 25% of the weights, roughly 78 billion parameters, are active for a given token. That still makes Grok-1 computationally cheaper per token at inference than GPT-3's 175B dense model.
How MoE Works in Practice
The architecture routes each token to a subset of specialized experts. Think of it as having 8 specialists instead of one generalist. Each token only consults 2 experts per layer.
class MixtureOfExperts:
    def __init__(self, num_experts=8, expert_capacity=2):
        self.experts = [Expert() for _ in range(num_experts)]
        self.router = Router()
        self.expert_capacity = expert_capacity  # experts consulted per token

    def forward(self, input_tokens):
        # The router scores every expert for every token.
        expert_weights, expert_indices = self.router(input_tokens)
        outputs = []
        for i, token in enumerate(input_tokens):
            # Combine only the top-k experts chosen for this token.
            token_output = 0
            for expert_idx in expert_indices[i][:self.expert_capacity]:
                expert_output = self.experts[expert_idx](token)
                token_output += expert_output * expert_weights[i][expert_idx]
            outputs.append(token_output)
        return outputs

The router decides which experts handle which tokens. This decision happens at every MoE layer. Bad routing leads to dead experts or overloaded ones. Load balancing is critical.
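A common mitigation in MoE training is an auxiliary load-balancing loss in the style of Switch Transformer. Whether Grok-1 uses this exact formulation is not confirmed here, so treat the sketch below as illustrative; it penalizes routing distributions that drift away from uniform expert usage.

import jax
import jax.numpy as jnp

def load_balancing_loss(router_probs, expert_indices, num_experts=8):
    """Auxiliary loss that grows when tokens pile onto a few experts.

    router_probs:   [num_tokens, num_experts] softmax output of the router.
    expert_indices: [num_tokens] top-1 expert chosen for each token.
    """
    # Fraction of tokens dispatched to each expert.
    dispatch = jax.nn.one_hot(expert_indices, num_experts)
    tokens_per_expert = dispatch.mean(axis=0)
    # Mean router probability assigned to each expert.
    mean_router_prob = router_probs.mean(axis=0)
    # Minimized when both distributions are uniform over experts.
    return num_experts * jnp.sum(tokens_per_expert * mean_router_prob)

Added to the training loss with a small coefficient, a term like this keeps gradient pressure on the router to spread tokens across experts.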
Why JAX Instead of PyTorch
xAI built Grok-1 with JAX. This choice has practical implications.
JAX compiles to XLA. This means better performance on TPUs and modern GPUs. The functional paradigm also makes distributed training more predictable.
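A toy example (not Grok-1 code) shows the pattern JAX encourages: a pure function over explicit parameters that XLA traces once and compiles for the target accelerator.

import jax
import jax.numpy as jnp

@jax.jit  # traced once, compiled by XLA, reused on subsequent calls
def mlp_forward(params, x):
    # Pure function: parameters and inputs are explicit arguments, with no
    # hidden module state, which is what makes compilation and sharding predictable.
    hidden = jax.nn.gelu(x @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]

A Grok-style transformer block expressed with Flax modules follows the same functional approach: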
import jax
import jax.numpy as jnp
from flax import linen as nn

class GrokTransformerBlock(nn.Module):
    def setup(self):
        self.attention = nn.MultiHeadDotProductAttention(
            num_heads=32,
            dtype=jnp.bfloat16
        )
        self.moe_layer = MixtureOfExpertsLayer(
            num_experts=8,
            expert_capacity=2
        )
        self.layer_norm1 = nn.LayerNorm()
        self.layer_norm2 = nn.LayerNorm()

    def __call__(self, x):
        normed_x = self.layer_norm1(x)
        attended = self.attention(normed_x)
        x = x + attended
        normed_x = self.layer_norm2(x)
        expert_output = self.moe_layer(normed_x)
        x = x + expert_output
        return x

The downside: fewer tutorials, a smaller community, a steeper learning curve. If you know PyTorch well, expect adjustment time.
Expert Routing Details
The routing mechanism determines model quality. Here is a simplified version:
class ExpertRouter(nn.Module):
    num_experts: int = 8
    noise_epsilon: float = 1e-2

    def setup(self):
        self.gate = nn.Dense(self.num_experts)

    def __call__(self, x, training=False):
        logits = self.gate(x)
        if training:
            # Jitter the logits during training so routing does not collapse
            # onto the same few experts.
            noise = jax.random.normal(self.make_rng('gating'), logits.shape)
            logits = logits + noise * self.noise_epsilon
        gates = jax.nn.softmax(logits)
        top_k_gates, top_k_indices = jax.lax.top_k(gates, k=2)
        return top_k_gates, top_k_indices

Training noise prevents expert collapse. Without it, the router learns to always pick the same experts. This wastes capacity.
Hardware Requirements
Running Grok-1 requires serious infrastructure:
- GPU Memory: multiple 80GB-class accelerators (A100 80GB or H100); the full checkpoint does not fit on a single card (see the estimate after this list)
- System RAM: 64GB minimum
- Storage: 500GB for model weights
- Fine-tuning: Multiply everything by 3-4x
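Here is that estimate; the bytes-per-parameter figures are generic precision assumptions, not a statement about any particular checkpoint format.

# Weight memory at two common precisions (assumed, not checkpoint-specific).
total_params = 314e9

print(f"bf16: {total_params * 2 / 1e9:.0f} GB")  # ~628 GB
print(f"int8: {total_params * 1 / 1e9:.0f} GB")  # ~314 GB
# Either way, the weights alone exceed a single 80GB accelerator,
# so the model must be sharded across several devices.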
Cost comparison per 1M tokens:
- AWS p4d.24xlarge: ~$2.40
- Google Cloud TPU v4: ~$1.80
- Azure NC96ads A100 v4: ~$2.10
- GPT-4 API: ~$30.00
Self-hosting only makes sense at high volumes. The break-even point requires significant daily token throughput.
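To make the break-even intuition concrete, here is a toy calculation. The hourly rate and throughput are placeholder assumptions; only the API price comes from the table above.

# Toy break-even sketch: hardware rate and throughput are assumptions.
api_price_per_1m = 30.00        # GPT-4 API price per 1M tokens (table above)
node_cost_per_hour = 35.00      # assumed hourly rate for a multi-GPU node
tokens_per_hour = 10_000_000    # assumed sustained throughput at full utilization

self_hosted_per_1m = node_cost_per_hour / (tokens_per_hour / 1_000_000)
print(f"self-hosted: ${self_hosted_per_1m:.2f} per 1M tokens")  # $3.50 here
# The hourly cost is paid whether or not the node is busy, so at low
# utilization the effective per-token price climbs past the API.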
Code Generation Performance
Grok-1 produces working code with reasonable algorithmic choices:
def find_primes(n):
    """Return all prime numbers up to n using Sieve of Eratosthenes."""
    if n < 2:
        return []
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n**0.5) + 1):
        if sieve[i]:
            for j in range(i*i, n + 1, i):
                sieve[j] = False
    return [i for i in range(2, n + 1) if sieve[i]]

It picks the Sieve of Eratosthenes over naive trial division. This suggests good training data for algorithmic problems.
Where Grok-1 Works Well
Testing shows strength in three areas:
- Code analysis: Identifies bottlenecks, memory leaks, security issues
- Language translation: Converts between programming languages while preserving idioms
- Technical documentation: Generates accurate API docs and explanations
Known Limitations
Memory management is unpredictable. Long sequences can cause OOM errors without warning. Use gradient checkpointing:
def forward_with_checkpointing(model, params, inputs):
    # Rematerialize intermediate activations during the backward pass
    # instead of storing them, trading compute for memory.
    return jax.checkpoint(model.apply)(params, inputs)

Expert load balancing degrades over long fine-tuning runs. Monitor expert utilization metrics.
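A minimal way to track that metric is to log the share of tokens each expert receives per batch. The sketch below assumes you can collect the router's per-token expert choices (`expert_indices`) during a training step.

import jax.numpy as jnp

def expert_utilization(expert_indices, num_experts=8):
    """Fraction of routed tokens each expert received in this batch."""
    counts = jnp.bincount(expert_indices.reshape(-1), length=num_experts)
    return counts / counts.sum()

# Healthy routing stays near uniform (1/num_experts per expert); values
# collapsing onto a few experts signal degraded load balancing.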
The model sometimes over-explains simple concepts. It can also present incorrect information confidently. Verify outputs.
Practical Starting Point
Begin with a reduced configuration:

config = {
    'num_layers': 24,    # Full model: 64
    'num_experts': 4,    # Full model: 8
    'hidden_dim': 2048,  # Full model: 4096
}

# Rematerialize activations (gradient checkpointing) to save memory.
model = nn.remat(GrokModel)(config)

@jax.jit
def generate_text(params, prompt_tokens):
    return model.apply(params, prompt_tokens, method=model.generate)

def track_memory():
    for device in jax.devices():
        memory_info = device.memory_stats()
        print(f"Device {device}: {memory_info}")

Start small. Verify everything works. Scale up incrementally.
When to Use Grok-1
Use it if you have:
- High-volume specialized workloads
- Need for domain-specific fine-tuning
- Adequate hardware infrastructure
- Interest in understanding the architecture
Skip it if you need:
- General-purpose applications
- Cost efficiency at low volumes
- Fast prototyping
- Minimal ML infrastructure
Summary
Grok-1's 314B parameter count is marketing. The number that matters at inference is the roughly 25% of weights (about 78B parameters) active per token. The MoE architecture trades total parameter count for specialized routing.
The JAX implementation is well-engineered. The open weights enable research and customization. Hardware requirements limit practical deployment to well-funded teams.
For most developers, API-based models remain more practical. Grok-1 matters for those building specialized AI systems at scale or researching transformer architectures.