How to Train an LLM: Fine-Tuning Guide with Python & Hugging Face

Let me be real with you. Training an LLM from scratch is not something you will do on your laptop. Or even on a beefy workstation. We are talking thousands of GPU hours. Millions of dollars in compute.

What you can do is fine-tune. Take a pre-trained model. Adapt it to your domain. This is practical. This is what most of us actually do.

What You Need

Hardware reality check:

  • Minimum: 16GB GPU (RTX 4080, A4000)
  • Comfortable: 24GB GPU (RTX 4090, A5000)
  • Production: Multiple A100s or H100s

No GPU? Use Google Colab Pro ($10/month) or Lambda Labs ($1.10/hour for A10).

Software stack:

pip install torch transformers datasets accelerate peft bitsandbytes

Python 3.10+. CUDA 11.8 or higher.
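
Before installing anything heavier, it is worth confirming that PyTorch actually sees your GPU. A minimal sanity check:

import torch

print(torch.__version__)                  # PyTorch build
print(torch.version.cuda)                 # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())          # should print True if the GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 4090"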

Pick Your Base Model

Start small. DistilGPT-2 has 82M parameters. Good for learning. Llama-2-7B is production-ready but needs 14GB+ of VRAM just to hold its weights in half precision.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # Start here
# model_name = "meta-llama/Llama-2-7b-hf"  # Graduate to this

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# DistilGPT2 needs a pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
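
If you want to sanity-check what you just loaded, counting parameters is a one-liner (not required for training, just a quick look):

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")  # roughly 82M for distilgpt2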

Prepare Your Dataset

Your data quality determines your results. Garbage in, garbage out.

from datasets import load_dataset, Dataset

# Option 1: Use existing datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Option 2: Load your own data
def load_custom_data(file_path):
    with open(file_path, "r") as f:
        texts = f.read().split("\n\n")  # Split by paragraphs
    return Dataset.from_dict({"text": texts})

# Tokenize everything
def tokenize(examples):
    # No return_tensors here: map() stores lists, and the data collator
    # turns batches into tensors at training time
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

512 tokens per sample is a reasonable starting point. Longer contexts need more VRAM.
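
Before settling on a max_length, it helps to see how long your samples actually are in tokens. A rough sketch using the tokenizer and dataset loaded above:

sample_texts = dataset["train"]["text"][:1000]  # look at a slice, not the whole split
lengths = [len(tokenizer(t)["input_ids"]) for t in sample_texts]
print(f"mean: {sum(lengths) / len(lengths):.0f} tokens, max: {max(lengths)}")

If most samples fit well under 512 tokens, you can shrink max_length and save memory.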

The Training Loop

Here is a basic training setup. Works on a single GPU.

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal LM, not masked LM
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,  # Lower if OOM
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
    fp16=True,  # Mixed precision, saves memory
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)

trainer.train()
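
If a run dies partway through, you do not have to start over. The Trainer can resume from the newest checkpoint in output_dir:

trainer.train(resume_from_checkpoint=True)  # picks up from the latest checkpoint in ./checkpoints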

When You Run Out of Memory

You will. Everyone does. Here are your options.

Option 1: Gradient checkpointing

model.gradient_checkpointing_enable()

Trades compute for memory: training runs slower (typically on the order of 20%), but activation memory drops substantially.
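
One related detail: the KV cache only matters at generation time and clashes with checkpointing, so it is usually disabled during training. The Trainer often handles this for you, but being explicit avoids warnings:

model.config.use_cache = False  # cache is for inference, not training; turn it back on when generating

# Or let the Trainer enable checkpointing instead of calling it on the model:
# training_args = TrainingArguments(..., gradient_checkpointing=True)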

Option 2: LoRA (Low-Rank Adaptation)

Only train a small adapter. Keeps base model frozen.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # GPT-2 specific
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameters; typically well under 1% is trainable

LoRA makes this practical. Combined with quantization (next option), you can fine-tune a 7B model on about 8GB of VRAM.
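
One PEFT detail worth knowing: save_pretrained on a LoRA-wrapped model writes only the small adapter, not the base weights. If you want a single standalone model for deployment, merge the adapter back in with merge_and_unload:

# Saves just the adapter (a few MB); reload it on top of the base model later
model.save_pretrained("./my-lora-adapter")

# Or fold the adapter into the base weights and save a normal standalone model
merged = model.merge_and_unload()
merged.save_pretrained("./my-merged-model")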

Option 3: Quantization

Load model in 8-bit or 4-bit.

from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # the bare load_in_8bit kwarg is deprecated
    device_map="auto"
)

4-bit + LoRA = QLoRA. This is how people fine-tune models in the 30B-70B range on a single 48GB GPU instead of a cluster.
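
A minimal QLoRA-style setup looks roughly like this. It is a sketch assuming recent transformers, peft, and bitsandbytes: load the base model in 4-bit, prepare it for k-bit training, then attach the LoRA adapter from Option 2 (for Llama-style models you would target q_proj, k_proj, v_proj, o_proj instead of the GPT-2 module names):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on older GPUs
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, enables gradient checkpointing
model = get_peft_model(model, lora_config)      # reuse the LoraConfig from Option 2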

Evaluate Your Model

Perplexity is the standard metric. Lower is better.

import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")

But perplexity does not tell the whole story. Generate some text. Read it. Does it sound right?

def generate_text(prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,  # counts generated tokens only, not the prompt
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_text("The future of AI is"))

Save and Deploy

Save your fine-tuned model:

# Full model
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")

# Or push to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model")
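
Reloading later is the mirror image of saving. This assumes the full-model save above; a LoRA adapter instead gets loaded onto its base model with PeftModel.from_pretrained:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./my-fine-tuned-model")
tokenizer = AutoTokenizer.from_pretrained("./my-fine-tuned-model")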

For production inference, look into:

  • vLLM: Fast inference server, handles batching automatically (see the sketch after this list)
  • text-generation-inference: Hugging Face's production server
  • llama.cpp: CPU inference, good for edge deployment
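
To give a feel for the first option, here is a minimal offline vLLM sketch, assuming vLLM is installed and pointed at the model directory saved above (the OpenAI-compatible server is a separate command; see the vLLM docs):

from vllm import LLM, SamplingParams

llm = LLM(model="./my-fine-tuned-model")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)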

Common Mistakes

  1. Learning rate too high: Start at 2e-5. If loss spikes, go lower.

  2. Not enough data: Fine-tuning needs thousands of examples minimum. Tens of thousands for real improvement.

  3. Training too long: Watch validation loss. Stop when it plateaus or rises.

  4. Wrong task format: If you want instruction-following, format data as instructions, not raw text (see the sketch after this list).

  5. Ignoring the base model: Fine-tuning cannot fix fundamental model limitations. Choose your base wisely.
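
On point 4, "format data as instructions" just means wrapping every example in a consistent prompt/response template before tokenizing. The exact template is up to you; a minimal hypothetical version:

# Hypothetical template; what matters is using the same one at training and inference time
def format_instruction(example):
    return {
        "text": (
            "### Instruction:\n"
            f"{example['instruction']}\n\n"
            "### Response:\n"
            f"{example['response']}"
        )
    }

# Assumes a dataset with "instruction" and "response" columns:
# formatted = instruction_dataset.map(format_instruction)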

Realistic Expectations

Fine-tuning adapts style and domain knowledge. It does not create new capabilities. A fine-tuned 7B model will not match GPT-4.

What fine-tuning is good for:

  • Adapting to your company's writing style
  • Learning domain-specific terminology
  • Following a specific output format
  • Reducing unwanted behaviors

What it cannot do:

  • Make a small model as smart as a large one
  • Add knowledge the base model never saw
  • Fix fundamental reasoning limitations

Start with DistilGPT-2. Get the pipeline working. Then scale up. This approach saves time, money, and frustration.