Let me be real with you. Training an LLM from scratch is not something you will do on your laptop. Or even on a beefy workstation. We are talking thousands of GPU hours. Millions of dollars in compute.
What you can do is fine-tune. Take a pre-trained model. Adapt it to your domain. This is practical. This is what most of us actually do.
What You Need
Hardware reality check:
- Minimum: 16GB GPU (RTX 4080, A4000)
- Comfortable: 24GB GPU (RTX 4090, A5000)
- Production: Multiple A100s or H100s
No GPU? Use Google Colab Pro ($10/month) or Lambda Labs ($1.10/hour for A10).
Software stack:
pip install torch transformers datasets accelerate peft bitsandbytes
Python 3.10+. CUDA 11.8 or higher. If you need help with GPU setup, our guide on accelerating LLMs with CUDA and Python covers installation and verification in detail.
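Before you go further, confirm PyTorch can actually see your GPU. A quick sanity check, using nothing beyond PyTorch's own reporting:
import torch

# Report the installed PyTorch build and the CUDA version it was compiled against
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Name and total VRAM of the first GPU, so you know which hardware tier you are in
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")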
Pick Your Base Model
Start small. DistilGPT-2 has 82M parameters. Good for learning. Llama-2-7B is production-ready but needs 14GB+ VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "distilgpt2" # Start here
# model_name = "meta-llama/Llama-2-7b-hf" # Graduate to this
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# DistilGPT2 needs a pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
Prepare Your Dataset
Your data quality determines your results. Garbage in, garbage out.
from datasets import load_dataset, Dataset
# Option 1: Use existing datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
# Option 2: Load your own data
def load_custom_data(file_path):
    with open(file_path, "r") as f:
        texts = f.read().split("\n\n")  # Split by paragraphs
    return Dataset.from_dict({"text": texts})
# Tokenize everything
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt",
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
512 tokens per sample is a reasonable starting point. Longer contexts need more VRAM.
The Training Loop
Here is a basic training setup. Works on a single GPU.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal LM, not masked LM
)
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,   # Lower if OOM
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
    fp16=True,  # Mixed precision, saves memory
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)
trainer.train()
When You Run Out of Memory
You will. Everyone does. Here are your options.
Option 1: Gradient checkpointing
model.gradient_checkpointing_enable()
Trades compute for memory. Training is 20% slower but uses 60% less VRAM.
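If you go this route, also turn off the KV cache (it conflicts with checkpointing during training), or let the Trainer handle it through a flag. A short sketch, reusing the training_args from above:
# Manual route: enable checkpointing and disable the cache to avoid warnings
model.gradient_checkpointing_enable()
model.config.use_cache = False

# Or let the Trainer switch it on for you
training_args = TrainingArguments(
    output_dir="./checkpoints",
    gradient_checkpointing=True,
    # ...same remaining arguments as before
)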
Option 2: LoRA (Low-Rank Adaptation)
Only train a small adapter. Keeps base model frozen.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # GPT-2 specific
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",  # tells PEFT we're adapting a causal language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: a fraction of a percent of the total
LoRA makes this practical. Pair it with the quantization option below and you can fine-tune a 7B model on 8GB VRAM.
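When training finishes, you can keep the adapter separate (a few megabytes that load on top of the base model) or fold it back into the base weights. A short sketch using PEFT's standard save and merge calls; the paths are placeholders:
# Save only the LoRA adapter weights
model.save_pretrained("./my-lora-adapter")

# Or merge the adapter into the base model for a standalone checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")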
Option 3: Quantization
Load model in 8-bit or 4-bit.
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
4-bit + LoRA = QLoRA. This is how people fine-tune 70B models on consumer hardware. We have a complete walkthrough on fine-tuning LLMs with QLoRA on a single GPU if you want the full recipe.
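As a taste of what that looks like, here is a minimal QLoRA-style setup. Treat it as a sketch: it assumes recent transformers, peft, and bitsandbytes versions, plus Hub access to the Llama-2 weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bf16 compute: the standard QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on older GPUs
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated model, needs Hub access approval
    quantization_config=bnb_config,
    device_map="auto",
)

# Make the quantized model trainable, then attach LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)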
Evaluate Your Model
Perplexity is the standard metric. Lower is better.
import math
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")But perplexity does not tell the whole story. Generate some text. Read it. Does it sound right?
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_text("The future of AI is"))
Save and Deploy
Save your fine-tuned model:
# Full model
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
# Or push to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model")
For production inference, look into the following (a short vLLM example follows the list):
- vLLM: Fast inference server, handles batching automatically
- text-generation-inference: Hugging Face's production server
- llama.cpp: CPU inference, good for edge deployment
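For example, serving your saved model with vLLM can be this simple. A sketch, assuming vLLM is installed and you point it at a full (merged, not adapter-only) checkpoint:
from vllm import LLM, SamplingParams

# Point vLLM at the saved model directory (or a Hub repo id)
llm = LLM(model="./my-fine-tuned-model")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)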
Common Mistakes
Learning rate too high: Start at 2e-5. If loss spikes, go lower.
Not enough data: Fine-tuning needs thousands of examples minimum. Tens of thousands for real improvement.
Training too long: Watch validation loss. Stop when it plateaus or rises. The early-stopping sketch after this list automates that check.
Wrong task format: If you want instruction-following, format data as instructions. Not raw text.
Ignoring the base model: Fine-tuning cannot fix fundamental model limitations. Choose your base wisely.
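For the "training too long" problem, the Trainer can stop for you when validation loss stops improving. A small sketch using the built-in callback; it relies on the eval settings and load_best_model_at_end=True from the TrainingArguments above (the best-model metric then defaults to eval loss):
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    # Stop if eval loss fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)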
Realistic Expectations
Fine-tuning adapts style and domain knowledge. It does not create new capabilities. A fine-tuned 7B model will not match GPT-4. If your use case is mostly about answering questions over your own documents, building a RAG system might be a better fit than fine-tuning.
What fine-tuning is good for:
- Adapting to your company's writing style
- Learning domain-specific terminology
- Following a specific output format
- Reducing unwanted behaviors
What it cannot do:
- Make a small model as smart as a large one
- Add knowledge the base model never saw
- Fix fundamental reasoning limitations
Start with DistilGPT-2. Get the pipeline working. Then scale up. This approach saves time, money, and frustration.