Let me be real with you. Training an LLM from scratch is not something you will do on your laptop. Or even on a beefy workstation. We are talking thousands of GPU hours. Millions of dollars in compute.
What you can do is fine-tune. Take a pre-trained model. Adapt it to your domain. This is practical. This is what most of us actually do.
What You Need
Hardware reality check:
- Minimum: 16GB GPU (RTX 4080, A4000)
- Comfortable: 24GB GPU (RTX 4090, A5000)
- Production: Multiple A100s or H100s
No GPU? Use Google Colab Pro ($10/month) or Lambda Labs ($1.10/hour for A10).
Software stack:
pip install torch transformers datasets accelerate peft bitsandbytes
Python 3.10+. CUDA 11.8 or higher. If you need help with GPU setup, our guide on accelerating LLMs with CUDA and Python covers installation and verification in detail.
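Before you go further, confirm PyTorch can actually see your GPU. A quick sanity check, using nothing beyond PyTorch's own reporting:
import torch

# Report the installed PyTorch build and the CUDA version it was compiled against
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Name and total VRAM of the first GPU, so you know which hardware tier you are in
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")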
Pick Your Base Model
Start small. DistilGPT-2 has 82M parameters. Good for learning. Llama-2-7B is production-ready but needs 14GB+ VRAM.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "distilgpt2" # Start here
# model_name = "meta-llama/Llama-2-7b-hf" # Graduate to this
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# DistilGPT2 needs a pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
Prepare Your Dataset
Your data quality determines your results. Garbage in, garbage out.
from datasets import load_dataset, Dataset
# Option 1: Use existing datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
# Option 2: Load your own data
def load_custom_data(file_path):
    with open(file_path, "r") as f:
        texts = f.read().split("\n\n")  # Split by paragraphs
    return Dataset.from_dict({"text": texts})
# Tokenize everything
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
        return_tensors="pt",
    )
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
512 tokens per sample is a reasonable starting point. Longer contexts need more VRAM.
The Training Loop
Here is a basic training setup. Works on a single GPU.
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal LM, not masked LM
)
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,   # Lower if OOM
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
    fp16=True,  # Mixed precision, saves memory
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)
trainer.train()
When You Run Out of Memory
You will. Everyone does. Here are your options.
Option 1: Gradient checkpointing
model.gradient_checkpointing_enable()
Trades compute for memory. Training is 20% slower but uses 60% less VRAM.
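If you go this route, also turn off the KV cache (it conflicts with checkpointing during training), or let the Trainer handle it through a flag. A short sketch, reusing the training_args from above:
# Manual route: enable checkpointing and disable the cache to avoid warnings
model.gradient_checkpointing_enable()
model.config.use_cache = False

# Or let the Trainer switch it on for you
training_args = TrainingArguments(
    output_dir="./checkpoints",
    gradient_checkpointing=True,
    # ...same remaining arguments as before
)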
Option 2: LoRA (Low-Rank Adaptation)
Only train a small adapter. Keeps base model frozen.
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # GPT-2 specific
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",  # tells PEFT we're adapting a causal language model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: a fraction of a percent of the total
LoRA makes this practical. Pair it with the quantization option below and you can fine-tune a 7B model on 8GB VRAM.
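When training finishes, you can keep the adapter separate (a few megabytes that load on top of the base model) or fold it back into the base weights. A short sketch using PEFT's standard save and merge calls; the paths are placeholders:
# Save only the LoRA adapter weights
model.save_pretrained("./my-lora-adapter")

# Or merge the adapter into the base model for a standalone checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./my-merged-model")
tokenizer.save_pretrained("./my-merged-model")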
Option 3: Quantization
Load model in 8-bit or 4-bit.
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
4-bit + LoRA = QLoRA. This is how people fine-tune 70B models on consumer hardware. We have a complete walkthrough on fine-tuning LLMs with QLoRA on a single GPU if you want the full recipe.
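As a taste of what that looks like, here is a minimal QLoRA-style setup. Treat it as a sketch: it assumes recent transformers, peft, and bitsandbytes versions, plus Hub access to the Llama-2 weights.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bf16 compute: the standard QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on older GPUs
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # gated model, needs Hub access approval
    quantization_config=bnb_config,
    device_map="auto",
)

# Make the quantized model trainable, then attach LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # Llama attention projections
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)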
Evaluate Your Model
Perplexity is the standard metric. Lower is better.
import math
eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")But perplexity does not tell the whole story. Generate some text. Read it. Does it sound right?
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_text("The future of AI is"))
Save and Deploy
Save your fine-tuned model:
# Full model
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")
# Or push to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model")
For production inference, look into the following (a short vLLM example follows the list):
- vLLM: Fast inference server, handles batching automatically
- text-generation-inference: Hugging Face's production server
- llama.cpp: CPU inference, good for edge deployment
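For example, serving your saved model with vLLM can be this simple. A sketch, assuming vLLM is installed and you point it at a full (merged, not adapter-only) checkpoint:
from vllm import LLM, SamplingParams

# Point vLLM at the saved model directory (or a Hub repo id)
llm = LLM(model="./my-fine-tuned-model")
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)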
Common Mistakes
Learning rate too high: Start at 2e-5. If loss spikes, go lower.
Not enough data: Fine-tuning needs thousands of examples minimum. Tens of thousands for real improvement.
Training too long: Watch validation loss. Stop when it plateaus or rises. The early-stopping sketch after this list automates that check.
Wrong task format: If you want instruction-following, format data as instructions. Not raw text.
Ignoring the base model: Fine-tuning cannot fix fundamental model limitations. Choose your base wisely.
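For the "training too long" problem, the Trainer can stop for you when validation loss stops improving. A small sketch using the built-in callback; it relies on the eval settings and load_best_model_at_end=True from the TrainingArguments above (the best-model metric then defaults to eval loss):
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    # Stop if eval loss fails to improve for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)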
Realistic Expectations
Fine-tuning adapts style and domain knowledge. It does not create new capabilities. A fine-tuned 7B model will not match GPT-4. If your use case is mostly about answering questions over your own documents, building a RAG system might be a better fit than fine-tuning.
What fine-tuning is good for:
- Adapting to your company's writing style
- Learning domain-specific terminology
- Following a specific output format
- Reducing unwanted behaviors
What it cannot do:
- Make a small model as smart as a large one
- Add knowledge the base model never saw
- Fix fundamental reasoning limitations
Start with DistilGPT-2. Get the pipeline working. Then scale up. This approach saves time, money, and frustration.