Let me be real with you. Training an LLM from scratch is not something you will do on your laptop. Or even on a beefy workstation. We are talking thousands of GPU hours. Millions of dollars in compute.
What you can do is fine-tune. Take a pre-trained model. Adapt it to your domain. This is practical. This is what most of us actually do.
What You Need
Hardware reality check:
- Minimum: 16GB GPU (RTX 4080, A4000)
- Comfortable: 24GB GPU (RTX 4090, A5000)
- Production: Multiple A100s or H100s
No GPU? Use Google Colab Pro ($10/month) or Lambda Labs ($1.10/hour for A10).
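Not sure what you actually have? A minimal sketch using PyTorch's CUDA utilities to check the card and its VRAM:
```python
import torch

# Quick hardware check (assumes a CUDA GPU; adjust for ROCm or Apple MPS).
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected -- consider Colab or a cloud GPU.")
```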
Software stack:
```bash
pip install torch transformers datasets accelerate peft bitsandbytes
```
Python 3.10+. CUDA 11.8 or higher.
Pick Your Base Model
Start small. DistilGPT-2 has 82M parameters. Good for learning. Llama-2-7B is production-ready but needs 14GB+ VRAM.
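A back-of-the-envelope way to see why: fp16 weights alone take about 2 bytes per parameter, and training adds gradients, optimizer states, and activations on top. A rough sketch:
```python
def fp16_weight_gb(num_params):
    # fp16/bf16 stores 2 bytes per parameter; gradients, optimizer states,
    # and activations come on top of this during training.
    return num_params * 2 / 1024**3

print(f"DistilGPT-2: {fp16_weight_gb(82e6):.2f} GB")  # ~0.15 GB
print(f"Llama-2-7B:  {fp16_weight_gb(7e9):.2f} GB")   # ~13 GB
```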
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # Start here
# model_name = "meta-llama/Llama-2-7b-hf"  # Graduate to this

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# DistilGPT2 needs a pad token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id
```
Prepare Your Dataset
Your data quality determines your results. Garbage in, garbage out.
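One cheap way to act on that before training: drop empty and near-empty samples. A minimal sketch on the WikiText data used below; the 20-character threshold is an arbitrary choice, and real projects usually add deduplication and format checks on top.
```python
from datasets import load_dataset

raw = load_dataset("wikitext", "wikitext-2-raw-v1")
# WikiText is full of blank lines and bare headings; drop near-empty rows.
raw = raw.filter(lambda ex: len(ex["text"].strip()) > 20)
print({split: raw[split].num_rows for split in raw})
```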
```python
from datasets import load_dataset, Dataset

# Option 1: Use existing datasets
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# Option 2: Load your own data (remember to create train/validation splits)
def load_custom_data(file_path):
    with open(file_path, "r") as f:
        texts = f.read().split("\n\n")  # Split by paragraphs
    return Dataset.from_dict({"text": texts})

# Tokenize everything
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
```
512 tokens per sample is a reasonable starting point. Longer contexts need more VRAM.
The Training Loop
Here is a basic training setup. Works on a single GPU.
```python
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # We're doing causal LM, not masked LM
)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,   # Lower if OOM
    gradient_accumulation_steps=4,   # Effective batch size = 16
    learning_rate=2e-5,
    warmup_ratio=0.1,
    weight_decay=0.01,
    logging_steps=100,
    save_steps=500,
    eval_strategy="steps",
    eval_steps=500,
    load_best_model_at_end=True,
    fp16=True,  # Mixed precision, saves memory
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
)

trainer.train()
```
When You Run Out of Memory
You will. Everyone does. Here are your options.
Option 1: Gradient checkpointing
```python
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is incompatible with gradient checkpointing
```
Trades compute for memory. Training runs roughly 20% slower but uses around 60% less activation memory.
Option 2: LoRA (Low-Rank Adaptation)
Only train a small adapter. Keeps base model frozen.
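The saving comes from the low-rank trick: instead of updating a full weight matrix, LoRA learns two thin rank-r factors next to it. Rough numbers for one 768x768 projection (DistilGPT-2's hidden size) at the rank used below:
```python
d, r = 768, 16    # DistilGPT-2 hidden size, LoRA rank
full = d * d      # updating the full projection: 589,824 params
lora = 2 * d * r  # low-rank factors A (r x d) and B (d x r): 24,576 params
print(f"LoRA trains {lora:,} params instead of {full:,} ({lora / full:.1%})")
```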
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn", "c_proj"],  # GPT-2 specific
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~0.1% of total
```
LoRA makes this practical. Combined with quantization (next option), you can fine-tune a 7B model on roughly 8GB of VRAM.
Option 3: Quantization
Load model in 8-bit or 4-bit.
```python
from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```
4-bit + LoRA = QLoRA. This is how people fine-tune 30B-class models on a single 24GB consumer GPU and 65B-70B models on a single 48GB card.
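A sketch of the 4-bit setup, assuming recent transformers, peft, and bitsandbytes versions (the NF4 settings follow the QLoRA defaults):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # QLoRA's NormalFloat4
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on pre-Ampere GPUs
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # freezes base weights, upcasts norms, enables grad checkpointing
model = get_peft_model(model, lora_config)      # reuse the LoraConfig from above
```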
Evaluate Your Model
Perplexity is the standard metric. Lower is better.
```python
import math

eval_results = trainer.evaluate()
perplexity = math.exp(eval_results["eval_loss"])
print(f"Perplexity: {perplexity:.2f}")
```
But perplexity does not tell the whole story. Generate some text. Read it. Does it sound right?
```python
def generate_text(prompt, max_length=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        top_p=0.9,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_text("The future of AI is"))
```
Save and Deploy
Save your fine-tuned model:
```python
# Full model (with LoRA, this saves only the adapter weights)
model.save_pretrained("./my-fine-tuned-model")
tokenizer.save_pretrained("./my-fine-tuned-model")

# Or push to Hugging Face Hub
model.push_to_hub("your-username/my-fine-tuned-model")
```
For production inference, look into:
- vLLM: Fast inference server, handles batching automatically (see the sketch after this list)
- text-generation-inference: Hugging Face's production server
- llama.cpp: CPU inference, good for edge deployment
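For example, a minimal vLLM sketch, assuming you merged the LoRA adapter into the base model (PEFT's `merge_and_unload()`) and saved the full model to `./my-fine-tuned-model`:
```python
from vllm import LLM, SamplingParams

llm = LLM(model="./my-fine-tuned-model")  # local path or Hub ID of the merged model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=100)

outputs = llm.generate(["The future of AI is"], params)
print(outputs[0].outputs[0].text)
```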
Common Mistakes
Learning rate too high: Start at 2e-5. If loss spikes, go lower.
Not enough data: Fine-tuning needs thousands of examples minimum. Tens of thousands for real improvement.
Training too long: Watch validation loss. Stop when it plateaus or rises (see the early-stopping sketch below).
Wrong task format: If you want instruction-following, format data as instructions. Not raw text.
Ignoring the base model: Fine-tuning cannot fix fundamental model limitations. Choose your base wisely.
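On the "training too long" point, the Trainer can stop for you. A minimal sketch using transformers' EarlyStoppingCallback, reusing the training_args from above:
```python
from transformers import EarlyStoppingCallback, Trainer

# Requires load_best_model_at_end=True and matching eval/save steps;
# metric_for_best_model defaults to the eval loss in that case.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals with no improvement
)
```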
Realistic Expectations
Fine-tuning adapts style and domain knowledge. It does not create new capabilities. A fine-tuned 7B model will not match GPT-4.
What fine-tuning is good for:
- Adapting to your company's writing style
- Learning domain-specific terminology
- Following a specific output format
- Reducing unwanted behaviors
What it cannot do:
- Make a small model as smart as a large one
- Add knowledge the base model never saw
- Fix fundamental reasoning limitations
Start with DistilGPT-2. Get the pipeline working. Then scale up. This approach saves time, money, and frustration.