Fine-Tuning and Adaptation

Customizing models for your domain

Why Fine-Tune?

Pre-trained models are generalists. They've absorbed vast knowledge from the internet—conversational ability, factual recall, reasoning patterns, code understanding—but they're not optimized for your specific task.

Your domain has its own vocabulary. Legal documents use terms like "estoppel" and "force majeure." Medical records contain abbreviations like "q.i.d." and "PRN." Code assistants need to understand your company's internal APIs and coding conventions.

Your domain has its own style. Customer service responses should be warm and empathetic. Technical documentation should be precise and formal. Creative writing should be evocative and surprising. A generalist model produces generic output; a fine-tuned model speaks your language.

Your domain has its own knowledge. What's proprietary to your organization isn't on the public internet. The pre-trained model has never seen your internal wikis, your customer data, your specialized procedures.

Fine-tuning bridges this gap. You take a powerful foundation—all that general capability the model already has—and specialize it for your specific needs.

Full Fine-Tuning

The most straightforward approach: update all the model's parameters on your task-specific data.

You start with the pre-trained weights and continue training on your dataset. Every weight in every layer can change to better fit your data. This gives the model maximum flexibility to adapt.

For a 7B parameter model, this means 7 billion numbers can shift and adjust to your task. The model can completely rewire its internal representations if needed.

The power of full fine-tuning comes with significant costs:

Memory explosion. Training requires storing gradients and optimizer states for every parameter. For the Adam optimizer, that's roughly 12 bytes per parameter in mixed precision: fp32 master weights plus two fp32 moment estimates. A 7B model needs 84GB just for optimizer states—plus the model weights themselves, plus gradients, plus activations during the forward pass.
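The arithmetic behind that 84GB figure, as a quick back-of-the-envelope sketch (it counts optimizer states, weights, and gradients only; activation memory comes on top):

```python
params = 7e9  # 7B-parameter model

# Mixed-precision Adam keeps fp32 master weights plus two fp32 moment
# estimates: 3 * 4 bytes = 12 bytes of optimizer state per parameter.
optimizer_gb = params * 12 / 1e9

# fp16 weights (2 bytes) and fp16 gradients (2 bytes) come on top,
# before counting any activations.
weights_and_grads_gb = params * (2 + 2) / 1e9

total_gb = optimizer_gb + weights_and_grads_gb
print(f"optimizer states: {optimizer_gb:.0f} GB")        # 84 GB
print(f"with weights and gradients: {total_gb:.0f} GB")  # 112 GB
```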

Catastrophic forgetting. As the model adapts aggressively to your narrow dataset, it may "forget" the general capabilities that made it useful in the first place. You optimize for legal text, and suddenly it can't hold a normal conversation.

Need for large datasets. With billions of parameters free to change, you need substantial data to guide them. Too little data leads to overfitting—the model memorizes your examples rather than learning general patterns.

Full fine-tuning remains the gold standard when you have abundant data, compute, and need maximum adaptation. But for most practical applications, there's a better way.

LoRA: Low-Rank Adaptation

Here's the key insight that makes efficient fine-tuning possible: the weight changes during fine-tuning are approximately low-rank.

When you fine-tune a model, you're not completely rewriting every connection. Most of the pre-trained knowledge stays intact. The changes can be captured by a much smaller number of parameters than the full weight matrices.

LoRA (Low-Rank Adaptation) exploits this beautifully. Instead of modifying the original weight matrix W directly, you add a low-rank decomposition:

W' = W + BA

Where:

  • W is the original pre-trained weight of shape d × k (frozen, never changes)
  • B is a matrix of shape d × r
  • A is a matrix of shape r × k
  • r is the rank, typically 4, 8, or 16

Interactive: LoRA Weight Decomposition

[Widget: W' = W (frozen, 16 × 12) + B (trainable, 16 × 4) × A (trainable, 4 × 12). Original parameters: d × k = 16 × 12 = 192. LoRA parameters: r × (d + k) = 4 × (16 + 12) = 112, a 41.7% reduction (112 vs 192 parameters).]

Adjust the rank to see how it affects parameter count. Lower rank means fewer trainable parameters but less expressiveness. Rank 4-16 typically works well.

The original matrix W might have millions of parameters. But B and A together have only r × (d + k) parameters. If your original layer is 4096×4096 (about 16.8 million parameters) and you use rank 8, LoRA adds only 8 × (4096 + 4096) = 65,536 trainable parameters—less than 0.5% of the original.
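The parameter counts from that example, as a quick check:

```python
d = k = 4096   # a square projection layer, as in the text
r = 8

full_params = d * k          # 16,777,216 weights in the original matrix
lora_params = r * (d + k)    # 65,536 trainable weights in B and A

ratio = lora_params / full_params
print(lora_params, f"{100 * ratio:.2f}%")   # 65536 0.39%
```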

During training:

  1. Forward pass: Compute both Wx and BAx, add them together
  2. Backward pass: Only update B and A; W stays frozen
  3. The frozen W means no optimizer states needed for most parameters
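The steps above can be sketched in plain Python (toy dimensions, no actual training loop; the zero initialization of B follows the LoRA paper, so the adapter starts as a no-op):

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def vecadd(u, v):
    return [a + b for a, b in zip(u, v)]

# Toy dimensions: W is d x k, B is d x r, A is r x k.
d, k, r = 6, 4, 2
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
B = [[0.0] * r for _ in range(d)]                                  # trainable, zero-init
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable

x = [1.0, 2.0, -1.0, 0.5]

# Forward pass: y = Wx + B(Ax). Only B and A would receive gradients.
y = vecadd(matvec(W, x), matvec(B, matvec(A, x)))

# B starts at zero, so BAx = 0 and the adapted model initially behaves
# exactly like the frozen base model.
assert y == matvec(W, x)
```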

During inference, you can merge the LoRA weights into the base model:

W_merged = W + BA

This is a one-time matrix addition. After merging, inference is exactly as fast as the original model—no additional latency, no architectural changes. The LoRA adapters effectively become invisible.
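A toy check that the merge is exact, in plain Python with hand-picked small matrices:

```python
def matmul(P, Q):
    """Multiply matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def matadd(P, Q):
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]

# Toy shapes: W is 3 x 3, B is 3 x 1, A is 1 x 3 (rank r = 1).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]
A = [[0.5, -1.0, 0.0]]

# One-time merge: fold the adapter into the base weights.
W_merged = matadd(W, matmul(B, A))

# A single multiply by W_merged reproduces Wx + B(Ax) exactly.
x = [[2.0], [1.0], [-1.0]]
assert matmul(W_merged, x) == matadd(matmul(W, x), matmul(B, matmul(A, x)))
```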

Even better: you can train multiple LoRA adapters for different tasks and swap them at inference time. One base model, many specializations.

Other PEFT Methods

LoRA isn't the only parameter-efficient fine-tuning (PEFT) technique. The field has developed several complementary approaches.

Adapters insert small bottleneck layers between transformer blocks. Each adapter is a down-projection followed by a nonlinearity and up-projection:

Adapter(x) = x + f(x W_down) W_up

The bottleneck (down then up) keeps parameter count low. Unlike LoRA, adapters add layers rather than modifying existing weights, which introduces small inference overhead.
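A sketch of the adapter bottleneck in plain Python (toy weights chosen by hand; here W_down is stored as an r × d matrix and W_up as d × r so each application is a matrix-vector product, and f is a ReLU):

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def relu(v):
    return [max(0.0, a) for a in v]

def adapter(x, W_down, W_up):
    """Adapter(x) = x + f(x W_down) W_up with a ReLU bottleneck."""
    hidden = relu(matvec(W_down, x))           # down-project: d -> r
    up = matvec(W_up, hidden)                  # up-project:   r -> d
    return [xi + ui for xi, ui in zip(x, up)]  # residual connection

d, r = 4, 2
W_down = [[0.5, 0.0, -0.5, 0.0],
          [0.0, 1.0,  0.0, 1.0]]   # r x d
W_up = [[1.0, 0.0],
        [0.0, 1.0],
        [1.0, 1.0],
        [0.0, 0.0]]                # d x r
x = [1.0, 2.0, 3.0, 4.0]
y = adapter(x, W_down, W_up)   # -> [1.0, 8.0, 9.0, 4.0]
```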

Prefix Tuning prepends learnable "soft prompt" vectors to the input. Instead of finding the right words to prompt the model, you learn continuous vectors that can express things no natural language prompt could. The base model remains completely frozen; only the prefix vectors are trained.

QLoRA combines LoRA with aggressive quantization. The base model is quantized to 4-bit precision, dramatically reducing memory. LoRA adapters are still trained in higher precision (16-bit) for stability. This allows fine-tuning a 65B model on a single 48GB GPU—previously impossible without distributed systems.
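In practice this is usually set up through libraries such as Hugging Face transformers, bitsandbytes, and peft. A minimal configuration sketch (the model id, target modules, and hyperparameter values here are illustrative assumptions, not recommendations):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model with 4-bit NF4 quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # illustrative model id
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Attach 16-bit LoRA adapters on the attention projections.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```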

Memory Comparison: Fine-Tuning Methods

GPU memory required for a 7B-parameter model:

  • Full fine-tune: ~84 GB (all parameters + optimizer states)
  • LoRA (r=8): ~16 GB (base model + small trainable adapters)
  • QLoRA (4-bit): ~6 GB (quantized base + LoRA adapters)

Common GPU memory limits: RTX 4090: 24 GB, A6000: 48 GB, H100: 80 GB.

QLoRA makes 7B fine-tuning possible on a gaming GPU.

Memory requirements assume batch size 1 with gradient checkpointing. Full fine-tuning is often impossible without multi-GPU setups.

Each method offers different trade-offs between parameter efficiency, performance, and complexity. LoRA has emerged as the most popular choice for its simplicity and effectiveness, but the right choice depends on your constraints.

When NOT to Fine-Tune

Fine-tuning isn't always the answer. Before investing in training, consider whether simpler approaches might work.

Few-shot prompting can handle many tasks without any training. By including examples in your prompt, the model learns the pattern at inference time. This is surprisingly effective for format conversion, classification, and simple transformations. No data preparation, no training infrastructure, instant iteration.
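For example, a few-shot prompt for sentiment classification needs nothing but the prompt string itself (the reviews and labels below are invented for illustration):

```python
prompt = """Classify each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: Stopped working after two weeks and support never replied.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""

# Sent as-is to the model, which is expected to continue the pattern
# with a sentiment label. No training, no data pipeline.
```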

Retrieval-Augmented Generation (RAG) excels when the problem is knowledge, not behavior. If you need the model to know about your company's products, current events, or private documents, injecting that information at query time often beats trying to bake it into weights. RAG stays current automatically as your knowledge base updates.

The decision framework:

  • Use fine-tuning when you need to change how the model responds—its style, tone, format, or reasoning approach
  • Use RAG when you need to change what the model knows—facts, documents, current information
  • Use few-shot prompting when you need quick experiments or the task is simple enough

Often the best systems combine all three: a fine-tuned model that responds in your preferred style, augmented with retrieved knowledge, steered by carefully crafted prompts.

The goal isn't to choose one technique. It's to understand when each shines, so you reach for the right tool.

Key Takeaways

  • Pre-trained models are generalists; fine-tuning specializes them for your domain's vocabulary, style, and knowledge
  • Full fine-tuning updates all parameters—maximum flexibility but requires enormous memory and risks catastrophic forgetting
  • LoRA adds low-rank matrices (W' = W + BA) that capture adaptation with less than 1% of parameters
  • LoRA adapters merge into base weights at inference, adding zero latency overhead
  • QLoRA combines 4-bit quantization with LoRA, enabling fine-tuning of large models on modest hardware
  • Fine-tuning changes model behavior; RAG provides knowledge—choose based on what you need to modify