11.4 Fine-Tuning & LoRA Adaptation

While pre-trained foundation models are powerful, adapting them to specific tasks or domains can significantly improve performance. Fine-tuning is the process of continuing to train a model's weights on a new, task-specific dataset. However, full fine-tuning is computationally expensive because every parameter is updated, so gradients and optimizer state must be kept for the entire model. This has led to the development of Parameter-Efficient Fine-Tuning (PEFT) methods, of which Low-Rank Adaptation (LoRA) is one of the most popular.

Interactive LoRA Visualization

The diagram below illustrates the difference between full fine-tuning and LoRA.

LoRA Mode

In LoRA, the original model weights (blue) are frozen. Small, trainable "adapter" matrices (orange) are injected alongside them. Only these adapters are updated during training, and they often account for well under 1% of the total parameters. Because gradients and optimizer state are needed only for the adapters, memory requirements drop sharply, and training usually gets faster as well.

Full Fine-Tuning Mode

In full fine-tuning, all the weights of the pre-trained model are updated during training. This is highly effective but requires significant computational resources and memory, making it impractical for very large models.
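To make the gap between the two modes concrete, the following minimal sketch counts trainable parameters for a single weight matrix. It assumes the adapter is a pair of thin matrices of rank r (one of shape d × r, one of shape r × k), as described in the next section. The layer size (4096 × 4096) and the rank (8) are hypothetical values chosen only for illustration, and the function names are not from any library.

```python
def full_finetune_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r adapter pair: B (d x r) and A (r x k)."""
    return r * (d + k)


# Hypothetical layer size and rank, chosen only for illustration.
d, k, r = 4096, 4096, 8

full = full_finetune_params(d, k)
lora = lora_params(d, k, r)
fraction = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={r}):      {lora:,} trainable parameters")
print(f"fraction trained: {fraction:.4%}")
```

For these assumed sizes the adapter trains 65,536 parameters instead of 16,777,216, which is about 0.39% of the matrix. The ratio is r(d + k) / (dk), so it shrinks further as the layer grows while the rank stays fixed.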

How LoRA Works

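LoRA is based on the observation that the change in weights during fine-tuning often has a low "intrinsic rank." Instead of learning a full update to a weight matrix W of shape d × k, LoRA represents the update ΔW as the product of two much smaller matrices, ΔW ≈ BA, where B has shape d × r, A has shape r × k, and the rank r is much smaller than both d and k. The adapted layer then computes (W + BA)x = Wx + BAx.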

  • Frozen Weights: The original pre-trained model weights are kept frozen and are not updated during training.
  • Adapter Layers: Small, trainable matrices (A and B) are added alongside the original weight matrices. For a d × k matrix W and an adapter of rank r, A and B together hold r(d + k) trainable parameters, far fewer than the d × k parameters in W when r is small.
  • Efficiency: Since only the adapter layers are trained, the memory footprint is much smaller, and training is faster. Multiple sets of adapters can be trained for different tasks and swapped out as needed without storing multiple copies of the full model.
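The three points above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a reference implementation: the class name LoRALinear, the helper functions, the matrix sizes, and the task names are all hypothetical, and plain lists are used instead of a tensor library to keep the example dependency-free. It follows the common convention of initializing B to zero so that ΔW = BA starts at zero, and it omits the scaling factor that practical implementations usually apply to BA. No actual training is performed; a random B stands in for a trained adapter.

```python
import random

Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product; adequate for the tiny shapes used here."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def add(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def zeros(rows: int, cols: int) -> Matrix:
    return [[0.0] * cols for _ in range(rows)]


class LoRALinear:
    """Computes W x + B A x, where W is frozen and only A and B would be trained."""

    def __init__(self, W: Matrix, r: int, rng: random.Random):
        d, k = len(W), len(W[0])
        self.W = W                         # frozen pre-trained weights (d x k)
        self.A = random_matrix(r, k, rng)  # trainable (r x k), random init
        self.B = zeros(d, r)               # trainable (d x r), zero init: BA = 0

    def forward(self, x: Matrix) -> Matrix:
        # x is a column vector (k x 1); the adapter path never forms the full d x k update.
        return add(matmul(self.W, x), matmul(self.B, matmul(self.A, x)))

    def merged(self) -> Matrix:
        """Fold the adapter into a single matrix W + BA, e.g. for inference."""
        return add(self.W, matmul(self.B, self.A))


rng = random.Random(0)
W = random_matrix(4, 6, rng)          # hypothetical 4 x 6 "pre-trained" layer
layer = LoRALinear(W, r=2, rng=rng)
x = random_matrix(6, 1, rng)

y_before = layer.forward(x)           # equals W x, because B is still zero

# Stand-in for training: pretend optimization produced a non-zero B.
layer.B = random_matrix(4, 2, rng)
y_after = layer.forward(x)
y_merged = matmul(layer.merged(), x)  # same result from the merged matrix

# Adapters for different tasks share one frozen W and can be swapped in place.
adapters = {
    "task_a": (layer.A, layer.B),
    "task_b": (random_matrix(2, 6, rng), random_matrix(4, 2, rng)),
}
layer.A, layer.B = adapters["task_b"]
y_task_b = layer.forward(x)
```

Note that forward applies A first and then B, so the cost of the adapter path scales with r rather than with the size of W. Swapping tasks only replaces the small (A, B) pair, while the list holding W is never modified.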