2.2 Distillation & Quantization
🎯 Learning Objectives
- Understand knowledge distillation techniques for creating efficient SLMs
- Master quantization methods (8-bit, 4-bit) for model compression
- Explore QLoRA and other parameter-efficient fine-tuning approaches
- Analyze performance trade-offs in model compression
🧠 Knowledge Distillation
Knowledge Distillation is a technique where a smaller "student" model learns to mimic the behavior of a larger "teacher" model, capturing its knowledge in a more compact form.
🎓 Teacher Model
Large, powerful model
(e.g., GPT-3, 175B params)
📚 Knowledge Transfer
Soft targets, attention maps,
intermediate representations
🎯 Student Model
Compact, efficient model
(e.g., 1-7B params)
🔬 Distillation Process
Step 1: Teacher Model Preparation
Select a high-performing large model as the teacher. This model should excel at the target tasks.
Step 2: Student Architecture Design
Create a smaller model with fewer layers, smaller hidden dimensions, or other architectural optimizations.
Step 3: Distillation Training
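Train the student on a weighted combination of two losses: a soft-target loss that matches the teacher's temperature-softened output distribution, and a standard loss against the ground-truth labels.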
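To make Step 3 concrete, here is a minimal sketch of a combined distillation loss for a single example, using only the standard library. The names `distillation_loss`, `temperature`, and `alpha`, and the example logits, are illustrative assumptions rather than part of any particular framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-target term: how far the student's softened distribution is from the teacher's.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-target term: ordinary cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard

# Hypothetical logits for a 3-class problem whose true class is 0.
teacher = [4.0, 1.5, 0.2]
good_student = [3.5, 1.2, 0.1]
poor_student = [0.1, 3.0, 1.0]

loss_good = distillation_loss(good_student, teacher, label=0)
loss_poor = distillation_loss(poor_student, teacher, label=0)
print(f"good student: {loss_good:.3f}, poor student: {loss_poor:.3f}")
```

The `temperature ** 2` factor is a common convention that keeps the soft-target gradients on a similar scale as the temperature changes. In practice `alpha` and `temperature` are tuned per task.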
📊 Distillation Results
| Model | Parameters | GLUE Score | Inference Speed | Memory (GB) |
|---|---|---|---|---|
| Teacher (BERT-Large) | 340M | 84.3 | 1x | 1.3 |
| Student (DistilBERT) | 66M | 81.2 | 2x | 0.3 |
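| Student relative to teacher | 19% | 96% | 200% | 23% |

Note: the released DistilBERT checkpoint was distilled from BERT-Base (110M parameters), so the BERT-Large row is best read as an illustrative large-teacher baseline.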
⚖️ Quantization Techniques
Quantization shrinks a model and typically speeds up inference by storing weights (and sometimes activations) in lower-precision number formats (e.g., INT8 instead of FP32).
🔢 Precision Levels
FP32 (Full Precision)
Size: 4 bytes/param
Range: ±3.4 × 10³⁸
Use: Training, highest accuracy
FP16 (Half Precision)
Size: 2 bytes/param
Range: ±6.5 × 10⁴
Use: Modern GPUs, good balance
INT8 (8-bit Integer)
Size: 1 byte/param
Range: -128 to 127
Use: Edge devices, 4x smaller than FP32
INT4 (4-bit Integer)
Size: 0.5 bytes/param
Range: -8 to 7
Use: Extreme compression, 8x smaller than FP32
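The bytes-per-parameter figures above translate directly into weight storage. The sketch below works through the arithmetic for a hypothetical 7B-parameter model. The helper name `weight_memory_gb` is an assumption for illustration, and the estimate covers weights only (no activations, KV cache, or quantization scale factors).

```python
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(num_params, precision):
    """Weight storage in GB (1 GB = 1e9 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model

for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(num_params, precision)
    ratio = BYTES_PER_PARAM["FP32"] / BYTES_PER_PARAM[precision]
    print(f"{precision}: {gb:.1f} GB ({ratio:.0f}x smaller than FP32)")
```

Real quantized checkpoints are slightly larger than this estimate because they also store per-tensor or per-group scale factors.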
🛠️ Quantization Methods
📐 Post-Training Quantization (PTQ)
Quantize an already-trained model without additional training.
Pros: Fast, no retraining needed
Cons: Potential accuracy loss
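As a rough illustration of PTQ, the sketch below applies symmetric per-tensor INT8 quantization to a handful of already-trained weights and measures the round-trip error. The weight values and the function names `quantize_int8` and `dequantize` are made up for the example. Production toolkits add calibration data, per-channel scales, and activation quantization on top of this idea.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]

# Hypothetical trained weights from one layer.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

For values inside the clipping range, the round-trip error of this scheme is bounded by half the scale, which is why a single large outlier (which inflates the scale) hurts every other weight in the tensor.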
🎯 Quantization-Aware Training (QAT)
Simulate quantization in the forward pass during training so the weights learn to tolerate rounding error.
Pros: Better accuracy retention
Cons: Requires retraining time
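One common way to implement the quantization simulation is "fake quantization" with a straight-through estimator (STE): the forward pass uses the rounded weight, while the gradient is applied to a full-precision latent weight as if rounding were the identity. The toy below trains a single weight this way. The grid step of 0.25, the learning rate, and the data (generated from y = 0.75x, a target chosen to sit exactly on the grid so the run settles) are all assumptions for illustration.

```python
def fake_quantize(w, step=0.25):
    """Forward-pass simulation of quantization: snap w to the nearest grid point."""
    return round(w / step) * step

# Toy regression data from y = 0.75 * x (hypothetical).
data = [(x, 0.75 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]

w = 0.0    # latent full-precision weight kept during training
lr = 0.05

for _ in range(200):
    for x, y in data:
        w_q = fake_quantize(w)          # forward pass sees the quantized weight
        grad = 2 * (w_q * x - y) * x    # d(squared error)/d(w_q)
        w -= lr * grad                  # STE: treat d(w_q)/d(w) as 1

print("latent weight:", w)
print("deployed (quantized) weight:", fake_quantize(w))
```

When the ideal weight falls between two grid points, the quantized weight can oscillate between them during training. Real QAT recipes typically handle this with learning-rate decay or learned step sizes.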
📈 Quantization Performance Impact (chart not reproduced)
🔧 QLoRA: Quantized Low-Rank Adaptation
QLoRA combines quantization with Low-Rank Adaptation (LoRA) to enable efficient fine-tuning of large models on consumer hardware.
🏗️ QLoRA Architecture
Core Components
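- 4-bit Quantization: Frozen base model weights stored in 4-bit precision (the original QLoRA work uses the NF4 data type)
- LoRA Adapters: Small trainable low-rank matrices added to selected linear layers; only these are updated
- Gradient Checkpointing: Recomputes activations during the backward pass to trade compute for memory
- Paged Optimizers: Page optimizer state between GPU and CPU memory to absorb memory spikes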
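The adapter idea behind these components can be sketched in a few lines: the frozen weight `W` contributes `W x`, and a low-rank pair `B`, `A` adds `(alpha / r) * B (A x)`. The pure-Python version below is only an illustration. The dimensions, values, and helper names are assumptions, and in QLoRA the frozen weight would be dequantized from 4-bit storage on the fly rather than held as floats.

```python
def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def lora_forward(W_frozen, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    base = matvec(W_frozen, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_out, d_in, r):
    """Parameters in the adapter pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

d_out, d_in, r, alpha = 3, 4, 2, 4
W = [[0.1 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen base weight
A = [[0.01] * d_in for _ in range(r)]       # r x d_in (random in practice)
B = [[0.0] * r for _ in range(d_out)]       # d_out x r, zero-initialised
x = [1.0, 2.0, 3.0, 4.0]

y = lora_forward(W, A, B, x, alpha, r)
print("output equals base output at init:", y == matvec(W, x))

# Adapter size for a hypothetical 4096 x 4096 projection at rank 16.
full = 4096 * 4096
adapter = trainable_params(4096, 4096, 16)
print(f"trainable: {adapter:,} of {full:,} ({100 * adapter / full:.2f}%)")
```

Initialising `B` to zero means the adapted model starts out identical to the base model, so fine-tuning begins from the pretrained behaviour.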
💡 QLoRA Benefits
| Method | Memory (GB) | Training Time (relative) | Performance Retention | Hardware Requirements |
|---|---|---|---|---|
| Full Fine-tuning | 80-120 | 1x | 100% | Multiple A100s |
| LoRA | 40-60 | 0.8x | 95-98% | Single A100 |
| QLoRA | 12-16 | 0.9x | 90-95% | RTX 3090/4090 |
🚀 Advanced Compression Techniques
✂️ Structured Pruning
Remove entire neurons, attention heads, or layers based on importance metrics.
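A minimal sketch of the idea, assuming the L2 norm of each neuron's weight row as the importance metric (one of several possible choices) and a made-up 4-neuron layer. The function name `prune_neurons` is hypothetical.

```python
import math

def prune_neurons(weight_rows, keep_ratio):
    """Keep the highest-L2-norm rows (one row = one output neuron); drop the rest."""
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight_rows]
    n_keep = max(1, round(len(weight_rows) * keep_ratio))
    ranked = sorted(range(len(weight_rows)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original neuron order
    return [weight_rows[i] for i in kept], kept

# Hypothetical layer: 4 output neurons, 3 inputs each.
layer = [
    [0.9, -0.7, 0.4],
    [0.01, 0.02, -0.01],
    [-0.5, 0.6, 0.8],
    [0.03, -0.02, 0.0],
]

pruned, kept = prune_neurons(layer, keep_ratio=0.5)
print("kept neuron indices:", kept)
print("pruned layer shape:", len(pruned), "x", len(pruned[0]))
```

Because whole rows are removed, the result is a genuinely smaller dense matrix. In a real network the matching input columns of the next layer must be removed as well, and a short fine-tuning pass usually follows.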
🎭 Progressive Distillation
Gradually reduce model size through multiple distillation stages.
🔄 Dynamic Quantization
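Choose quantization parameters adaptively instead of applying one fixed setting everywhere: precision can vary by layer, and scales can be computed at runtime from the observed activation ranges.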
- Mixed Precision: Different layers use different bit widths
- Adaptive Quantization: Adjust precision based on activation ranges
- Channel-wise Quantization: Per-channel scaling for better accuracy
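To see why per-channel scaling helps, the sketch below compares the INT8 round-trip error of one shared scale against one scale per channel. The two channels and their values are invented for the example, and the helper names are assumptions.

```python
def scale_for(values):
    """Symmetric INT8 scale that maps the largest magnitude to 127."""
    return max(abs(v) for v in values) / 127

def quant_error(values, scale):
    """Mean absolute round-trip error for symmetric INT8 with the given scale."""
    total = 0.0
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        total += abs(v - q * scale)
    return total / len(values)

# Hypothetical weight matrix with two channels of very different magnitude.
channels = [
    [12.0, -8.5, 3.2, -10.1],
    [0.011, -0.004, 0.009, -0.013],
]

flat = [v for ch in channels for v in ch]
per_tensor_scale = scale_for(flat)

per_tensor_err = [quant_error(ch, per_tensor_scale) for ch in channels]
per_channel_err = [quant_error(ch, scale_for(ch)) for ch in channels]

for i, (t, c) in enumerate(zip(per_tensor_err, per_channel_err)):
    print(f"channel {i}: per-tensor error {t:.6f}, per-channel error {c:.6f}")
```

With a single shared scale, the small-magnitude channel rounds entirely to zero. Giving it its own scale preserves its values at the cost of storing one extra scale factor per channel.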
🎯 Best Practices for Model Compression
- Start with Distillation: Often provides the best accuracy-size trade-off
- Combine Techniques: Use distillation + quantization + pruning together
- Calibration Data Quality: Use representative data for quantization calibration
- Task-Specific Optimization: Tailor compression to your specific use case
- Iterative Approach: Gradually increase compression while monitoring performance
- Hardware Consideration: Choose techniques compatible with target deployment hardware
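The iterative approach can be sketched as a loop that tries progressively more aggressive settings and stops at the last one that stays within budget. The version below uses weight round-trip error as a stand-in metric so that it is self-contained. In a real project the check would be task accuracy on a validation set, and the function names, candidate bit widths, and error budget here are all hypothetical.

```python
def roundtrip_error(weights, bits):
    """Max absolute error after symmetric quantization to the given bit width."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / levels
    return max(
        abs(w - max(-levels, min(levels, round(w / scale))) * scale)
        for w in weights
    )

def smallest_acceptable_bits(weights, error_budget, candidates=(16, 8, 6, 4, 3, 2)):
    """Walk from mild to aggressive compression; stop when the budget is exceeded."""
    chosen = None
    for bits in candidates:
        if roundtrip_error(weights, bits) > error_budget:
            break
        chosen = bits
    return chosen

# Hypothetical weights and error budget.
weights = [0.82, -0.41, 0.05, -1.27, 0.33]
best = smallest_acceptable_bits(weights, error_budget=0.05)
print("most aggressive bit width within budget:", best)
```

The same loop structure applies to pruning ratios or student sizes: change one knob at a time, re-measure, and keep the last configuration that met the target.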
Key Insight: The goal isn't just a smaller model; it is the right balance between size, speed, and accuracy for your specific application requirements.