2.3 Popular Open SLM Families
🎯 Learning Objectives
- Explore leading open-source SLM families and their characteristics
- Compare capabilities and benchmarks across different model series
- Understand licensing and commercial usage considerations
- Learn practical deployment examples for each family
🔷 Microsoft Phi Series
Phi-3 Family Overview
Philosophy: "Small language models can be as capable as much larger ones when trained on high-quality data"
| Model | Parameters | Context Length | Training Data | Key Strengths |
|---|---|---|---|---|
| Phi-3-Mini | 3.8B | 4K / 128K | 3.3T tokens | Reasoning, Math, Code |
| Phi-3-Small | 7B | 8K / 128K | 4.8T tokens | Enhanced multilingual |
| Phi-3-Medium | 14B | 4K / 128K | 4.8T tokens | Complex reasoning |
📊 Benchmark Performance (Phi-3-Mini vs Competitors)
| Benchmark | Phi-3-Mini |
|---|---|
| MMLU | 69% |
| HumanEval | 61% |
| GSM8K | 87% |
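# Phi-3 deployment example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory
    device_map="auto",          # requires the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Instruct checkpoints expect the chat template rather than a raw prompt
messages = [{"role": "user", "content": "Explain quantum computing in simple terms"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))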
Highlights: MIT License, commercial use OK, ONNX support, mobile optimized
🔶 Google Gemma Series
Gemma Family Overview
Philosophy: "Built from the same research and technology used to create Gemini models"
| Model | Parameters | Context Length | Variants | Key Features |
|---|---|---|---|---|
| Gemma 2B | 2.5B | 8K | Base, Instruct | Ultra-lightweight |
| Gemma 7B | 8.5B | 8K | Base, Instruct | Balanced performance |
| Gemma 2 9B | 9.2B | 8K | Base, Instruct | Next-gen architecture |
📊 Safety & Responsible AI Focus
- Responsible Generative AI Toolkit: Guidance and tooling for building safety classifiers around the model
- Comprehensive Filtering: Extensive safety training
- Debugging Tools: LIT (Language Interpretability Tool) integration
- Model Cards: Detailed documentation and limitations
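# Gemma instruction-tuned model with its chat template
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# The chat template wraps each turn in the markers the model was tuned on
chat = [
    {"role": "user", "content": "Write a Python function to sort a list"},
]
# add_generation_prompt=True appends the opening of the model's turn
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The template already includes the BOS token, so do not add it a second time
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))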
Highlights: Gemma license (commercial use allowed under the Gemma Terms of Use), safety-focused, JAX/PyTorch support
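To make the chat template step in the Gemma example concrete, here is a minimal sketch of the turn format that Gemma's instruction-tuned checkpoints are documented to use. The helper name `format_gemma_chat` is hypothetical and exists only for illustration; in practice `tokenizer.apply_chat_template` should be treated as the source of truth, since the template ships with the tokenizer and may differ between releases.

```python
def format_gemma_chat(messages, add_generation_prompt=True):
    """Hand-rolled approximation of Gemma's chat format (illustration only)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma's template names the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    if add_generation_prompt:
        # Open the model's turn so generation starts with its reply
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = format_gemma_chat(chat)
print(prompt)
```

Because the formatted string already begins with `<bos>`, tokenizing it again with special tokens enabled would insert a second one, which is why the tokenizer call should disable them when the template is applied as text first.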
🔸 Mistral AI Family
Mistral Open Models
Philosophy: "Efficiency and performance through architectural innovations"
Mistral 7B v0.3
| Property | Value |
|---|---|
| Parameters | 7.2B |
| Context | 32K (extended) |
| Architecture | Transformer + optimizations |
| Training | High-quality curated data |
Key Innovations:
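- Sliding Window Attention (introduced in Mistral 7B v0.1; releases from v0.2 onward use full attention over the 32K context)
- Grouped-Query Attention (GQA)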
- Efficient tokenization
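The first two ideas in the list above can be sketched in a few lines of plain Python. This is a conceptual illustration, not Mistral's implementation: the function names are hypothetical, the window of 3 is chosen only to keep the printout small, and the 32 query heads sharing 8 key/value heads is the configuration reported for Mistral 7B.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Attention stays causal (j <= i) and is limited to the last `window` positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Grouped-query attention: consecutive query heads share one key/value head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Map each of 32 query heads to one of 8 shared key/value heads
print([kv_head_for_query_head(h, 32, 8) for h in range(32)])
```

Each row of the mask has at most `window` allowed positions, so attention cost per token stops growing with sequence length. Sharing key/value heads shrinks the KV cache by the group size, a factor of 4 in this configuration.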
Mixtral 8x7B (MoE)
| Property | Value |
|---|---|
| Total Params | 46.7B |
| Active Params | 12.9B |
| Context | 32K |
| Experts | 8 (2 active per token) |
MoE Benefits:
- Large capacity with a small active compute footprint (all 46.7B parameters must still fit in memory)
- Specialized expert routing
- Better scaling efficiency
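The routing idea behind these benefits can be illustrated with a toy top-2 router. This is a sketch under stated assumptions, not Mixtral's code: the experts are stand-in functions, the router scores are made up, and the gate is computed as a softmax over the selected scores, which is one common formulation.

```python
import math


def top_k_route(router_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    max_logit = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - max_logit) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(weight * experts[i](token) for i, weight in top_k_route(router_logits, k))


# Eight toy "experts": expert n simply scales its input by n + 1
experts = [lambda x, scale=n + 1: x * scale for n in range(8)]
router_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up scores for one token

routes = top_k_route(router_logits)
print(routes)
print(moe_layer(1.0, experts, router_logits))
```

Only two of the eight experts execute for this token, which is why compute tracks the active parameter count. Using the figures in this section, about 12.9 / 46.7 ≈ 28% of the parameters take part in each forward step.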
📊 Performance Comparison
| Model | MMLU |
|---|---|
| Mistral 7B | 64% |
| Mixtral 8x7B | 71% |
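# Mistral deployment with vLLM for efficiency
from vllm import LLM, SamplingParams

# High-throughput serving
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    tensor_parallel_size=1,  # number of GPUs to shard the model across
    dtype="half",            # FP16 weights
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200,
)
prompts = ["Explain machine learning", "Write a Python function"]
# generate() batches the prompts and returns one RequestOutput per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)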
Highlights: Apache 2.0, commercial use OK, high efficiency, vLLM optimized
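The `temperature` and `top_p` settings used in the serving example control how the next token is drawn. The sketch below shows the idea in plain Python for a single decoding step. It is a conceptual illustration with hypothetical function names and made-up scores; a serving engine such as vLLM does this in batched form on the GPU and may differ in detail.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits / temperature; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / total for i in kept}


def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Draw one token id after temperature scaling and nucleus filtering."""
    candidates = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(candidates)
    return rng.choices(tokens, weights=[candidates[t] for t in tokens], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
probs = apply_temperature(logits, 0.7)
print(top_p_filter(probs, 0.9))
print(sample(logits, rng=random.Random(0)))
```

Lowering `temperature` concentrates probability on the top tokens, while lowering `top_p` removes the unlikely tail entirely, so the two settings are usually tuned together.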
🔮 Community & Specialized Models
TinyLlama 1.1B
Focus: Ultra-lightweight deployment
| Property | Value |
|---|---|
| Parameters | 1.1B |
| Training Tokens | 3T |
| Memory | ~2GB (FP16 weights) |
| Speed | Very fast |
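# Extremely lightweight deployment
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Small enough for mobile-class hardware once quantized and exported to an on-device runtime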
Highlights: Apache 2.0, mobile ready
🎯 Specialized & Fine-tuned Variants
Code-Specialized
- CodeLlama 7B: Code generation expert
- Phind-CodeLlama: Enhanced for coding
- DeepSeek-Coder: Multi-language coding
Domain-Specific
- MedAlpaca: Medical knowledge
- FinGPT: Financial applications
- LawGPT: Legal document analysis
🎯 Model Selection Guide
| Use Case | Recommended Model | Key Reasons | Alternative |
|---|---|---|---|
| Mobile Apps | Phi-3-Mini (3.8B) | ONNX support, optimized inference | TinyLlama 1.1B |
| Edge Servers | Mistral 7B | Efficiency, sliding window attention | Llama 3.2 3B |
| Code Generation | Phi-3-Mini | Strong HumanEval performance | CodeLlama 7B |
| Safety-Critical | Gemma 7B Instruct | Extensive safety training | Phi-3 with filters |
| High Throughput | Mixtral 8x7B | MoE efficiency, large capacity | Mistral 7B |
| Research/Education | Gemma 2B | Lightweight, detailed documentation (Gemma Terms of Use apply) | TinyLlama 1.1B |
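For teams that want to encode this guide in tooling, the table can be expressed as a small lookup. This is only a sketch: the `SELECTION_GUIDE` dictionary and `recommend` helper are hypothetical names, and the entries simply mirror the rows above rather than adding new recommendations.

```python
# Use case -> (recommended model, alternative), mirroring the table above
SELECTION_GUIDE = {
    "mobile apps": ("Phi-3-Mini (3.8B)", "TinyLlama 1.1B"),
    "edge servers": ("Mistral 7B", "Llama 3.2 3B"),
    "code generation": ("Phi-3-Mini", "CodeLlama 7B"),
    "safety-critical": ("Gemma 7B Instruct", "Phi-3 with filters"),
    "high throughput": ("Mixtral 8x7B", "Mistral 7B"),
    "research/education": ("Gemma 2B", "TinyLlama 1.1B"),
}


def recommend(use_case):
    """Return (recommended, alternative) for a use case, ignoring case and padding."""
    key = use_case.strip().lower()
    if key not in SELECTION_GUIDE:
        known = ", ".join(sorted(SELECTION_GUIDE))
        raise KeyError(f"Unknown use case {use_case!r}; expected one of: {known}")
    return SELECTION_GUIDE[key]


primary, fallback = recommend("Mobile Apps")
print(f"Start with {primary}; fall back to {fallback}")
```

Keeping the alternative next to the primary choice makes it easy to fall back when the first model does not fit the hardware or license constraints.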
🏆 Best Practices for SLM Selection
- Benchmark Alignment: Choose models tested on tasks similar to yours
- License Compatibility: Ensure licensing matches your use case (commercial vs research)
- Hardware Constraints: Model size must fit your deployment environment
- Inference Framework: Check compatibility with your serving infrastructure
- Fine-tuning Needs: Some models fine-tune better than others
- Community Support: Active communities provide better long-term support
- Update Frequency: Consider how often models are updated and improved
Pro Tip: Start with the most popular model in your size category (e.g., Phi-3-Mini for 3-4B, Mistral 7B for 7B), then optimize based on your specific requirements.
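The hardware-constraints check in the list above often starts with simple arithmetic: parameter count times bytes per parameter. The sketch below applies that rule to parameter counts quoted in this section. The helper name and dtype table are assumptions for illustration, and the result covers weights only, so the KV cache, activations, and runtime overhead need extra headroom.

```python
# Approximate storage cost per parameter for common precisions
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, dtype="fp16"):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


# Parameter counts (in billions) as listed in this section
MODELS = {
    "TinyLlama 1.1B": 1.1,
    "Phi-3-Mini": 3.8,
    "Mistral 7B": 7.2,
    "Mixtral 8x7B": 46.7,
}

for name, size in MODELS.items():
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(size, dtype):.1f} GB"
        for dtype in ("fp16", "int8", "int4")
    )
    print(f"{name:<16}{row}")
```

The Mixtral row shows why a mixture-of-experts model is cheap in compute but not in memory: every expert has to be resident even though only two run per token.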