2.3 Popular Open SLM Families

🎯 Learning Objectives

  • Explore leading open-source SLM families and their characteristics
  • Compare capabilities and benchmarks across different model series
  • Understand licensing and commercial usage considerations
  • Learn practical deployment examples for each family

🔷 Microsoft Phi Series

Phi-3 Family Overview

Philosophy: "Small language models can be as capable as much larger ones when trained on high-quality data"

Model        | Parameters | Context Length | Training Data | Key Strengths
Phi-3-Mini   | 3.8B       | 4K or 128K     | 3.3T tokens   | Reasoning, Math, Code
Phi-3-Small  | 7B         | 8K or 128K     | 4.8T tokens   | Enhanced multilingual
Phi-3-Medium | 14B        | 4K or 128K     | 4.8T tokens   | Complex reasoning

📊 Benchmark Performance (Phi-3-Mini vs Competitors)

MMLU: 69%
HumanEval: 61%
GSM8K: 87%

# Phi-3 deployment example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Optimized for instruction following
prompt = "Explain quantum computing in simple terms"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

MIT License · Commercial Use OK · ONNX Support · Mobile Optimized

🔶 Google Gemma Series

Gemma Family Overview

Philosophy: "Built from the same research and technology used to create Gemini models"

Model      | Parameters | Context Length | Variants       | Key Features
Gemma 2B   | 2.5B       | 8K             | Base, Instruct | Ultra-lightweight
Gemma 7B   | 8.5B       | 8K             | Base, Instruct | Balanced performance
Gemma 2 9B | 9.2B       | 8K             | Base, Instruct | Next-gen architecture

📊 Safety & Responsible AI Focus

  • Responsible Generative AI Toolkit: Safety classifiers and guidance released alongside the models
  • Comprehensive Filtering: Filtered training data plus extensive safety tuning
  • Debugging Tools: LIT (Language Interpretability Tool) integration
  • Model Cards: Detailed documentation and limitations
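# Gemma instruction-tuned inference
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Built-in instruction following via the chat template
chat = [
    {"role": "user", "content": "Write a Python function to sort a list"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The template already includes <bos>, so skip special tokens when encoding
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(inputs.to(model.device), max_new_tokens=500)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Gemma License · Commercial Use* · Safety First · JAX/PyTorch
* Commercial use is permitted subject to the Gemma Terms of Use and its prohibited use policy.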
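
To make the chat-template step less of a black box, the sketch below approximates the turn format that Gemma's instruction-tuned checkpoints expect. The helper name format_gemma_chat is hypothetical and exists only for illustration. In real code the tokenizer's own apply_chat_template remains the authoritative source, and its output may differ in small details such as whitespace trimming.

```python
def format_gemma_chat(messages):
    """Render a list of {"role", "content"} dicts in Gemma's turn format (illustrative)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Open a model turn so generation continues as the assistant
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a Python function to sort a list"}]
prompt = format_gemma_chat(chat)
print(prompt)
```

Because the rendered prompt already starts with <bos>, encoding it again with special tokens enabled would duplicate that token, which is worth checking whenever a templated string is tokenized by hand.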

🔸 Mistral AI Family

Mistral Open Models

Philosophy: "Efficiency and performance through architectural innovations"

Mistral 7B v0.3

Parameters: 7.2B
Context: 32K (extended)
Architecture: Transformer + optimizations
Training: High-quality curated data
Key Innovations:
  • Sliding Window Attention (introduced in v0.1; v0.2 and later use full attention over the 32K context)
  • Grouped-Query Attention (GQA)
  • Efficient tokenization
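
The first two innovations listed above are easier to picture with small numbers. The following is a minimal pure-Python sketch, not Mistral's implementation: sliding_window_mask builds a causal mask in which each position attends only to the most recent few positions, and kv_cache_bytes_per_token shows why sharing key/value heads across query heads shrinks the cache. Both function names are hypothetical, and the layer and head counts are taken from the published Mistral 7B architecture rather than from this section, so treat them as assumptions.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache stored per token (keys plus values, all layers)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Each row is a query position; "x" marks the keys it can see
mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Assumed configuration: 32 layers, 32 query heads, 8 KV heads, head_dim 128, fp16
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)
print(gqa // 1024, "KiB per token with GQA")
print(mha // 1024, "KiB per token with one KV head per query head")
```

Under these assumptions the cache is four times smaller with GQA, which translates directly into longer contexts or larger batches on the same hardware.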

Mixtral 8x7B (MoE)

Total Params: 46.7B
Active Params: 12.9B
Context: 32K
Experts: 8 (2 active)
MoE Benefits:
  • Large capacity, small active footprint
  • Specialized expert routing
  • Better scaling efficiency
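
The "8 experts, 2 active" idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not Mixtral's code: the experts are stand-in functions that scale their input, the gate logits are made up, and the names top_k_route and moe_forward are hypothetical. It shows the general top-k pattern, in which a router scores every expert, only the best two run, and their outputs are mixed with softmax weights.

```python
import math


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax their logits into weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(weight * experts[i](x) for i, weight in top_k_route(gate_logits, k))


# Toy stand-ins: eight "experts" that just scale their input by 1..8
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # hypothetical router output

routes = top_k_route(gate_logits)
print(routes)
print(moe_forward(1.0, experts, gate_logits))
```

Six of the eight experts never execute for this input, which is the source of the small active footprint. The active share (12.9B of 46.7B) is larger than 2/8 because attention layers and embeddings are shared by every token.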

📊 Performance Comparison

MMLU (Mistral 7B): 64%
MMLU (Mixtral 8x7B): 71%

# Mistral deployment with vLLM for efficiency
from vllm import LLM, SamplingParams

# High-throughput serving
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    tensor_parallel_size=1,
    dtype="half",
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200,
)
prompts = ["Explain machine learning", "Write a Python function"]
outputs = llm.generate(prompts, sampling_params)

# Each result carries the prompt and its generated completions
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

Apache 2.0 · Commercial Use OK · High Efficiency · vLLM Optimized
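
The temperature and top_p settings used above control how the next token is drawn. The sketch below is a conceptual, pure-Python version of temperature scaling followed by nucleus (top-p) filtering. It is not vLLM's implementation, which works on batched tensors, and the function names and the four-token vocabulary are hypothetical.

```python
import math
import random


def top_p_distribution(logits, temperature=0.7, top_p=0.9):
    """Return {token_index: probability} after temperature scaling and nucleus filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, temperature=0.7, top_p=0.9):
    """Draw one token index from the filtered distribution."""
    dist = top_p_distribution(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for a four-token vocabulary
dist = top_p_distribution(logits)
print(dist)
print(sample(logits, random.Random(0)))
```

Lower temperatures sharpen the distribution toward the top token, while a smaller top_p discards more of the unlikely tail before sampling.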

🔮 Community & Specialized Models

TinyLlama 1.1B

Focus: Ultra-lightweight deployment

Parameters: 1.1B
Training Tokens: 3T
Memory: ~2GB
Speed: Very fast

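# Extremely lightweight deployment
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,
)
# Small enough to target mobile devices (typically after quantization)

Apache 2.0 · Mobile Ready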
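
The "~2GB" memory figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces it for several precisions. The function name weight_memory_gb is hypothetical, and the estimate covers weights only, so activations, the KV cache, and framework overhead come on top.

```python
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype="float16"):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"TinyLlama 1.1B in {dtype}: {weight_memory_gb(1.1e9, dtype):.2f} GB")
```

At float16 the weights need about 2.2 GB, and 4-bit quantization brings that to roughly 0.55 GB, which is what makes phone-class deployment plausible.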

Llama 3.2 (1B & 3B)

Focus: Edge-optimized performance

Llama 3.2 1B: On-device
Llama 3.2 3B: Edge servers
Context: 128K
Optimization: Mobile/edge

# Llama 3.2 edge deployment
# Gated repository: accept the license on Hugging Face and log in first
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Optimized for edge deployment

Llama 3.2 Community License · Edge Optimized

🎯 Specialized & Fine-tuned Variants

Code-Specialized

  • CodeLlama 7B: Code generation expert
  • Phind-CodeLlama: Enhanced for coding
  • DeepSeek-Coder: Multi-language coding

Domain-Specific

  • MedAlpaca: Medical knowledge
  • FinGPT: Financial applications
  • LawGPT: Legal document analysis

🎯 Model Selection Guide

Use Case           | Recommended Model | Key Reasons                          | Alternative
Mobile Apps        | Phi-3-Mini (3.8B) | ONNX support, optimized inference    | TinyLlama 1.1B
Edge Servers       | Mistral 7B        | Efficiency, sliding window attention | Llama 3.2 3B
Code Generation    | Phi-3-Mini        | Strong HumanEval performance         | CodeLlama 7B
Safety-Critical    | Gemma 7B Instruct | Extensive safety training            | Phi-3 with filters
High Throughput    | Mixtral 8x7B      | MoE efficiency, large capacity       | Mistral 7B
Research/Education | Gemma 2B          | Open weights, thorough documentation | TinyLlama 1.1B
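
If the guide needs to live in code, for example to pick a default model in a deployment script, it can be stored as a plain lookup. The sketch below simply mirrors the rows of the guide. SELECTION_GUIDE and recommend are hypothetical names, and the mapping is only as good as the guide itself.

```python
# use case -> (recommended model, alternative), copied from the selection guide
SELECTION_GUIDE = {
    "mobile apps": ("Phi-3-Mini (3.8B)", "TinyLlama 1.1B"),
    "edge servers": ("Mistral 7B", "Llama 3.2 3B"),
    "code generation": ("Phi-3-Mini", "CodeLlama 7B"),
    "safety-critical": ("Gemma 7B Instruct", "Phi-3 with filters"),
    "high throughput": ("Mixtral 8x7B", "Mistral 7B"),
    "research/education": ("Gemma 2B", "TinyLlama 1.1B"),
}


def recommend(use_case):
    """Look up the recommended and alternative model for a use case."""
    key = use_case.strip().lower()
    if key not in SELECTION_GUIDE:
        raise KeyError(f"No entry for {use_case!r}; known: {sorted(SELECTION_GUIDE)}")
    primary, alternative = SELECTION_GUIDE[key]
    return {"recommended": primary, "alternative": alternative}


print(recommend("Mobile Apps"))
print(recommend("High Throughput"))
```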

🏆 Best Practices for SLM Selection

  • Benchmark Alignment: Choose models tested on tasks similar to yours
  • License Compatibility: Ensure licensing matches your use case (commercial vs research)
  • Hardware Constraints: Model size must fit your deployment environment
  • Inference Framework: Check compatibility with your serving infrastructure
  • Fine-tuning Needs: Some models fine-tune better than others
  • Community Support: Active communities provide better long-term support
  • Update Frequency: Consider how often models are updated and improved
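
Two items on this checklist, license compatibility and hardware constraints, are mechanical enough to automate as a first screening pass. The sketch below is one possible approach under stated assumptions: the catalog uses the parameter counts and license tags quoted in this section, while the 1.2 overhead factor and the names Candidate and shortlist are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    params_billion: float
    license: str


CATALOG = [
    Candidate("TinyLlama 1.1B", 1.1, "Apache 2.0"),
    Candidate("Gemma 2B", 2.5, "Gemma License"),
    Candidate("Phi-3-Mini", 3.8, "MIT"),
    Candidate("Mistral 7B", 7.2, "Apache 2.0"),
    Candidate("Mixtral 8x7B", 46.7, "Apache 2.0"),
]


def shortlist(catalog, memory_budget_gb, allowed_licenses, bytes_per_param=2.0, overhead=1.2):
    """Keep models whose weights, padded by a rough overhead factor, fit the budget."""
    fits = []
    for model in catalog:
        # Billions of parameters times bytes per parameter gives GB directly
        needed_gb = model.params_billion * bytes_per_param * overhead
        if needed_gb <= memory_budget_gb and model.license in allowed_licenses:
            fits.append((model.name, round(needed_gb, 1)))
    return fits


print(shortlist(CATALOG, memory_budget_gb=10, allowed_licenses={"MIT", "Apache 2.0"}))
```

A pass like this only narrows the field. The remaining checklist items, such as benchmark alignment and fine-tuning behavior, still need evaluation on your own tasks.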

Pro Tip: Start with the most popular model in your size category (e.g., Phi-3-Mini for 3-4B, Mistral 7B for 7B), then optimize based on your specific requirements.