2.3 Popular Open SLM Families

🎯 Learning Objectives

Explore leading open-source SLM families and their characteristics
Compare capabilities and benchmarks across different model series
Understand licensing and commercial usage considerations
Learn practical deployment examples for each family

🔷 Microsoft Phi Series

Phi-3 Family Overview

Philosophy: "Small language models can be as capable as much larger ones when trained on high-quality data"

Model	Parameters	Context Length	Training Data	Key Strengths
Phi-3-Mini	3.8B	4K / 128K	3.3T tokens	Reasoning, Math, Code
Phi-3-Small	7B	8K / 128K	4.8T tokens	Enhanced multilingual
Phi-3-Medium	14B	4K / 128K	4.8T tokens	Complex reasoning

📊 Benchmark Performance (Phi-3-Mini)

MMLU: 69%
HumanEval: 61%
GSM8K: 87%

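# Phi-3 deployment example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision halves weight memory vs. float32
    device_map="auto",          # requires the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Optimized for instruction following: wrap the prompt in the model's chat format
messages = [{"role": "user", "content": "Explain quantum computing in simple terms"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))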

MIT License | Commercial Use OK | ONNX Support | Mobile Optimized

🔶 Google Gemma Series

Gemma Family Overview

Philosophy: "Built from the same research and technology used to create Gemini models"

Model	Parameters	Context Length	Variants	Key Features
Gemma 2B	2.5B	8K	Base, Instruct	Ultra-lightweight
Gemma 7B	8.5B	8K	Base, Instruct	Balanced performance
Gemma 2 9B	9.2B	8K	Base, Instruct	Next-gen architecture

📊 Safety & Responsible AI Focus

Responsible Generative AI Toolkit: Companion guidance and safety classifiers
Comprehensive Filtering: Filtered training data plus extensive safety tuning
Debugging Tools: LIT (Learning Interpretability Tool) integration
Model Cards: Detailed documentation and limitations

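# Gemma instruction-tuned model with its chat template
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Built-in instruction following
chat = [
    {"role": "user", "content": "Write a Python function to sort a list"},
]
# add_generation_prompt appends the model-turn header so the model answers
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# The rendered template already starts with <bos>, so do not add it a second time
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=500)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))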
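To make the chat template step less opaque, here is a minimal sketch of the turn format that Gemma's instruction-tuned checkpoints expect. The helper name `format_gemma_prompt` is hypothetical, and the layout is written out by hand from the published Gemma prompt format; the tokenizer's own `apply_chat_template` remains the authoritative source.

```python
def format_gemma_prompt(messages):
    """Render chat messages in Gemma's turn format (illustrative helper)."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma calls the assistant role "model"
        role = "model" if message["role"] == "assistant" else message["role"]
        content = message["content"].strip()
        parts.append(f"<start_of_turn>{role}\n{content}<end_of_turn>\n")
    # Open a model turn so generation continues as the assistant
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a Python function to sort a list"}]
print(format_gemma_prompt(chat))
```

Seeing the raw string helps when debugging duplicated `<bos>` tokens or a missing model-turn header, two common causes of poor output from instruction-tuned checkpoints.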

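Gemma License | Commercial Use* | Safety First | JAX/PyTorch

* Commercial use is permitted under the Gemma Terms of Use, subject to its Prohibited Use Policy.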

🔸 Mistral AI Family

Mistral Open Models

Philosophy: "Efficiency and performance through architectural innovations"

Mistral 7B v0.3

Parameters:	7.2B
Context:	32K (extended)
Architecture:	Transformer + optimizations
Training:	High-quality curated data

Key Innovations:

Sliding Window Attention (introduced in v0.1; v0.2 and later attend over the full 32K context instead)
Grouped-Query Attention (GQA)
Efficient tokenization
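The two attention ideas in this list can be illustrated without any model weights. The sketch below builds a sliding-window causal mask and estimates key/value cache size per token under grouped-query attention. Both function names are hypothetical, and the layer and head counts are assumed from the published Mistral 7B architecture (32 layers, 32 query heads, 8 KV heads, head dimension 128).

```python
def sliding_window_mask(seq_len, window):
    """Causal mask where token i sees only the last `window` positions, itself included."""
    return [
        [(i - window < j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of cache stored per token: one key and one value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Each row is a query position; 1 marks a key position it may attend to
for row in sliding_window_mask(seq_len=6, window=3):
    print("".join("1" if allowed else "0" for allowed in row))

# Assumed Mistral 7B shape: 8 KV heads shared by 32 query heads, float16 cache
gqa = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
mha = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
print(f"GQA: {gqa // 1024} KiB per token, full multi-head: {mha // 1024} KiB per token")
```

Under these assumptions GQA shrinks the cache by a factor of four, which matters at long context: 128 KiB per token across a 32K-token sequence is about 4 GiB of cache for a single request.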

Mixtral 8x7B (MoE)

Total Params:	46.7B
Active Params:	12.9B
Context:	32K
Experts:	8 (2 active)

MoE Benefits:

Large capacity, small active footprint
Specialized expert routing
Better scaling efficiency
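A toy version of top-2 routing shows why only a fraction of the parameters is active per token. This is a sketch in plain Python, not Mixtral's implementation: the experts are hypothetical scalar functions standing in for feed-forward blocks, the router scores are made up, and the gate applies a softmax over the selected logits as described for Mixtral.

```python
import math


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(expert, e / total) for expert, e in zip(top, exps)]


def moe_output(token, experts, router_logits, k=2):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    return sum(weight * experts[idx](token) for idx, weight in route_top_k(router_logits, k))


# Toy setup: 8 experts, each a simple scaling function (stand-ins for FFN blocks)
experts = [lambda x, scale=s: scale * x for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores for one token

print(route_top_k(logits))
print(moe_output(1.0, experts, logits))
```

With 2 of 8 experts evaluated per token, compute scales with the active parameters (12.9B of 46.7B, about 28%), while memory must still hold every expert.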

📊 Performance Comparison

MMLU (Mistral 7B): 64%
MMLU (Mixtral 8x7B): 71%

# Mistral deployment with vLLM for efficiency
from vllm import LLM, SamplingParams

# High-throughput serving
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    tensor_parallel_size=1,  # number of GPUs to shard the model across
    dtype="half",
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=200,
)

# The engine batches and schedules all prompts together
prompts = ["Explain machine learning", "Write a Python function"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)

Apache 2.0 | Commercial Use OK | High Efficiency | vLLM Optimized

🔮 Community & Specialized Models

TinyLlama 1.1B

Focus: Ultra-lightweight deployment

Parameters:	1.1B
Training Tokens:	3T
Memory:	~2GB
Speed:	Very fast
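The memory figure follows directly from the parameter count: weights take roughly parameters times bytes per parameter. A small sketch of that arithmetic, with a hypothetical helper name; it covers weights only and ignores activations, KV cache, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


tinyllama_params = 1.1e9  # parameter count from the table above

for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(tinyllama_params, bits):.2f} GB")
```

At 16 bits per parameter this gives about 2.2 GB, in line with the ~2GB listed; 4-bit quantization brings the weights to roughly a quarter of that.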

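# Extremely lightweight deployment
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # about 2.2 GB of weights at 2 bytes per parameter
)
# Small enough for mobile-class hardware, typically after quantization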

Apache 2.0 | Mobile Ready

Llama 3.2 (1B & 3B)

Focus: Edge-optimized performance

Llama 3.2 1B:	On-device
Llama 3.2 3B:	Edge servers
Context:	128K
Optimization:	Mobile/edge

# Llama 3.2 edge deployment (gated repository: accept the license on Hugging Face first)
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Optimized for edge deployment

Llama 3.2 Community License | Edge Optimized

🎯 Specialized & Fine-tuned Variants

Code-Specialized

Code Llama 7B: Code generation expert
Phind-CodeLlama: Enhanced for coding
DeepSeek-Coder: Multi-language coding

Domain-Specific

MedAlpaca: Medical knowledge
FinGPT: Financial applications
LawGPT: Legal document analysis

🎯 Model Selection Guide

Use Case	Recommended Model	Key Reasons	Alternative
Mobile Apps	Phi-3-Mini (3.8B)	ONNX support, optimized inference	TinyLlama 1.1B
Edge Servers	Mistral 7B	Efficiency, grouped-query attention	Llama 3.2 3B
Code Generation	Phi-3-Mini	Strong HumanEval performance	CodeLlama 7B
Safety-Critical	Gemma 7B Instruct	Extensive safety training	Phi-3 with filters
High Throughput	Mixtral 8x7B	MoE efficiency, large capacity	Mistral 7B
Research/Education	Gemma 2B	Small footprint, thorough documentation	TinyLlama 1.1B

🏆 Best Practices for SLM Selection

Benchmark Alignment: Choose models tested on tasks similar to yours
License Compatibility: Ensure licensing matches your use case (commercial vs research)
Hardware Constraints: Model size must fit your deployment environment
Inference Framework: Check compatibility with your serving infrastructure
Fine-tuning Needs: Models differ in how well they adapt; check for base checkpoints and community fine-tuning recipes
Community Support: Active communities provide better long-term support
Update Frequency: Consider how often models are updated and improved
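The hardware and license checks from this list can be combined into a quick shortlist. The sketch below is illustrative only: the `Candidate` class and `shortlist` function are hypothetical, the parameter counts and licenses are the ones given in this section, and the memory estimate covers weights alone, so leave headroom for the KV cache and runtime.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    params_billion: float
    license_name: str


# Parameter counts and licenses as listed in this section
CANDIDATES = [
    Candidate("TinyLlama 1.1B", 1.1, "Apache 2.0"),
    Candidate("Gemma 2B", 2.5, "Gemma License"),
    Candidate("Phi-3-Mini", 3.8, "MIT"),
    Candidate("Mistral 7B", 7.2, "Apache 2.0"),
    Candidate("Mixtral 8x7B", 46.7, "Apache 2.0"),  # all experts must be resident
]


def shortlist(candidates, memory_budget_gb, bits_per_param=16, allowed_licenses=None):
    """Keep models whose weights fit the budget and whose license is acceptable."""
    kept = []
    for c in candidates:
        weights_gb = c.params_billion * bits_per_param / 8  # billions of params * bytes each
        if weights_gb > memory_budget_gb:
            continue
        if allowed_licenses is not None and c.license_name not in allowed_licenses:
            continue
        kept.append(c.name)
    return kept


print(shortlist(CANDIDATES, memory_budget_gb=8, allowed_licenses={"MIT", "Apache 2.0"}))
print(shortlist(CANDIDATES, memory_budget_gb=8, bits_per_param=4))
```

A filter like this only narrows the field; the remaining candidates still need to be evaluated on your own task, as the first item in the list recommends.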

Pro Tip: Start with the most popular model in your size category (e.g., Phi-3-Mini for 3-4B, Mistral 7B for 7B), then optimize based on your specific requirements.
