3.1 Architecture Recap & Scaling Laws
Learning Objectives
- Understand the transformer architecture that powers modern LLMs
- Explore scaling laws and parameter count implications
- Learn computational requirements for training and inference
- Analyze the relationship between model size and capabilities
Transformer Architecture Deep Dive
LLM Architecture Stack
- Output Layer: Vocabulary projection + Softmax
- Transformer Layers (N×): Typically 12-96+ layers
  - Multi-Head Attention: Self-attention mechanism
  - Feed Forward Network: 2-layer MLP with activation
Note: Each transformer layer includes residual connections and layer normalization for training stability, plus dropout for regularization during training.
Multi-Head Attention Example
For a 12-layer model with 12 attention heads per layer (Head 1 through Head 12), each head can learn different linguistic patterns: syntax, semantics, long-range dependencies.
Attention Mechanism (Simplified)
import math

import torch
import torch.nn.functional as F


def multi_head_attention(query, key, value, num_heads):
    """
    Multi-head self-attention mechanism (simplified: no learned projections or masking).

    Args:
        query, key, value: Input tensors [batch, seq_len, d_model]
        num_heads: Number of attention heads (must divide d_model evenly)
    Returns:
        Tensor of shape [batch, seq_len, d_model]
    """
    batch_size, seq_len, d_model = query.shape
    head_dim = d_model // num_heads

    # Split into multiple heads: [batch, num_heads, seq_len, head_dim]
    q = query.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    k = key.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    v = value.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

    # Scaled dot-product attention
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, v)

    # Concatenate heads back to [batch, seq_len, d_model]
    output = output.transpose(1, 2).contiguous().view(
        batch_size, seq_len, d_model
    )
    return output


# Example: GPT-3 scale (175B parameters)
# - 96 layers
# - 96 attention heads per layer (head_dim = 128)
# - 12,288 dimensional embeddings
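To make the scaled dot-product step concrete, here is a minimal single-head sketch in plain Python (no tensors, no learned projections). The toy input values and the helper names `softmax` and `single_head_attention` are illustrative assumptions, not part of any library.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are lists of vectors."""
    head_dim = len(q[0])
    weights = []
    for q_vec in q:
        # Similarity of this query with every key, scaled by sqrt(head_dim)
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(head_dim)
                  for k_vec in k]
        weights.append(softmax(scores))
    # Each output vector is a weighted average of the value vectors
    output = [[sum(w * v_vec[c] for w, v_vec in zip(w_row, v))
               for c in range(len(v[0]))]
              for w_row in weights]
    return weights, output


# Toy input: 3 tokens, head_dim = 2 (values chosen only for illustration)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, output = single_head_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors. A multi-head layer repeats this computation once per head on a different slice of the embedding.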
Scaling Laws & Parameter Growth
Model Scale Evolution (2018-2024)
| Model Component | GPT-3 (175B) | PaLM (540B) | GPT-4 (~1.8T, unconfirmed) | Impact on Capabilities |
|---|---|---|---|---|
| Layers | 96 | 118 | ~120 | Deeper reasoning chains |
| Hidden Size | 12,288 | 18,432 | ~20,480 | Richer representations |
| Attention Heads | 96 | 48 | ~128 | More diverse attention patterns |
| Context Length | 2,048 | 2,048 | 8,192-128K | Long-form understanding |
| Vocab Size | 50,257 | 256,000 | ~100,000 | Better tokenization efficiency |
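The layer count and hidden size in the table largely determine the parameter count. The sketch below uses a common rule of thumb for GPT-style decoders, roughly 12 × d_model² parameters per layer plus the token embedding. The function name and the assumption that the feed-forward width is 4 × d_model are illustrative; the rule does not carry over directly to models with different block designs, such as PaLM.

```python
def estimate_transformer_params(num_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style decoder (assumes d_ff = 4 * d_model).

    Per layer: 4 * d_model^2 for attention (Q, K, V, and output projections)
             + 8 * d_model^2 for the feed-forward network.
    Biases, layer norms, and position embeddings are ignored.
    """
    per_layer = 12 * d_model ** 2
    embedding = vocab_size * d_model
    return num_layers * per_layer + embedding


# GPT-3 column of the table above
gpt3_estimate = estimate_transformer_params(num_layers=96, d_model=12288, vocab_size=50257)
print(f"GPT-3 estimate: {gpt3_estimate / 1e9:.1f}B parameters")
```

For the GPT-3 configuration the estimate lands within about 1% of the published 175B figure, which shows that parameter count grows linearly with depth but quadratically with hidden size.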
Chinchilla Scaling Laws (Hoffmann et al., 2022)
Key Insight: Optimal model performance requires balanced scaling of parameters and training data.
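- Parameter-Data Ratio: ~20 training tokens per parameter for compute-optimal training
- Compute Scaling: C ≈ 6 × N × D FLOPs (N=params, D=training tokens, C=compute); for a fixed budget, the optimal N and D each grow roughly as C^0.5
- Implication: Many large models of that period were undertrained for their size (GPT-3 saw ~300B tokens, under 2 tokens per parameter)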
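The 20-tokens-per-parameter rule can be turned into a quick calculation. This sketch assumes the widely used approximation C ≈ 6 × N × D for training FLOPs; the function names are illustrative, and the token counts are the published figures for GPT-3 (~300B tokens) and Chinchilla (70B parameters, ~1.4T tokens).

```python
def chinchilla_optimal_tokens(num_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule."""
    return num_params * tokens_per_param


def training_flops(num_params, num_tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens


models = [("GPT-3", 175e9, 300e9), ("Chinchilla", 70e9, 1.4e12)]
for name, n_params, n_tokens in models:
    ratio = n_tokens / n_params
    optimal = chinchilla_optimal_tokens(n_params)
    flops = training_flops(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens/param, "
          f"optimal tokens {optimal:.2e}, compute {flops:.2e} FLOPs")
```

Under these assumptions GPT-3 was trained on far fewer tokens than the rule suggests for its size, while Chinchilla sits at the 20:1 ratio with a similar order of compute.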
Computational Requirements & Hardware
Training Requirements
| Metric | GPT-3 (175B) |
|---|---|
| Training Compute | ~3.14 × 10²³ FLOPs |
| Training Time | ~34 days on 1,024 A100s (estimated) |
| Energy Cost | ~1,287 MWh |
| Estimated Cost | $4.6M (compute only) |
Note: GPT-4 training costs are not public; unofficial estimates put them at 10-100x these figures.
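Training time follows from total FLOPs divided by the sustained throughput of the cluster. The sketch below is a back-of-the-envelope estimate: the A100 peak of 312 TFLOPS (dense FP16/BF16) is the vendor's published figure, while the 33% utilization is an assumed value that varies widely in practice. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400
A100_PEAK_FLOPS = 312e12  # dense FP16/BF16 tensor-core peak per GPU


def estimate_training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Days needed if the cluster sustains `utilization` of its peak throughput."""
    effective_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / effective_flops_per_sec / SECONDS_PER_DAY


# GPT-3 compute budget on 1,024 A100s, assuming 33% utilization
days = estimate_training_days(3.14e23, 1024, A100_PEAK_FLOPS, utilization=0.33)
print(f"Estimated training time: {days:.1f} days")
```

With these inputs the estimate comes out in the mid-30s of days, in line with the training-time figure above. Halving the assumed utilization doubles the result, which is why hardware efficiency matters as much as raw GPU count.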
Inference Requirements
| Metric | 175B model |
|---|---|
| Memory (FP16) | ~350GB for weights |
| GPU Requirements | 8×A100 (80GB) in practice |
| Throughput | ~20-50 tokens/sec |
| Cost per 1M tokens | $1-20 depending on provider |
Optimization: Quantizing weights from FP16 to INT8 or INT4 can reduce weight memory by 2-4x.
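As a rough illustration of how quantization saves memory, the sketch below applies symmetric absmax INT8 quantization to a handful of weights using only the standard library. The scheme, the function names, and the sample weights are assumptions for illustration; production quantizers typically work per channel or per group and use more careful calibration. The baseline here is FP32 storage (4 bytes), so the saving is 4x; against FP16 the same scheme gives 2x.

```python
from array import array


def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / 127 if absmax > 0 else 1.0
    quantized = array("b", (round(w / scale) for w in weights))
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float weights from the INT8 codes."""
    return [q * scale for q in quantized]


weights_fp32 = array("f", [0.12, -0.5, 0.33, 0.9, -0.77, 0.05])
q_weights, scale = quantize_int8(weights_fp32)
restored = dequantize_int8(q_weights, scale)

fp32_bytes = len(weights_fp32) * weights_fp32.itemsize
int8_bytes = len(q_weights) * q_weights.itemsize
max_error = max(abs(a - b) for a, b in zip(weights_fp32, restored))
print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes, max error: {max_error:.4f}")
```

The reconstruction error is bounded by half the quantization step (`scale / 2`), which is the price paid for the smaller footprint.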
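Memory Breakdown
| Component | Share of Memory |
|---|---|
| Model Weights | 70-80% |
| KV Cache | 10-20% |
| Activations | 5-10% |
| Framework Overhead | 2-5% |
KV Cache: Grows linearly with context length and batch size, so its share rises for long contexts or large batches.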
Memory Calculator Example
def calculate_llm_memory(num_params, precision="fp16", batch_size=1, seq_len=2048,
                         hidden_size=12288, num_layers=96):
    """
    Estimate LLM inference memory in GiB (1 GiB = 1024**3 bytes).

    hidden_size and num_layers default to GPT-3 (175B) values. Pass the real
    values for other models, since neither scales linearly with parameter count.
    """
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    bytes_per_value = bytes_per_param[precision]

    # Model weights
    model_memory_gb = (num_params * bytes_per_value) / (1024**3)

    # KV cache (approximate): keys + values for every layer and position
    # 2 * hidden_size * seq_len * batch_size * bytes per value * num_layers
    kv_cache_gb = (2 * hidden_size * seq_len * batch_size *
                   bytes_per_value * num_layers) / (1024**3)

    # Activation memory (rough estimate: 10% of the weights)
    activation_gb = model_memory_gb * 0.1

    total_memory_gb = model_memory_gb + kv_cache_gb + activation_gb
    return {
        "model_weights_gb": round(model_memory_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "activations_gb": round(activation_gb, 1),
        "total_memory_gb": round(total_memory_gb, 1)
    }


# Example: GPT-3 scale model
result = calculate_llm_memory(175e9, "fp16", batch_size=1, seq_len=2048)
print(f"GPT-3 Memory Requirements: {result}")
# Output: GPT-3 Memory Requirements: {'model_weights_gb': 326.0, 'kv_cache_gb': 9.0, 'activations_gb': 32.6, 'total_memory_gb': 367.6}
Architectural Innovations & Optimizations
Attention Optimizations
- Flash Attention: Memory-efficient attention computation
- Multi-Query Attention (MQA): Shared key/value across heads
- Grouped Query Attention (GQA): Balance between MHA and MQA
- Sparse Attention: Reduce O(n²) complexity
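Impact: 2-8x speedup with minimal quality loss (Flash Attention computes exact attention; MQA, GQA, and sparse attention are approximations or architectural changes)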
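MQA and GQA mainly shrink the KV cache, because fewer key/value heads are stored per layer. The sketch below compares the three variants on a GPT-3-like shape (96 layers, head dimension 128, 2,048 tokens, FP16). The function name is illustrative, and the GQA and MQA rows are hypothetical configurations, since GPT-3 itself uses standard multi-head attention.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


config = dict(num_layers=96, head_dim=128, seq_len=2048)  # GPT-3-like shape
for label, kv_heads in [("MHA (96 KV heads)", 96), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]:
    size = kv_cache_bytes(num_kv_heads=kv_heads, **config)
    print(f"{label:<18} {size / GIB:.3f} GiB")
```

The cache shrinks in direct proportion to the number of KV heads, which is why GQA is often chosen as a middle ground between full multi-head attention and MQA.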
Activations, Normalization & Position Encoding
- SwiGLU: Swish + Gated Linear Unit (PaLM, LLaMA)
- GeGLU: GELU + Gated Linear Unit (T5 v1.1)
- RMSNorm: Root Mean Square Layer Normalization
- Rotary Position Embedding (RoPE): Encodes relative position by rotating query and key vectors
Benefit: Better training dynamics and performance
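Parallel Training Strategies
- Data Parallelism: Replicate the model and distribute batches across GPUs
- Model Parallelism: Split the model itself across devices (umbrella term for the two strategies below)
- Pipeline Parallelism: Assign groups of layers to different devices and pipeline micro-batches through them
- Tensor Parallelism: Split individual operations (e.g. large matrix multiplications) across devices
Result: Train models larger than single GPU memory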
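A simple way to see why these strategies are combined is to compute how much weight memory each GPU holds. The sketch below assumes weights are split evenly across tensor-parallel and pipeline-parallel ranks and fully replicated across data-parallel ranks. It ignores gradients, optimizer state, and activations, and the function names are illustrative.

```python
def per_gpu_weight_gb(num_params, bytes_per_param, tensor_parallel, pipeline_parallel):
    """Weight memory per GPU (GB) when the model is sharded evenly across TP x PP ranks."""
    total_gb = num_params * bytes_per_param / 1e9
    return total_gb / (tensor_parallel * pipeline_parallel)


def total_gpus(data_parallel, tensor_parallel, pipeline_parallel):
    """Each data-parallel replica needs its own TP x PP group of GPUs."""
    return data_parallel * tensor_parallel * pipeline_parallel


# 175B parameters in FP16, sharded 8-way tensor parallel and 4-way pipeline parallel
shard_gb = per_gpu_weight_gb(175e9, 2, tensor_parallel=8, pipeline_parallel=4)
gpus = total_gpus(data_parallel=4, tensor_parallel=8, pipeline_parallel=4)
print(f"Weights per GPU: {shard_gb:.1f} GB across {gpus} GPUs")
```

Data parallelism raises throughput but does not reduce the per-GPU weight footprint; only the tensor and pipeline dimensions do.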
Flash Attention Implementation Concept
import math

import torch


def flash_attention(Q, K, V, block_size=128):
    """
    Memory-efficient attention using block-wise computation.
    Reduces memory complexity from O(N²) to O(N).
    Q, K, V: tensors of shape [N, d] (single head, no batch dimension).
    """
    N, d = Q.shape
    O = torch.zeros_like(Q)                             # normalized running output
    l = torch.zeros(N, dtype=Q.dtype, device=Q.device)  # running softmax denominators
    m = torch.full((N,), float('-inf'), dtype=Q.dtype, device=Q.device)  # running row maxima

    # Process in blocks to fit in fast memory (SRAM)
    for j in range(0, N, block_size):
        # Load blocks of K and V
        K_j = K[j:j+block_size]
        V_j = V[j:j+block_size]
        for i in range(0, N, block_size):
            rows = slice(i, i + block_size)
            # Load block of Q
            Q_i = Q[rows]
            # Compute attention scores for this block
            S_ij = Q_i @ K_j.T / math.sqrt(d)
            # Online softmax update (numerically stable)
            m_new = torch.maximum(m[rows], S_ij.max(dim=1).values)
            P_ij = torch.exp(S_ij - m_new.unsqueeze(1))
            l_old_scaled = torch.exp(m[rows] - m_new) * l[rows]
            l_new = l_old_scaled + P_ij.sum(dim=1)
            # Update output: undo the old normalization with the rescaled old
            # denominator, add the new block's contribution, renormalize once
            O[rows] = (O[rows] * l_old_scaled.unsqueeze(1) + P_ij @ V_j) / l_new.unsqueeze(1)
            # Update statistics
            m[rows] = m_new
            l[rows] = l_new
    return O

# This enables training much longer sequences with the same memory
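The online softmax update is the core trick in the block-wise loop above, and it can be checked in isolation. This plain-Python sketch streams over a score vector in blocks, keeping only a running maximum `m` and a running denominator `l`, then compares the result with an ordinary softmax. The function names and sample scores are illustrative.

```python
import math


def softmax_reference(scores):
    """Standard softmax computed with the full score vector in memory."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(scores, block_size):
    """Stream over blocks, keeping only a running max m and running sum l."""
    m = float("-inf")
    l = 0.0
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        m_new = max(m, max(block))
        # Rescale the old sum to the new maximum, then add this block's terms
        l = l * math.exp(m - m_new) + sum(math.exp(s - m_new) for s in block)
        m = m_new
    return m, l


scores = [2.0, -1.0, 0.5, 3.5, 1.0, -2.0, 4.0]
m, l = online_softmax_stats(scores, block_size=3)
streamed = [math.exp(s - m) / l for s in scores]
reference = softmax_reference(scores)
print(max(abs(a - b) for a, b in zip(streamed, reference)))
```

Because the running statistics reproduce the exact softmax denominator, block-wise attention gives the same result as the standard computation without ever materializing the full N × N score matrix.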
LLM Architecture Key Takeaways
Scaling Principles:
- Balance parameters, data, and compute (Chinchilla laws)
- More layers → better reasoning capabilities
- More attention heads → diverse pattern recognition
- Longer context → better long-form understanding
Optimization Priorities:
- Memory efficiency (Flash Attention, quantization)
- Training stability (better norms, activations)
- Inference speed (KV caching, model parallelism)
- Cost efficiency (MoE, pruning, distillation)
Future Trends: Mixture of Experts (MoE), multimodal architectures, and specialized hardware (TPUs, neuromorphic chips) are pushing the boundaries of what's possible with LLM architectures.