3.1 Architecture Recap & Scaling Laws

🎯 Learning Objectives

  • Understand the transformer architecture that powers modern LLMs
  • Explore scaling laws and parameter count implications
  • Learn computational requirements for training and inference
  • Analyze the relationship between model size and capabilities

🏗️ Transformer Architecture Deep Dive

LLM Architecture Stack

Output Layer: Vocabulary projection + softmax
Transformer Layers (N×): Typically 12-96+ layers, each containing:
  • Multi-Head Attention: Self-attention mechanism
  • Feed Forward Network: 2-layer MLP with activation
Token + Position Embeddings: Input representation

Note: Each transformer layer wraps its attention and feed-forward sublayers in residual connections and layer normalization for training stability, and may apply dropout for regularization.
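As a rough sanity check on how this stack turns into a parameter count, the sketch below uses the common approximation of about 12 × d_model² weights per layer (4 × d_model² for the attention projections plus 8 × d_model² for a feed-forward network with a 4× hidden expansion), plus the token embedding matrix. The function name and the `ffn_mult` parameter are hypothetical, and biases, layer norms, and position embeddings are ignored.

```python
def estimate_transformer_params(num_layers, d_model, vocab_size, ffn_mult=4):
    """Approximate parameter count of a decoder-only transformer.

    Assumes a standard (non-gated) FFN with hidden size ffn_mult * d_model
    and ignores biases, layer norms, and position embeddings.
    """
    attention = 4 * d_model * d_model                 # Q, K, V, and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # up- and down-projection
    embeddings = vocab_size * d_model                 # token embedding matrix
    return num_layers * (attention + feed_forward) + embeddings


# GPT-3 (175B) shape: 96 layers, d_model 12,288, vocabulary 50,257
total = estimate_transformer_params(96, 12288, 50257)
print(f"{total / 1e9:.1f}B parameters")  # prints "174.6B parameters"
```

The estimate lands within about 0.3% of the nominal 175B figure. It will be less accurate for models that use gated FFNs or shared key/value heads.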

🔍 Multi-Head Attention Example

For a 12-layer model with 12 attention heads per layer:

Head 1 | Head 2 | Head 3 | ... | Head 12

Each head can specialize in a different kind of pattern, such as syntax, semantics, or long-range dependencies.

🧮 Attention Mechanism (Simplified)

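import math

import torch
import torch.nn.functional as F


def multi_head_attention(query, key, value, num_heads):
    """
    Multi-head self-attention mechanism (simplified: the learned Q/K/V and
    output projections and the causal mask are omitted)
    Args:
        query, key, value: Input tensors [batch, seq_len, d_model]
        num_heads: Number of attention heads (must divide d_model)
    """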
    batch_size, seq_len, d_model = query.shape
    head_dim = d_model // num_heads
    
    # Split into multiple heads
    q = query.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    k = key.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    v = value.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    
    # Scaled dot-product attention
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, v)
    
    # Concatenate heads
    output = output.transpose(1, 2).contiguous().view(
        batch_size, seq_len, d_model
    )
    return output

# Example configuration at GPT-3 scale
# - 96 layers
# - 96 attention heads per layer (head_dim = 12,288 / 96 = 128)
# - 12,288 dimensional embeddings
# - 175 billion parameters total
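The simplified function above does not apply the causal mask that decoder-only LLMs use to stop a token from attending to later positions. The following is a minimal plain-Python sketch of single-head causal attention on nested lists; the function names are hypothetical and it is meant for reading, not speed.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: lists of seq_len vectors, each a list of floats.
    Position i only attends to positions 0..i.
    """
    head_dim = len(q[0])
    outputs = []
    for i, q_i in enumerate(q):
        # Scores against keys at positions <= i; later keys are masked out
        scores = [
            sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors
        out = [
            sum(w * v[j][d] for j, w in enumerate(weights))
            for d in range(len(v[0]))
        ]
        outputs.append(out)
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = causal_attention(q, k, v)
print(result[0])  # [1.0, 2.0]: the first token can only see itself
```

In a tensor implementation the same effect is usually achieved by setting the masked scores to negative infinity before the softmax.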

📈 Scaling Laws & Parameter Growth

🔢 Model Scale Evolution (2018-2024)

BERT-Base (2018): 110M
GPT-2 (2019): 1.5B
GPT-3 (2020): 175B
PaLM (2022): 540B
GPT-4 (2023): ~1.8T (unofficial estimate)

Model Component | GPT-3 (175B) | PaLM (540B) | GPT-4 (~1.8T, unofficial estimates) | Impact on Capabilities
Layers | 96 | 118 | ~120 | Deeper reasoning chains
Hidden Size | 12,288 | 18,432 | ~20,480 | Richer representations
Attention Heads | 96 | 48 | ~128 | More diverse attention patterns
Context Length (tokens) | 2,048 | 2,048 | 8,192-128K | Long-form understanding
Vocab Size (tokens) | 50,257 | 256,000 | ~100,000 | Better tokenization efficiency

🧠 Chinchilla Scaling Laws (Hoffmann et al., 2022)

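Key Insight: For a fixed compute budget, the lowest loss comes from scaling parameters and training tokens together rather than growing parameters alone.

  • Parameter-Data Ratio: ~20 training tokens per parameter is compute-optimal
  • Compute Scaling: C ≈ 6 × N × D FLOPs, with the compute-optimal N and D each growing roughly as C^0.5 (N=params, D=training tokens, C=compute)
  • Implication: Many large models were undertrained for their size; the same compute would have been better spent on a smaller model trained on more tokens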
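The rule of thumb above can be turned into a small calculator. This sketch combines the ~20 tokens-per-parameter ratio with the standard C ≈ 6 × N × D approximation for training FLOPs; the constant and function names are hypothetical, and the exact ratio is an approximation rather than a precise law.

```python
import math

TOKENS_PER_PARAM = 20        # Chinchilla rule of thumb (approximate)
FLOPS_PER_PARAM_TOKEN = 6    # standard approximation: C ≈ 6 * N * D


def training_flops(num_params, num_tokens):
    """Approximate training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * num_params * num_tokens


def compute_optimal_size(compute_budget_flops):
    """Split a FLOP budget into (params, tokens) with D = 20 * N and C = 6 * N * D."""
    num_params = math.sqrt(
        compute_budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    return num_params, TOKENS_PER_PARAM * num_params


# GPT-3's training budget (~3.14e23 FLOPs) spent according to the rule
n, d = compute_optimal_size(3.14e23)
print(f"{n / 1e9:.0f}B params, {d / 1e12:.2f}T tokens")  # prints "51B params, 1.02T tokens"
```

Under these assumptions, GPT-3's compute budget would have favored a model roughly a third of its size trained on about three times as many tokens.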

⚔ Computational Requirements & Hardware

🏋️ Training Requirements

GPT-3 (175B): ~3.14 × 10²³ FLOPs
Training Time: ~34 days on 1024 A100s (an estimate for A100 hardware; the original run used V100s)
Energy Cost: ~1,287 MWh
Estimated Cost: ~$4.6M (compute only)
Note: GPT-4 training is estimated at 10-100x these costs; no official figures have been published
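To see how a FLOP count becomes a wall-clock estimate, the sketch below divides total compute by sustained cluster throughput. It assumes C ≈ 6 × N × D with GPT-3's ~300B training tokens and the A100's 312 TFLOPS dense FP16/BF16 peak; the utilization values are illustrative assumptions, and the function name is hypothetical.

```python
def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days needed to spend total_flops at a given hardware utilization."""
    sustained = num_gpus * peak_flops_per_gpu * utilization  # FLOP/s actually achieved
    return total_flops / sustained / 86_400                  # 86,400 seconds per day


A100_PEAK_FLOPS = 312e12          # dense FP16/BF16 tensor-core peak
gpt3_flops = 6 * 175e9 * 300e9    # C ≈ 6 * N * D, about 3.15e23 FLOPs

for utilization in (0.3, 0.4, 0.5):
    days = training_days(gpt3_flops, 1024, A100_PEAK_FLOPS, utilization)
    print(f"utilization {utilization:.0%}: {days:.0f} days")
```

With these inputs, the ~34 day figure corresponds to sustaining roughly one third of peak throughput across the cluster.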

🚀 Inference Requirements

Memory (FP16): ~350GB for a 175B model (weights only)
GPU Requirements: 8×A100 (80GB) is a typical deployment; the weights alone need at least 5 such GPUs
Throughput: ~20-50 tokens/sec
Cost per 1M tokens: $1-20 depending on provider
Optimization: Quantization to INT8 or INT4 can reduce weight memory by 2-4x
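The figures above can be approximated with two back-of-the-envelope formulas: the GPU count needed just to hold the weights, and a memory-bandwidth bound on decoding speed that assumes every weight is read once per generated token. The ~2,000 GB/s per-GPU bandwidth is an assumed round number for an A100 80GB, the function names are hypothetical, and the throughput figure is an upper bound for a single sequence that ignores communication and KV cache traffic.

```python
import math


def min_gpus_for_weights(num_params, bytes_per_param, gpu_memory_gb):
    """Smallest GPU count whose combined memory holds the weights alone."""
    weight_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)


def decode_tokens_per_sec(num_params, bytes_per_param, num_gpus, bandwidth_gb_per_s):
    """Bandwidth-bound decoding rate if all weights are read once per token."""
    weight_gb = num_params * bytes_per_param / 1e9
    return num_gpus * bandwidth_gb_per_s / weight_gb


print(min_gpus_for_weights(175e9, 2, 80))               # 5 (FP16 weights)
print(min_gpus_for_weights(175e9, 0.5, 80))             # 2 (4-bit weights)
print(round(decode_tokens_per_sec(175e9, 2, 8, 2000)))  # 46 tokens/sec across 8 GPUs
```

The bandwidth bound of about 46 tokens/sec sits at the top of the 20-50 range quoted above, which is consistent with decoding being limited by memory bandwidth rather than arithmetic.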

💾 Memory Breakdown

Model Weights: 70-80%
KV Cache: 10-20%
Activations: 5-10%
Framework Overhead: 2-5%
KV Cache: Grows linearly with both context length and batch size

🧮 Memory Calculator Example

def calculate_llm_memory(num_params, precision="fp16", batch_size=1, seq_len=2048,
                         num_layers=96, hidden_size=12288):
    """
    Estimate LLM inference memory in GiB (1024**3 bytes).
    num_layers and hidden_size default to GPT-3 (175B); pass the real
    values for other models, since they do not scale linearly with num_params.
    """
    # Model weights
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    model_memory_gb = (num_params * bytes_per_param[precision]) / (1024**3)
    
    # KV cache (approximate; assumes the cache uses the same precision as the weights)
    # Per layer: 2 (keys and values) * hidden_size * seq_len * batch_size * bytes per value
    kv_cache_gb = (2 * hidden_size * seq_len * batch_size * 
                   bytes_per_param[precision] * num_layers) / (1024**3)
    
    # Activation memory (rough estimate)
    activation_gb = model_memory_gb * 0.1
    
    total_memory_gb = model_memory_gb + kv_cache_gb + activation_gb
    
    return {
        "model_weights_gb": round(model_memory_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "activations_gb": round(activation_gb, 1),
        "total_memory_gb": round(total_memory_gb, 1)
    }

# Example: GPT-3 scale model
result = calculate_llm_memory(175e9, "fp16", batch_size=1, seq_len=2048)
print(f"GPT-3 Memory Requirements: {result}")
# Prints: GPT-3 Memory Requirements: {'model_weights_gb': 326.0, 'kv_cache_gb': 9.0, 'activations_gb': 32.6, 'total_memory_gb': 367.6}

🔬 Architectural Innovations & Optimizations

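🎯 Attention Optimizations

  • Flash Attention: Exact attention computed block by block, so the full N×N score matrix is never stored
  • Multi-Query Attention (MQA): One key/value head shared across all query heads
  • Grouped Query Attention (GQA): Query heads share key/value heads in groups, a middle ground between MHA and MQA
  • Sparse Attention: Attend to a subset of positions to reduce the O(n²) cost
Impact: Reported speedups are roughly 2-8x with minimal quality loss (Flash Attention itself is exact)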
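One concrete effect of MQA and GQA is a smaller KV cache, because only the key/value heads are cached. The sketch below compares the three schemes for a hypothetical GPT-3-shaped model (96 layers, 96 query heads of dimension 128, 2,048 tokens, FP16); the group count of 8 is an illustrative assumption, and the function name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The factor of 2 covers the separate key and value tensors
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


for name, kv_heads in [("MHA", 96), ("GQA (8 groups)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(96, kv_heads, 128, 2048) / 1024**3
    print(f"{name}: {gib:.3f} GiB")
# MHA: 9.000 GiB
# GQA (8 groups): 0.750 GiB
# MQA: 0.094 GiB
```

The cache shrinks in direct proportion to the number of key/value heads, which is why these variants matter most for long contexts and large batches.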

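🔧 Activations, Normalization & Position Encoding

  • SwiGLU: Swish + Gated Linear Unit (PaLM, LLaMA)
  • GeGLU: GELU + Gated Linear Unit (T5 v1.1)
  • RMSNorm: Root Mean Square Layer Normalization, a cheaper LayerNorm variant that skips mean-centering
  • Rotary Position Embedding (RoPE): Encodes position by rotating query/key vectors, so attention scores depend on relative position
Benefit: Better training dynamics and performance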
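Two of the components listed above are simple enough to write out directly. This is a minimal plain-Python sketch of RMSNorm and of the elementwise gating step in SwiGLU; the function names are hypothetical, and the linear projections that would produce the gate and up inputs in a real feed-forward layer are omitted.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then scale by a learned per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_gate(gate_proj, up_proj):
    """Elementwise SwiGLU gating: Swish(gate) * up.

    gate_proj and up_proj stand in for two separate linear projections
    of the same input.
    """
    return [swish(g) * u for g, u in zip(gate_proj, up_proj)]


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print([round(v, 3) for v in normed])
print(swiglu_gate([0.0, 2.0], [5.0, 1.0]))
```

Unlike LayerNorm, `rms_norm` never subtracts the mean, so the normalized vector has unit root mean square but not necessarily zero mean.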

🌊 Parallel Training Strategies

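  • Data Parallelism: Replicate the model and split each batch across GPUs
  • Model Parallelism: Split the model itself across devices; pipeline and tensor parallelism are its two main forms
  • Pipeline Parallelism: Place consecutive groups of layers on different devices and stream micro-batches through them
  • Tensor Parallelism: Split individual weight matrices and operations across devices
Result: Train models that do not fit in a single GPU's memory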
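Tensor parallelism is the least intuitive of these strategies, so here is a toy illustration in plain Python. It simulates two devices that each hold half of a weight matrix's columns and shows that concatenating their partial outputs reproduces the full matrix product. The helper names are hypothetical, and a real system would run the shards on separate GPUs and combine them with a collective operation such as all-gather.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each simulated device a contiguous slice of w's columns."""
    step = len(w[0]) // num_shards  # assumes the column count divides evenly
    return [[row[s * step:(s + 1) * step] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "device" holds one column shard and computes its slice of the output
partials = [matmul(x, shard) for shard in split_columns(w, num_shards=2)]
# Concatenating the slices recovers the full result
combined = [value for part in partials for value in part]

print(combined)                  # [11.0, 14.0, 17.0, 20.0]
print(combined == matmul(x, w))  # True
```

Each device stores only its shard of the weights, which is how this approach lets a single layer exceed one GPU's memory.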

⚔ Flash Attention Implementation Concept

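import math

import torch


def flash_attention(Q, K, V, block_size=128):
    """
    Memory-efficient attention using block-wise computation (single head,
    no mask). Reduces extra memory from O(N²) to O(N) by never
    materializing the full N×N score matrix.
    """
    N, d = Q.shape
    O = torch.zeros_like(Q)
    # Running softmax normalizer (l) and running row maximum (m)
    l = torch.zeros(N, dtype=Q.dtype, device=Q.device)
    m = torch.full((N,), float('-inf'), dtype=Q.dtype, device=Q.device)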
    
    # Process in blocks to fit in fast memory (SRAM)
    for j in range(0, N, block_size):
        # Load blocks of K and V
        K_j = K[j:j+block_size]
        V_j = V[j:j+block_size]
        
        for i in range(0, N, block_size):
            # Load block of Q
            Q_i = Q[i:i+block_size]
            
            # Compute attention scores for this block
            S_ij = Q_i @ K_j.T / math.sqrt(d)
            
            # Online softmax update (numerically stable)
            m_new = torch.maximum(m[i:i+block_size], S_ij.max(dim=1).values)
            l_new = torch.exp(m[i:i+block_size] - m_new) * l[i:i+block_size] + \
                    torch.exp(S_ij - m_new.unsqueeze(1)).sum(dim=1)
            
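            # Update output: undo the old normalization (multiply by l), rescale
            # to the new max, add this block's contribution, then renormalize
            O[i:i+block_size] = (O[i:i+block_size] * torch.exp(m[i:i+block_size] - m_new).unsqueeze(1) * 
                                 l[i:i+block_size].unsqueeze(1) + 
                                 torch.exp(S_ij - m_new.unsqueeze(1)) @ V_j) / l_new.unsqueeze(1)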
            
            # Update statistics
            m[i:i+block_size] = m_new
            l[i:i+block_size] = l_new
    
    return O

# This enables training much longer sequences with the same memory
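The core trick in the loop above is the online softmax: a running maximum and running normalizer let the softmax-weighted sum be built one block at a time. The plain-Python sketch below isolates that idea for a single query row with scalar values and checks it against a standard two-pass softmax. The function names are hypothetical, and this variant defers the division by the normalizer to the end rather than renormalizing after every block.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum_j softmax(scores)_j * values_j one block at a time.

    Keeps only a running max (m), a running normalizer (l), and an
    unnormalized accumulator (acc); the full softmax vector is never stored.
    """
    m, l, acc = float("-inf"), 0.0, 0.0
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        m_new = max(m, max(block_s))
        # Rescale previous statistics to the new max before adding this block
        scale = math.exp(m - m_new)
        exps = [math.exp(s - m_new) for s in block_s]
        l = l * scale + sum(exps)
        acc = acc * scale + sum(e * v for e, v in zip(exps, block_v))
        m = m_new
    return acc / l


def reference(scores, values):
    """Standard two-pass softmax followed by a weighted sum."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    return sum(e * v for e, v in zip(exps, values)) / sum(exps)


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
diff = abs(online_softmax_weighted_sum(scores, values) - reference(scores, values))
print(diff < 1e-12)  # True
```

Because the result matches the reference up to floating-point rounding, the block-wise computation is exact rather than an approximation.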

🏆 LLM Architecture Key Takeaways

Scaling Principles:

  • 🎯 Balance parameters, data, and compute (Chinchilla laws)
  • 📈 More layers → capacity for deeper reasoning chains
  • 🔍 More attention heads → more diverse attention patterns
  • 📏 Longer context → better long-form understanding

Optimization Priorities:

  • ⚡ Memory efficiency (Flash Attention, quantization)
  • 🔄 Training stability (better norms, activations)
  • 🚀 Inference speed (KV caching, model parallelism)
  • 💰 Cost efficiency (MoE, pruning, distillation)

💡 Future Trends: Mixture of Experts (MoE), multimodal architectures, and specialized hardware (TPUs, neuromorphic chips) are pushing the boundaries of what's possible with LLM architectures.