
3.1 Architecture Recap & Scaling Laws

🎯 Learning Objectives

Understand the transformer architecture that powers modern LLMs
Explore scaling laws and parameter count implications
Learn the computational requirements for training and inference
Analyze the relationship between model size and capabilities

    

🏗️ Transformer Architecture Deep Dive

LLM Architecture Stack

Output Layer
Vocabulary projection + Softmax

Transformer Layers (N×)
Typically 12-96+ layers

Multi-Head Attention
Self-attention mechanism

Feed Forward Network
2-layer MLP with activation

Token + Position Embeddings
Input representation

Note: Each transformer layer wraps its attention and feed-forward sublayers in residual connections and layer normalization for training stability; dropout is often added as a regularizer.
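The stack above also implies a rough parameter budget. The following is a minimal sketch, assuming a standard decoder block with four d_model × d_model attention projections and a feed-forward hidden width of 4 × d_model. The 4× expansion and the function name estimate_params are assumptions made for illustration, and biases, layer norms, and position embeddings are ignored:

```python
def estimate_params(num_layers, d_model, vocab_size, ffn_mult=4):
    """Rough parameter count for a decoder-only transformer (illustrative only)."""
    attention = 4 * d_model * d_model                 # Q, K, V and output projections
    feed_forward = 2 * ffn_mult * d_model * d_model   # up- and down-projection of the MLP
    embeddings = vocab_size * d_model                 # token embedding table
    return num_layers * (attention + feed_forward) + embeddings


# GPT-3 configuration: 96 layers, d_model 12,288, vocabulary 50,257
gpt3_params = estimate_params(96, 12288, 50257)
print(f"{gpt3_params / 1e9:.1f}B parameters")  # prints "174.6B parameters"
```

The estimate lands within about 1% of GPT-3's 175B, which shows that almost all parameters sit in the repeated transformer layers (about 12 × d_model² each) rather than in the embeddings.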

🔍 Multi-Head Attention Example

For a 12-layer model with 12 attention heads per layer:

Head 1 · Head 2 · Head 3 · ... · Head 12

Different heads tend to specialize in different linguistic patterns, such as syntax, semantics, and long-range dependencies.

🧮 Attention Mechanism (Simplified)

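import math

import torch
import torch.nn.functional as F

def multi_head_attention(query, key, value, num_heads):
    """
    Multi-head self-attention mechanism (simplified: the learned Q/K/V and
    output projections and the causal mask are omitted)
    Args:
        query, key, value: Input tensors [batch, seq_len, d_model]
        num_heads: Number of attention heads (must divide d_model evenly)
    """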
    batch_size, seq_len, d_model = query.shape
    head_dim = d_model // num_heads
    
    # Split into multiple heads
    q = query.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    k = key.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    v = value.view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)
    
    # Scaled dot-product attention
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, v)
    
    # Concatenate heads
    output = output.transpose(1, 2).contiguous().view(
        batch_size, seq_len, d_model
    )
    return output

# Example configuration at GPT-3 scale
# - 96 layers
# - 96 attention heads per layer (head_dim = 128)
# - 12,288-dimensional embeddings
# - ~175 billion parameters total

📈 Scaling Laws & Parameter Growth

🔢 Model Scale Evolution (2018-2024)

BERT-Base (2018): 110M
GPT-2 (2019): 1.5B
GPT-3 (2020): 175B
PaLM (2022): 540B
GPT-4 (2023): ~1.8T (unofficial estimate)

Model Component	GPT-3 (175B)	PaLM (540B)	GPT-4 (~1.8T, unofficial estimates)	Impact on Capabilities
Layers	96	118	~120	Deeper reasoning chains
Hidden Size	12,288	18,432	~20,480	Richer representations
Attention Heads	96	48	~128	More diverse attention patterns
Context Length	2,048	2,048	8,192-128K	Long-form understanding
Vocab Size	50,257	256,000	~100,000	Better tokenization efficiency

🧠 Chinchilla Scaling Laws (Hoffmann et al., 2022)

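Key Insight: For a fixed compute budget, the best performance comes from scaling parameters and training data in roughly equal proportion.

Parameter-Data Ratio: ~20 training tokens per parameter for compute-optimal training
Compute Scaling: C ≈ 6 × N × D FLOPs, so compute-optimal N and D each grow roughly as C^0.5 (N=params, D=training tokens, C=compute)
Implication: Many earlier large models were undertrained for their size, not too small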
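The ratio and the compute rule can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the common C ≈ 6·N·D approximation for training FLOPs and the ~20 tokens-per-parameter rule of thumb. The function names are hypothetical, and the 300B-token figure is GPT-3's reported training set size rather than a number from this section:

```python
def chinchilla_optimal_tokens(num_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule of thumb."""
    return num_params * tokens_per_param


def training_flops(num_params, num_tokens):
    """Approximate training compute: C ≈ 6 * N * D."""
    return 6 * num_params * num_tokens


gpt3_params = 175e9
gpt3_tokens = 300e9  # assumption: GPT-3's reported training token count

actual_ratio = gpt3_tokens / gpt3_params
optimal_tokens = chinchilla_optimal_tokens(gpt3_params)
flops = training_flops(gpt3_params, gpt3_tokens)

print(f"Actual tokens per parameter: {actual_ratio:.1f}")           # about 1.7
print(f"Compute-optimal tokens:      {optimal_tokens / 1e12:.1f}T")  # 3.5T
print(f"Training compute:            {flops:.2e} FLOPs")             # about 3.15e+23
```

At roughly 1.7 tokens per parameter, GPT-3 saw less than a tenth of the data the rule of thumb would suggest for its size, which is the sense in which such models are called undertrained.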

⚡ Computational Requirements & Hardware

🏋️ Training Requirements

GPT-3 (175B) Compute:	~3.14 × 10²³ FLOPs
Training Time:	~34 days on 1024 A100s (published estimate; the original run used V100s)
Energy Cost:	~1,287 MWh (estimate)
Estimated Cost:	~$4.6M (compute only)

Note: GPT-4 training estimated at 10-100x these costs
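The training-time figure follows from dividing total FLOPs by sustained cluster throughput. Below is a rough sketch under stated assumptions: 312 TFLOPS is the A100's peak dense FP16/BF16 tensor-core rate, while the 33% utilization is an assumed value chosen for illustration, not a measured one. training_days is a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400
A100_PEAK_FLOPS = 312e12  # peak dense FP16/BF16 tensor-core throughput per GPU


def training_days(total_flops, num_gpus, peak_flops_per_gpu, utilization):
    """Wall-clock days needed at a given fraction of peak hardware throughput."""
    sustained_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_sec / SECONDS_PER_DAY


# GPT-3-scale run: 3.14e23 FLOPs on 1024 A100s at an assumed 33% utilization
days = training_days(3.14e23, 1024, A100_PEAK_FLOPS, utilization=0.33)
print(f"{days:.1f} days")  # prints "34.5 days"
```

The estimate is very sensitive to utilization: halving it doubles the wall-clock time and, roughly, the compute bill.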

🚀 Inference Requirements

Memory (FP16):	~350GB of weights for a 175B model
GPU Requirements:	8×A100 (80GB) node typical (weights alone need at least 5)
Throughput:	~20-50 tokens/sec
Cost per 1M tokens:	$1-20 depending on provider

Optimization: Quantization can reduce memory by 2-4x
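The effect of precision on the hardware footprint can be worked out directly. This is a small sketch that counts weights only (KV cache, activations, and framework overhead need extra headroom) and uses decimal gigabytes. The helper names are hypothetical:

```python
import math

BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params, precision, gpu_memory_gb=80):
    """Smallest GPU count whose combined memory holds the weights (no headroom)."""
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


for precision in BYTES_PER_PARAM:
    gb = weight_memory_gb(175e9, precision)
    gpus = min_gpus_for_weights(175e9, precision)
    print(f"{precision}: {gb:.1f} GB of weights, at least {gpus} x 80GB GPUs")
# fp16: 350.0 GB of weights, at least 5 x 80GB GPUs
# int8: 175.0 GB of weights, at least 3 x 80GB GPUs
# int4: 87.5 GB of weights, at least 2 x 80GB GPUs
```

In practice deployments round up, for example to a full 8-GPU node at FP16, to leave room for the KV cache and to keep tensor-parallel splits even.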

💾 Memory Breakdown

Model Weights:	70-80%
KV Cache:	10-20%
Activations:	5-10%
Framework Overhead:	2-5%

KV Cache: Grows linearly with context length and batch size

🧮 Memory Calculator Example

def calculate_llm_memory(num_params, precision="fp16", batch_size=1, seq_len=2048,
                         hidden_size=12288, num_layers=96):
    """
    Estimate LLM inference memory in GiB (1 GiB = 1024**3 bytes).
    hidden_size and num_layers default to the GPT-3 (175B) configuration;
    pass the real values for other models, since neither scales linearly
    with parameter count.
    """
    # Model weights
    bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}
    model_memory_gb = (num_params * bytes_per_param[precision]) / (1024**3)
    
    # KV cache (approximate; assumes the cache uses the same precision as the weights)
    # Per layer: 2 (keys and values) * hidden_size * seq_len * batch_size * bytes per value
    kv_cache_gb = (2 * hidden_size * seq_len * batch_size * 
                   bytes_per_param[precision] * num_layers) / (1024**3)
    
    # Activation memory (rough estimate: 10% of the weights)
    activation_gb = model_memory_gb * 0.1
    
    total_memory_gb = model_memory_gb + kv_cache_gb + activation_gb
    
    return {
        "model_weights_gb": round(model_memory_gb, 1),
        "kv_cache_gb": round(kv_cache_gb, 1),
        "activations_gb": round(activation_gb, 1),
        "total_memory_gb": round(total_memory_gb, 1)
    }

# Example: GPT-3 scale model
result = calculate_llm_memory(175e9, "fp16", batch_size=1, seq_len=2048)
print(f"GPT-3 Memory Requirements: {result}")
# Output: {'model_weights_gb': 326.0, 'kv_cache_gb': 9.0, 'activations_gb': 32.6, 'total_memory_gb': 367.6}

🔬 Architectural Innovations & Optimizations

🎯 Attention Optimizations

Flash Attention: Memory-efficient attention computation
Multi-Query Attention (MQA): Shared key/value across heads
Grouped Query Attention (GQA): Balance between MHA and MQA
Sparse Attention: Reduce O(n²) complexity

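Impact: Roughly 2-8x speedup. Flash Attention computes exact attention, while MQA, GQA, and sparse attention trade a small amount of quality for speed and memory.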
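MQA and GQA save memory because the KV cache scales with the number of key/value heads rather than the number of query heads. Here is a minimal sketch using GPT-3-like dimensions (96 layers, 96 query heads of size 128). The 8-group GQA setting is a hypothetical choice for illustration, and kv_cache_gib is an assumed helper name:

```python
def kv_cache_gib(num_layers, num_kv_heads, head_dim, seq_len,
                 batch_size=1, bytes_per_value=2):
    """KV cache size in GiB: keys and values stored for every layer and token."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1024**3


# Same model shape, different numbers of key/value heads
variants = [("MHA (96 KV heads)", 96), ("GQA (8 KV heads)", 8), ("MQA (1 KV head)", 1)]
sizes = {}
for name, kv_heads in variants:
    sizes[name] = kv_cache_gib(num_layers=96, num_kv_heads=kv_heads,
                               head_dim=128, seq_len=2048)
    print(f"{name}: {sizes[name]:.3f} GiB")
# MHA (96 KV heads): 9.000 GiB
# GQA (8 KV heads): 0.750 GiB
# MQA (1 KV head): 0.094 GiB
```

Because the cache is linear in seq_len and batch_size, the savings from fewer KV heads carry over directly to longer contexts and larger serving batches.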

🔧 Activations, Normalization & Position Encoding

SwiGLU: Swish + Gated Linear Unit (PaLM, LLaMA)
GeGLU: GELU + Gated Linear Unit (T5 v1.1)
RMSNorm: Root Mean Square Layer Normalization
Rotary Position Encoding (RoPE): Better position understanding

Benefit: Better training dynamics and performance
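Two of these components are simple enough to write out by hand. The sketch below works on plain Python lists and is illustrative only: rms_norm follows the usual RMSNorm definition (divide by the root mean square, then apply a learned gain), and swiglu_gate shows only the elementwise gating step of SwiGLU, with the two linear projections that normally produce its inputs left out. Both function names are assumptions:

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale x by its root mean square, then apply a per-element gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def swish(v):
    """Swish (SiLU) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate_pre, value_pre):
    """Elementwise Swish(gate) * value, the gating step inside a SwiGLU layer."""
    return [swish(g) * v for g, v in zip(gate_pre, value_pre)]


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
gated = swiglu_gate([0.0, 2.0], [5.0, 1.0])
print([round(v, 4) for v in normed])  # [0.8485, 1.1314]
print([round(v, 4) for v in gated])   # [0.0, 1.7616]
```

Unlike standard layer normalization, RMSNorm does not subtract the mean, which removes one reduction per call while keeping the output scale fixed.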

🌊 Parallel Training Strategies

Data Parallelism: Distribute batches across GPUs
Model Parallelism: Split the model itself across devices (umbrella term for the pipeline and tensor strategies below)
Pipeline Parallelism: Layer-wise execution pipelining
Tensor Parallelism: Split individual operations

Result: Train models larger than single GPU memory
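How the strategies combine can be sketched with simple bookkeeping: tensor and pipeline parallelism shard the weights, while data parallelism replicates each shard group. This is an illustrative sketch that counts weights only (gradients, optimizer state, and activations are ignored). parallel_layout and the example degrees are hypothetical:

```python
def parallel_layout(num_params, bytes_per_param,
                    data_parallel, tensor_parallel, pipeline_parallel):
    """GPU count and per-GPU weight memory for a 3D-parallel configuration."""
    shards = tensor_parallel * pipeline_parallel  # ways the weights are split
    return {
        "total_gpus": data_parallel * shards,      # each replica uses `shards` GPUs
        "weight_gb_per_gpu": num_params * bytes_per_param / shards / 1e9,
    }


# 175B parameters in FP16, split 8-way tensor and 4-way pipeline, replicated 4 times
layout = parallel_layout(175e9, 2, data_parallel=4,
                         tensor_parallel=8, pipeline_parallel=4)
print(layout)  # {'total_gpus': 128, 'weight_gb_per_gpu': 10.9375}
```

Raising data_parallel adds throughput without reducing per-GPU weight memory, which is why models that exceed a single GPU must use tensor or pipeline splits first.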

⚡ Flash Attention Implementation Concept

import math

import torch

def flash_attention(Q, K, V, block_size=128):
    """
    Memory-efficient attention using block-wise computation
    Reduces memory complexity from O(N²) to O(N); the result is exact attention
    Conceptual single-head version: Q, K, V have shape [N, d]
    """
    N, d = Q.shape
    O = torch.zeros_like(Q)
    l = torch.zeros(N, dtype=Q.dtype, device=Q.device)                   # running softmax denominators
    m = torch.full((N,), float('-inf'), dtype=Q.dtype, device=Q.device)  # running row maxima
    
    # Process in blocks to fit in fast memory (SRAM)
    for j in range(0, N, block_size):
        # Load blocks of K and V
        K_j = K[j:j+block_size]
        V_j = V[j:j+block_size]
        
        for i in range(0, N, block_size):
            # Load block of Q and its running statistics
            Q_i = Q[i:i+block_size]
            m_i = m[i:i+block_size]
            l_i = l[i:i+block_size]
            
            # Compute attention scores for this block
            S_ij = Q_i @ K_j.T / math.sqrt(d)
            
            # Online softmax update (numerically stable)
            m_new = torch.maximum(m_i, S_ij.max(dim=1).values)
            P_ij = torch.exp(S_ij - m_new.unsqueeze(1))
            scale = torch.exp(m_i - m_new)  # rescales statistics from earlier blocks
            l_new = scale * l_i + P_ij.sum(dim=1)
            
            # Update output: undo the old normalization, add the new block, renormalize
            O[i:i+block_size] = (O[i:i+block_size] * (scale * l_i).unsqueeze(1) +
                                 P_ij @ V_j) / l_new.unsqueeze(1)
            
            # Update statistics
            m[i:i+block_size] = m_new
            l[i:i+block_size] = l_new
    
    return O

# This enables training much longer sequences with the same memory
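The key trick behind block-wise attention is the online softmax: a running maximum and running denominator let earlier partial results be rescaled as new blocks arrive. The following standard-library sketch demonstrates this for a single query row with scalar values, which is a simplification of the tensor version. The function names and sample numbers are illustrative assumptions:

```python
import math


def softmax(scores):
    """Reference softmax that sees all scores at once."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_weighted_sum(scores, values, block_size=2):
    """Compute sum(softmax(scores) * values) while seeing one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator relative to running_max
    accumulator = 0.0   # unnormalized weighted sum relative to running_max
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in block_scores]
        running_sum = running_sum * correction + sum(weights)
        accumulator = accumulator * correction + sum(
            w * v for w, v in zip(weights, block_values))
        running_max = new_max
    return accumulator / running_sum


scores = [1.0, 3.0, 0.5, 2.0, -1.0]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
reference = sum(p * v for p, v in zip(softmax(scores), values))
streamed = online_softmax_weighted_sum(scores, values)
print(abs(reference - streamed) < 1e-9)  # prints "True"
```

The streamed result matches the full softmax up to floating-point rounding even though no step ever holds more than block_size scores, which is why the full N × N score matrix never needs to be materialized.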

🏆 LLM Architecture Key Takeaways

Scaling Principles:

🎯 Balance parameters, data, and compute (Chinchilla laws)
📈 More layers → better reasoning capabilities
🔍 More attention heads → diverse pattern recognition
📏 Longer context → better long-form understanding

Optimization Priorities:

⚡ Memory efficiency (Flash Attention, quantization)
🔄 Training stability (better norms, activations)
🚀 Inference speed (KV caching, model parallelism)
💰 Cost efficiency (MoE, pruning, distillation)

💡 Future Trends: Mixture of Experts (MoE), multimodal architectures, and specialized hardware (TPUs, neuromorphic chips) are pushing the boundaries of what's possible with LLM architectures.

← Previous: When to Choose SLMs Next: Instruction Tuning & Alignment →
