1.2 Key Concepts: Tokens, Embeddings, Parameters
🎯 Learning Objectives
- Understand tokenization strategies and their impact on model performance
- Grasp the concept of embedding spaces and semantic relationships
- Appreciate parameter scaling laws and computational requirements
- Recognize the interplay between tokens, embeddings, and model capacity
🧩 Tokens: The Building Blocks of Language Models
What are Tokens?
Tokens are the fundamental units that language models process. They can represent words, subwords, characters, or even bytes depending on the tokenization strategy.
Tokenization Example: "Hello, world!"
Word-level tokenization: [Hello] [,] [world] [!] (4 tokens)
🔧 Common Tokenization Strategies
1. Byte Pair Encoding (BPE)
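Starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols into a new vocabulary entry, until the target vocabulary size is reached.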
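To make the merge loop concrete, here is a minimal sketch of BPE training on a toy word list. It is an illustration under simplifying assumptions (no end-of-word marker, no byte-level fallback, ties broken by first occurrence), not the implementation used by any particular tokenizer library. The function name `learn_bpe` and the sample words are hypothetical.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

# Hypothetical toy corpus: "hug" x10, "pug" x5, "hugs" x5.
merges, vocab = learn_bpe(["hug"] * 10 + ["pug"] * 5 + ["hugs"] * 5, num_merges=2)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

The pair (u, g) is merged first because it occurs 20 times across the corpus; (h, ug) follows with 15 occurrences.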
2. WordPiece (Used by BERT)
Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data.
3. SentencePiece (Used by T5, LLaMA)
Treats the input as a raw string and doesn't require pre-tokenization.
💡 Try This: Different Tokenization Results
Text: "The unhappiness was overwhelming"
Word-level: [The] [unhappiness] [was] [overwhelming] (4 tokens)
Subword (BPE): [The] [un] [happiness] [was] [over] [whelming] (6 tokens)
Character-level: [T][h][e][ ][u][n]... (32 tokens, counting spaces)
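The word-level and character-level counts can be reproduced with a few lines of standard-library Python. This is a rough sketch: the regular expression is just one simple way to split words from punctuation and is an assumption, not how production tokenizers work. The subword row is not reproduced here because it depends on a trained merge table.

```python
import re

def word_tokenize(text):
    # Runs of word characters, or single punctuation marks, become tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, is its own token.
    return list(text)

text = "The unhappiness was overwhelming"
print(len(word_tokenize(text)))   # 4
print(len(char_tokenize(text)))   # 32
print(word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```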
🎯 Tokenization Impact on Performance
- Vocabulary Size: Affects model parameters and inference speed
- Out-of-Vocabulary (OOV): Subword tokenization reduces OOV issues
- Sequence Length: Influences context window utilization
- Cross-lingual Performance: Affects multilingual model capabilities
🌐 Embeddings: Mapping Tokens to Vector Space
What are Embeddings?
Embeddings transform discrete tokens into continuous vector representations that capture semantic meaning and relationships.
[Figure: embedding visualization of a 384-dimensional space, showing 16 values. Each token becomes a high-dimensional vector.]
🧮 Types of Embeddings
1. Token Embeddings
Map each token ID to a learned vector representation.
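In code, a token embedding is just a table lookup: the token ID indexes a row of a matrix. The sketch below uses a tiny hypothetical vocabulary and random values in place of learned weights; in a real model the rows are trained along with the rest of the network.

```python
import random

random.seed(0)

# Hypothetical 3-token vocabulary and a 4-dimensional embedding space.
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4

# One row per token ID. Random values stand in for learned weights.
embedding_table = [
    [random.gauss(0, 0.02) for _ in range(d_model)] for _ in vocab
]

def embed(tokens):
    """Look up the embedding row for each token."""
    return [embedding_table[vocab[t]] for t in tokens]

vectors = embed(["cat", "the", "cat"])
print(len(vectors), len(vectors[0]))  # 3 4
```

Both occurrences of "cat" map to the same row: at this stage the vector depends only on the token ID, not on context.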
2. Positional Embeddings
Encode the position of tokens in the sequence:
- Learned: Trainable position embeddings (GPT style)
- Sinusoidal: Fixed mathematical encoding (Original Transformer)
- Relative: Encode relative distances between tokens
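As an illustration of the sinusoidal variant, the sketch below computes the fixed encoding from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine at geometrically spaced wavelengths. The function name is hypothetical.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed positional encoding for one position.

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    vec = []
    for i in range(0, d_model, 2):  # i is the even dimension index 2k
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]  # trim in case d_model is odd

print(sinusoidal_position(0, 4))  # [0.0, 1.0, 0.0, 1.0]
```

Because the encoding is a fixed function of the position, it adds no trainable parameters, unlike the learned variant.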
3. Segment/Type Embeddings
Distinguish between different parts of input (e.g., question vs. context in BERT).
🔍 Semantic Properties
Famous Word Analogy Example:
king - man + woman ≈ queen
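This shows that some semantic relationships appear as roughly consistent directions in embedding space, so they can be approximated with vector arithmetic. The effect was first reported for static word embeddings such as word2vec, and it holds only approximately.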
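The arithmetic can be tried on a toy example. The 2-dimensional vectors below are hand-picked for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not taken from any trained model, where the relationship is noisier and the space has hundreds of dimensions.

```python
import math

# Hand-picked toy vectors, NOT learned embeddings.
toy = {
    "king":  [0.9,  0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1,  0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(toy[a], toy[b], toy[c])]
    # Following common practice, the three query words are excluded.
    candidates = [w for w in toy if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, toy[w]))

best = analogy("king", "man", "woman")
print(best)  # queen
```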
📊 Embedding Dimensions by Model
- BERT-base: 768 dimensions
- GPT-3: 12,288 dimensions
- LLaMA-7B: 4,096 dimensions
- Claude-3: not publicly disclosed (~10,000+ dimensions is an unverified estimate)
⚖️ Parameters: The Scale of Modern AI
What are Parameters?
Parameters are the learnable weights and biases in neural networks that are adjusted during training to minimize loss.
🏗️ Parameter Breakdown in Transformers
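A common back-of-the-envelope estimate puts about 12·d² weights in each transformer layer (4·d² for the attention projections plus 8·d² for a feed-forward block with a 4x hidden size), plus a vocabulary-sized embedding table. The sketch below applies that approximation; it ignores biases, layer norms, and positional embeddings, and the function name is hypothetical. The GPT-3 configuration used (96 layers, 12,288 dimensions, 50,257-token vocabulary) is the published one.

```python
def transformer_param_estimate(n_layers, d_model, vocab_size):
    """Approximate parameter count of a decoder-only transformer."""
    attention = 4 * d_model * d_model      # Q, K, V and output projections
    feed_forward = 8 * d_model * d_model   # two matrices, hidden size 4 * d_model
    per_layer = attention + feed_forward
    embeddings = vocab_size * d_model      # token embedding table
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "total": embeddings + n_layers * per_layer,
    }

gpt3 = transformer_param_estimate(n_layers=96, d_model=12288, vocab_size=50257)
print(f"{gpt3['total'] / 1e9:.1f}B")  # 174.6B
```

The estimate lands close to GPT-3's reported 175B, and it shows that the transformer layers, not the embedding table, hold nearly all of the parameters at this scale.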
📈 The Parameter Scale Evolution
📏 Scaling Laws
Research has shown that model loss falls in a predictable way, approximately as a power law, as parameter count, training data, and compute increase.
🔢 Chinchilla Scaling (2022)
For optimal compute efficiency, training tokens should scale proportionally with parameters:
Rule of Thumb: ~20 tokens per parameter
7B parameter model → ~140B training tokens
70B parameter model → ~1.4T training tokens
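The rule of thumb is a single multiplication, sketched below. The constant 20 is the approximate ratio quoted above, not an exact law, and the function name is hypothetical.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla ratio quoted above

def chinchilla_tokens(n_params):
    """Rough compute-optimal number of training tokens for a model size."""
    return TOKENS_PER_PARAM * n_params

for n_params in (7_000_000_000, 70_000_000_000):
    print(f"{n_params / 1e9:.0f}B params -> {chinchilla_tokens(n_params) / 1e9:.0f}B tokens")
```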
⚡ Compute Requirements
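- Training: C ≈ 6ND FLOPs (C = compute, N = parameters, D = training tokens)
- Inference: roughly 2N FLOPs per generated token, so cost grows linearly with parameter count
- Memory (weights only): ~4 bytes per parameter in FP32 or ~2 bytes in FP16; training needs additional memory for gradients, optimizer state, and activations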
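Plugging numbers into these formulas gives a feel for the scale. The sketch below works through a 7B-parameter model trained on 140B tokens; the helper names are hypothetical, and the results are order-of-magnitude estimates for the weights alone.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute, C = 6 * N * D."""
    return 6 * n_params * n_tokens

def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to store the weights, in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9

n_params = 7 * 10**9
n_tokens = 140 * 10**9

print(f"{training_flops(n_params, n_tokens):.2e} FLOPs")  # 5.88e+21 FLOPs
print(weight_memory_gb(n_params, 4), "GB in FP32")        # 28.0 GB in FP32
print(weight_memory_gb(n_params, 2), "GB in FP16")        # 14.0 GB in FP16
```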
🎯 Parameter Efficiency Techniques
LoRA (Low-Rank Adaptation)
Freeze the pretrained weights and fine-tune only small low-rank adapter matrices added alongside them
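To see why this saves so much, compare trainable parameter counts for a single weight matrix. LoRA replaces the update to a d_out × d_in matrix with the product of two thin matrices of rank r, so it trains r · (d_in + d_out) values instead of d_in · d_out. The layer size and rank below are illustrative assumptions.

```python
def full_finetune_params(d_in, d_out):
    """Trainable values when the whole weight matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Trainable values in the two low-rank adapter matrices (r x d_in and d_out x r)."""
    return rank * (d_in + d_out)

d_in = d_out = 4096   # illustrative layer size
rank = 8              # illustrative LoRA rank

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```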
Quantization
Reduce weight precision from FP32 or FP16 to INT8/INT4, trading a small accuracy loss for lower memory use and faster inference
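One simple scheme is symmetric per-tensor INT8 quantization: pick a scale so the largest weight maps to 127, round every weight to the nearest integer step, and multiply back by the scale when the value is needed. The sketch below illustrates that idea only; real libraries typically use per-channel or per-group scales and more careful calibration.

```python
def quantize_int8(weights):
    """Symmetric quantization of a list of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error <= scale / 2 + 1e-9)
```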
Pruning
Remove less important weights to reduce model size
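A common proxy for "less important" is small absolute value. The sketch below shows unstructured magnitude pruning on a flat list of weights; treating magnitude as importance is an assumption of this example, and other criteria exist.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.9, -0.05, 0.3, 0.01], sparsity=0.5)
print(pruned)  # [0.9, 0.0, 0.3, 0.0]
```

The zeros only save memory or compute if the storage format or hardware can exploit sparsity.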
Distillation
Train smaller models to mimic larger ones
🔗 How Tokens, Embeddings, and Parameters Work Together
- Input Processing: Text → Tokens (via tokenizer)
- Embedding Lookup: Token IDs → Dense vectors (via embedding table)
- Transformation: Embeddings → Contextualized representations (via transformer layers)
- Output Generation: Final representations → Predictions (via output head)
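The four stages can be wired together in a toy pipeline. Everything here is a stand-in chosen for brevity: the vocabulary is hypothetical, the weights are random, and `contextualize` replaces real transformer layers with a simple causal running average. The sketch shows the data flow and shapes only, not realistic model behavior.

```python
import random

random.seed(0)

vocab = ["<unk>", "the", "cat", "sat"]          # hypothetical vocabulary
token_to_id = {t: i for i, t in enumerate(vocab)}
d_model = 8
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]

def tokenize(text):
    """Text -> token IDs (whitespace split; unknown words map to ID 0)."""
    return [token_to_id.get(w, 0) for w in text.lower().split()]

def embed(ids):
    """Token IDs -> dense vectors via table lookup."""
    return [embedding_table[i] for i in ids]

def contextualize(vectors):
    """Stand-in for transformer layers: average each position with all earlier ones."""
    out = []
    for t in range(len(vectors)):
        prefix = vectors[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def output_head(vector):
    """Final representation -> one score per vocabulary entry (tied embeddings)."""
    return [sum(a * b for a, b in zip(vector, row)) for row in embedding_table]

ids = tokenize("The cat sat")
hidden = contextualize(embed(ids))
logits = output_head(hidden[-1])
print(ids, len(hidden), len(logits))  # [1, 2, 3] 3 4
```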
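Key Insight: The three are tightly coupled. The tokenizer fixes the vocabulary size (and therefore the size of the embedding table) and how many tokens a given text becomes, while the embedding dimension sets the width of every transformer layer and so drives most of the parameter count.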