
1.2 Key Concepts: Tokens, Embeddings, Parameters

🎯 Learning Objectives
        Understand tokenization strategies and their impact on model performance
Grasp the concept of embedding spaces and semantic relationships
Appreciate parameter scaling laws and computational requirements
Recognize the interplay between tokens, embeddings, and model capacity

    

🧩 Tokens: The Building Blocks of Language Models

What are Tokens?

Tokens are the fundamental units that language models process. They can represent words, subwords, characters, or even bytes depending on the tokenization strategy.

Tokenization Example: "Hello, world!"

[Hello] [,] [world] [!]

Word-level tokenization (4 tokens)

🔧 Common Tokenization Strategies

1. Byte Pair Encoding (BPE)

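Starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry:

# BPE Example (illustrative; real merges depend on the training corpus)
Original: "tokenization"
Start: "t" "o" "k" "e" "n" "i" "z" "a" "t" "i" "o" "n"
After some merges: "to" "k" "e" "n" "i" "z" "at" "i" "on"
After more merges: "tok" "en" "iz" "ation"
Final tokens: ["tok", "en", "iz", "ation"]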
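
The merge loop can be sketched in a few lines of Python. This is a minimal illustration on a toy word list, not a production tokenizer: the function name `learn_bpe_merges` and the corpus are assumptions made for this example, and real implementations add byte-level handling, end-of-word markers, and faster pair counting.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy corpus)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, list(corpus)

merges, segmented = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)      # learned merge rules, in order
print(segmented)   # each distinct word as a tuple of subword symbols
```

Each iteration adds one entry to the vocabulary, so the number of merges directly controls vocabulary size.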
        

2. WordPiece (Used by BERT)

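Similar to BPE, but chooses each merge to maximize the likelihood of the training data rather than picking the most frequent pair. At tokenization time, words are split greedily, longest match first.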
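
At tokenization time, BERT-style WordPiece is usually described as a greedy longest-match-first split of each word against a fixed vocabulary, with `##` marking continuation pieces. The sketch below assumes a tiny hand-written vocabulary (`toy_vocab` is hypothetical) and covers segmentation only, not vocabulary training.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabulary for illustration only.
toy_vocab = {"un", "##happi", "##ness", "##happiness", "over", "##whelm", "##ing"}

print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("overwhelming", toy_vocab))
```

Because the match is greedy, `##happiness` wins over `##happi` + `##ness` whenever the longer piece is in the vocabulary.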

3. SentencePiece (Used by T5, LLaMA)

Treats the input as a raw character stream, whitespace included, so it doesn't require language-specific pre-tokenization.

💡 Try This: Different Tokenization Results

Text: "The unhappiness was overwhelming"

Word-level: [The] [unhappiness] [was] [overwhelming] (4 tokens)

Subword (BPE, illustrative): [The] [un] [happiness] [was] [over] [whelming] (6 tokens)

Character-level: [T][h][e][ ][u][n]... (32 tokens)
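
The word-level and character-level counts above can be reproduced with a short sketch. The regular expression is one simple choice for word-level splitting (an assumption for this example); production tokenizers handle many more cases.

```python
import re

def word_tokenize(text):
    # Words and individual punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, is a token.
    return list(text)

text = "The unhappiness was overwhelming"
word_tokens = word_tokenize(text)
char_tokens = char_tokenize(text)

print(len(word_tokens), word_tokens)
print(len(char_tokens))
print(word_tokenize("Hello, world!"))
```

The same text is 4 tokens at word level and 32 at character level, which shows how granularity trades vocabulary size against sequence length.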

🎯 Tokenization Impact on Performance

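Vocabulary Size: Sets the size of the embedding table and output layer (vocab_size × embedding_dim weights each, unless the two are tied), which affects parameter count and inference speed
Out-of-Vocabulary (OOV): Subword tokenization reduces OOV issues by splitting unseen words into known pieces
Sequence Length: Finer-grained tokens produce longer sequences, which consume more of the context window
Cross-lingual Performance: Languages underrepresented in the tokenizer's training data are split into more tokens, which affects multilingual model capabilities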

🌐 Embeddings: Mapping Tokens to Vector Space

What are Embeddings?

Embeddings transform discrete tokens into continuous vector representations that capture semantic meaning and relationships.

Embedding Visualization (384-dimensional space, first 15 values shown)

[0.23, -0.45, 0.78, 0.12, -0.67, 0.34, 0.89, -0.23, 0.56, -0.12, 0.45, 0.67, -0.89, 0.01, 0.78, ...]

Each token becomes a high-dimensional vector

🧮 Types of Embeddings

1. Token Embeddings

Map each token ID to a learned vector representation.

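# Token embedding lookup (a random table stands in for learned weights)
import numpy as np
embedding_table = np.random.randn(50000, 768)  # Shape: [vocab_size, embedding_dim]
token_id = 1234  # illustrative ID for "hello"
embedding = embedding_table[token_id]  # Shape: [embedding_dim]
# Result: 768-dimensional vector for "hello"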
        

2. Positional Embeddings

Encode the position of tokens in the sequence:

Learned: Trainable position embeddings (GPT style)
Sinusoidal: Fixed mathematical encoding (Original Transformer)
Relative: Encode relative distances between tokens
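
As an illustration of the fixed sinusoidal variant, the sketch below computes the encoding for a single position using the formulation from the original Transformer: sine on even indices, cosine on odd indices, with wavelengths growing geometrically up to a base of 10000. The function name is an assumption for this example.

```python
import math

def sinusoidal_position(pos, dim):
    """Sinusoidal encoding for one position (dim must be even)."""
    encoding = []
    for i in range(0, dim, 2):
        # i is the even index 2k, so the exponent is 2k / dim.
        angle = pos / (10000 ** (i / dim))
        encoding.append(math.sin(angle))  # even index
        encoding.append(math.cos(angle))  # odd index
    return encoding

print(sinusoidal_position(0, 4))
print(sinusoidal_position(1, 4))
```

Since the values depend only on position and dimension, nothing here is learned, and the table can be computed for any sequence length.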

3. Segment/Type Embeddings

Distinguish between different parts of input (e.g., question vs. context in BERT).
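
In BERT-style models, the input to the first layer is the element-wise sum of the token, position, and segment embeddings. The sketch below shows how segment IDs might be assigned for a question/context pair and how the three vectors combine; the helper names and toy vectors are assumptions for illustration.

```python
def segment_ids(question_tokens, context_tokens):
    """Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP]."""
    first = ["[CLS]"] + question_tokens + ["[SEP]"]
    second = context_tokens + ["[SEP]"]
    return first + second, [0] * len(first) + [1] * len(second)

def combine(token_vec, position_vec, segment_vec):
    # BERT-style input: element-wise sum of the three embeddings.
    return [t + p + s for t, p, s in zip(token_vec, position_vec, segment_vec)]

tokens, ids = segment_ids(["who", "?"], ["bob"])
print(tokens)
print(ids)
print(combine([1.0, 2.0], [0.5, 0.5], [0.0, 1.0]))
```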

🔍 Semantic Properties

Famous Word Analogy Example:

king - man + woman ≈ queen

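This suggests that embeddings capture some semantic relationships as roughly constant vector offsets. The result was popularized by word2vec-style static embeddings and holds only approximately.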
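
The analogy can be reproduced on hand-made vectors. The 2-D values below are invented for illustration (one axis loosely standing for "royalty", the other for "gender") and are not taken from any trained model. The query words are excluded from the search, as is common practice in analogy evaluations.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)
```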

📊 Embedding Dimensions by Model

BERT-base: 768 dimensions
GPT-3: 12,288 dimensions
LLaMA-7B: 4,096 dimensions
Claude-3: not publicly disclosed

⚖️ Parameters: The Scale of Modern AI

What are Parameters?

Parameters are the learnable weights and biases in neural networks that are adjusted during training to minimize loss.

🏗️ Parameter Breakdown in Transformers

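# Simplified parameter calculation for a BERT-base-sized transformer
vocab_size = 30522  # BERT-base WordPiece vocabulary
embedding_dim = 768
num_layers = 12
num_heads = 12  # heads split embedding_dim, so they do not change the count

# Main parameter sources (weight matrices only; biases and LayerNorm omitted):
embedding_params = vocab_size * embedding_dim  # ~23M
attention_params = embedding_dim * embedding_dim * 4 * num_layers  # ~28M (Q, K, V, output projections)
feedforward_params = embedding_dim * 4 * embedding_dim * 2 * num_layers  # ~57M

total_params = embedding_params + attention_params + feedforward_params  # ~108M
# BERT-base ≈ 110M parameters once position embeddings, biases, and LayerNorm are included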
        

📈 The Parameter Scale Evolution

BERT-base (2018): 110M params
GPT-2 (2019): 1.5B params
GPT-3 (2020): 175B params
PaLM (2022): 540B params
GPT-4 (2023): ~1.7T params*

* Unofficial estimate; OpenAI has not disclosed GPT-4's parameter count.

📏 Scaling Laws

Research has revealed predictable relationships between model performance and scale:

🔢 Chinchilla Scaling (2022)

For optimal compute efficiency, training tokens should scale proportionally with parameters:

Rule of Thumb: ~20 tokens per parameter

7B parameter model → ~140B training tokens

70B parameter model → ~1.4T training tokens
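
The rule of thumb is simple enough to express as a one-line function. The constant and function name below are assumptions for this sketch; the 20:1 ratio is an approximation, not an exact law.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio from the text above

def compute_optimal_tokens(num_params):
    """Approximate compute-optimal training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * num_params

for params in (7_000_000_000, 70_000_000_000):
    print(f"{params:,} params -> {compute_optimal_tokens(params):,} tokens")
```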

⚡ Compute Requirements

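Training: C ≈ 6ND FLOPs (C=compute, N=parameters, D=training tokens)
Inference: Roughly 2N FLOPs per generated token, so cost is linear in parameter count
Memory: ~4 bytes per parameter (FP32) or ~2 bytes (FP16) for the weights alone; training also needs memory for gradients, optimizer state, and activations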
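
As a worked example, the estimates above can be applied to a 7B-parameter model trained on 140B tokens. The function names are assumptions for this sketch, and the memory figure covers model weights only.

```python
def training_flops(num_params, num_tokens):
    # C ≈ 6ND
    return 6 * num_params * num_tokens

def weight_memory_gb(num_params, bytes_per_param):
    # Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes).
    return num_params * bytes_per_param / 1e9

n, d = 7_000_000_000, 140_000_000_000
flops = training_flops(n, d)
fp32_gb = weight_memory_gb(n, 4)
fp16_gb = weight_memory_gb(n, 2)

print(f"Training compute: {flops:.2e} FLOPs")
print(f"Weights in FP32: {fp32_gb} GB, in FP16: {fp16_gb} GB")
```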

🎯 Parameter Efficiency Techniques

LoRA (Low-Rank Adaptation)

Freeze the pretrained weights and fine-tune only small low-rank adapter matrices instead of all parameters
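
A minimal sketch of the idea, using plain Python lists: the frozen weight matrix W is left untouched and a low-rank product B·A is added to its output. The helper names and toy matrices are assumptions for illustration. In the usual setup B starts at zero, so the adapted layer initially behaves exactly like the original.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, scale=1.0):
    """y = W x + scale * B (A x); W is frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def lora_trainable_params(d_in, d_out, rank):
    # A is rank x d_in, B is d_out x rank.
    return rank * (d_in + d_out)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection (1 x 2)
B_zero = [[0.0], [0.0]]        # zero-initialized up-projection (2 x 1)
B_trained = [[1.0], [2.0]]     # hypothetical values after training

print(lora_forward([2.0, 3.0], W, A, B_zero))
print(lora_forward([2.0, 3.0], W, A, B_trained))
print(lora_trainable_params(4096, 4096, rank=8), "vs", 4096 * 4096)
```

For a 4096×4096 layer at rank 8, the adapter has 65,536 trainable weights compared with about 16.8M in the full matrix.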

Quantization

Reduce precision from FP32 to INT8/INT4, typically with only a small loss in accuracy
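
One simple scheme is symmetric per-tensor INT8 quantization, sketched below. This is only one of several approaches, chosen here for brevity; the function names are assumptions, and real libraries typically quantize per channel or per group and handle outliers separately.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Map the integers back to approximate floating-point values.
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print(restored)
```

Each restored value differs from the original by at most half a quantization step, which is why accuracy usually degrades only slightly at INT8.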

Pruning

Remove less important weights to reduce model size
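
A common baseline treats "less important" as "smallest absolute value". The sketch below applies that magnitude criterion to a flat list of weights; the function name is an assumption, and other importance measures exist.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # Indices of the k weights with the smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.9, -0.05, 0.3, 0.01], sparsity=0.5)
print(pruned)
```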

Distillation

Train smaller models to mimic larger ones
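
One widely used training signal for this is the KL divergence between temperature-softened teacher and student output distributions. The sketch below computes that loss for a single example; the function names and logits are assumptions for illustration. Many implementations also multiply the loss by the squared temperature and mix it with the ordinary label loss.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))          # identical outputs
print(distillation_loss([0.1, 1.0, 2.0], teacher))  # mismatched outputs
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.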

🔗 How Tokens, Embeddings, and Parameters Work Together

Input Processing: Text → Tokens (via tokenizer)
Embedding Lookup: Token IDs → Dense vectors (via embedding table)
Transformation: Embeddings → Contextualized representations (via transformer layers)
Output Generation: Final representations → Predictions (via output head)
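
The four stages can be traced end to end on a toy example. Everything below is a deliberately simplified stand-in: the vocabulary and embedding values are invented, the "transformer" is replaced by a mean over the sequence, and the output head reuses the embedding table (tied weights). It shows the data flow, not how a real model computes.

```python
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
# Toy embedding table: one 2-D row per token ID (hypothetical values).
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def tokenize(text):
    # Stage 1: text -> tokens (whitespace split as a placeholder tokenizer).
    return text.lower().split()

def encode(tokens):
    return [vocab.get(t, vocab["[UNK]"]) for t in tokens]

def embed(token_ids):
    # Stage 2: token IDs -> dense vectors.
    return [embedding_table[i] for i in token_ids]

def contextualize(vectors):
    # Stage 3 stand-in: every position receives the sequence mean.
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return [mean for _ in vectors]

def output_scores(vector):
    # Stage 4: one score per vocabulary entry (dot product with each row).
    return [sum(a * b for a, b in zip(vector, row)) for row in embedding_table]

ids = encode(tokenize("Hello world"))
hidden = contextualize(embed(ids))
scores = output_scores(hidden[-1])
print(ids, scores)
```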

Key Insight: Tokenization determines how many tokens a text becomes and which embedding rows are trained, so it shapes how effectively the model's parameters can capture and manipulate semantic information.
