1.2 Key Concepts: Tokens, Embeddings, Parameters

🎯 Learning Objectives

  • Understand tokenization strategies and their impact on model performance
  • Grasp the concept of embedding spaces and semantic relationships
  • Appreciate parameter scaling laws and computational requirements
  • Recognize the interplay between tokens, embeddings, and model capacity

🧩 Tokens: The Building Blocks of Language Models

What are Tokens?

Tokens are the fundamental units that language models process. They can represent words, subwords, characters, or even bytes depending on the tokenization strategy.

Tokenization Example: "Hello, world!"

Hello , world !

Word-level tokenization (4 tokens)
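
The split shown above can be reproduced with a small regular expression. This is only a sketch of word-level tokenization: the function name `word_tokenize` and the pattern are illustrative choices, not the tokenizer of any particular model.

```python
import re

def word_tokenize(text: str) -> list[str]:
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Hello, world!")
print(tokens)       # ['Hello', ',', 'world', '!']
print(len(tokens))  # 4
```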

🔧 Common Tokenization Strategies

1. Byte Pair Encoding (BPE)

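Starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair of symbols into a new symbol:

    # BPE example (illustrative; the actual merges depend on the training corpus)
    Start:  t o k e n i z a t i o n        (individual characters)
    Merges: "t"+"o" → "to", "to"+"k" → "tok", "e"+"n" → "en", ...
    Final tokens: ["tok", "en", "iz", "ation"]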
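
To make the merge loop concrete, here is a minimal sketch of BPE training on a tiny corpus. The corpus `toy_corpus`, its word frequencies, and the helper names are hypothetical; production tokenizers add byte-level handling, pre-tokenization, and far larger merge tables.

```python
from collections import Counter

def get_pair_counts(words: dict[tuple[str, ...], int]) -> Counter:
    # Count every adjacent symbol pair, weighted by word frequency
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    # Replace each occurrence of `pair` with the concatenated symbol
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus: dict[str, int], num_merges: int):
    words = {tuple(word): freq for word, freq in corpus.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Hypothetical toy corpus: word -> frequency
toy_corpus = {"token": 5, "tokens": 3, "tokenization": 2}
merges, segmented = train_bpe(toy_corpus, num_merges=4)
print(merges)
print(segmented)
```

After four merges the frequent stem "token" has become a single symbol, while the rarer suffix "ization" is still split into characters. That is the behavior BPE is designed for: common strings get short encodings, rare ones fall back to smaller pieces.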

2. WordPiece (Used by BERT)

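Similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data. Continuation pieces inside a word are marked with a "##" prefix.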
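
At tokenization time, BERT's WordPiece implementation segments each word by greedy longest-match-first lookup against its vocabulary, marking continuation pieces with "##". The sketch below illustrates that lookup only; `toy_vocab` is a hypothetical vocabulary, not BERT's.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece inside a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary
toy_vocab = {"un", "##happi", "##ness", "token", "##ization", "##s"}
print(wordpiece_tokenize("unhappiness", toy_vocab))   # ['un', '##happi', '##ness']
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
```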

3. SentencePiece (Used by T5, LLaMA)

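Treats the input as a raw character stream and encodes whitespace as an ordinary symbol ("▁"), so it needs no language-specific pre-tokenization. It supports both BPE and unigram language-model tokenization.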

💡 Try This: Different Tokenization Results

Text: "The unhappiness was overwhelming"

Word-level: [The] [unhappiness] [was] [overwhelming] (4 tokens)

Subword (BPE): [The] [un] [happiness] [was] [over] [whelming] (6 tokens)

Character-level: [T][h][e][ ][u][n]... (32 tokens, counting spaces)
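
The word-level and character-level counts are easy to check directly. Subword counts are left out of this quick sketch because they depend on a trained vocabulary.

```python
text = "The unhappiness was overwhelming"

word_tokens = text.split()   # split on whitespace
char_tokens = list(text)     # every character, spaces included

print(len(word_tokens))  # 4
print(len(char_tokens))  # 32
```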

🎯 Tokenization Impact on Performance

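  • Vocabulary Size: Larger vocabularies add embedding and output-layer parameters but produce shorter token sequences
  • Out-of-Vocabulary (OOV): Subword tokenization reduces OOV issues by falling back to smaller pieces for rare words
  • Sequence Length: More tokens per text means less content fits in a fixed context window
  • Cross-lingual Performance: Languages that are poorly covered by the vocabulary are split into more tokens, which affects multilingual capabilities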

🌐 Embeddings: Mapping Tokens to Vector Space

What are Embeddings?

Embeddings transform discrete tokens into continuous vector representations that capture semantic meaning and relationships.

Embedding Visualization (384-dimensional space, first 15 values shown)

[0.23, -0.45, 0.78, 0.12, -0.67, 0.34, 0.89, -0.23, 0.56, -0.12, 0.45, 0.67, -0.89, 0.01, 0.78, ...]

Each token becomes a high-dimensional vector

🧮 Types of Embeddings

1. Token Embeddings

Map each token ID to a learned vector representation.

    # Token embedding lookup
    # embedding_table has shape [vocab_size, embedding_dim]
    token_id = 1234                          # e.g., the ID assigned to "hello"
    embedding = embedding_table[token_id]    # shape: [embedding_dim]
    # Result: a 768-dimensional vector for "hello" (when embedding_dim = 768)

2. Positional Embeddings

Encode the position of tokens in the sequence:

  • Learned: Trainable position embeddings (GPT style)
  • Sinusoidal: Fixed mathematical encoding (Original Transformer)
  • Relative: Encode relative distances between tokens
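
As an illustration of the sinusoidal variant, the sketch below builds the encoding table from the original Transformer formulation, PE[pos, 2k] = sin(pos / 10000^(2k/dim)) and PE[pos, 2k+1] = cos(pos / 10000^(2k/dim)). The function name and the small sizes are illustrative, and `dim` is assumed to be even.

```python
import math

def sinusoidal_positions(seq_len: int, dim: int) -> list[list[float]]:
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k/dim)
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=8, dim=16)
print(len(pe), len(pe[0]))  # 8 16
print(pe[0][:4])            # [0.0, 1.0, 0.0, 1.0]
```

Because the table is a fixed function of position, it adds no trainable parameters, unlike learned position embeddings.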

3. Segment/Type Embeddings

Distinguish between different parts of input (e.g., question vs. context in BERT).

🔍 Semantic Properties

Famous Word Analogy Example:

king - man + woman ≈ queen

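This classic result, first reported for word2vec-style embeddings, suggests that some semantic relationships correspond to approximately linear directions in the embedding space, so they can be recovered with vector arithmetic.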
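
The arithmetic can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (one axis loosely standing for "royalty", the other for "gender"); real embeddings are learned, have hundreds of dimensions, and satisfy the analogy only approximately.

```python
import math

# Hypothetical toy embeddings: [royalty, gender]
toy = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman
target = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]

# Search the remaining words for the nearest neighbor by cosine similarity
candidates = {w: v for w, v in toy.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```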

📊 Embedding Dimensions by Model

  • BERT-base: 768 dimensions
  • GPT-3: 12,288 dimensions
  • LLaMA-7B: 4,096 dimensions
  • Claude-3: not publicly disclosed (the ~10,000+ figure is an unverified estimate)

⚖️ Parameters: The Scale of Modern AI

What are Parameters?

Parameters are the learnable weights and biases in neural networks that are adjusted during training to minimize loss.

🏗️ Parameter Breakdown in Transformers

    # Simplified parameter count for a BERT-base-sized transformer
    vocab_size = 30522      # BERT-base WordPiece vocabulary
    embedding_dim = 768
    num_layers = 12
    num_heads = 12          # heads split embedding_dim, so they do not change the count

    # Main parameter sources (biases, LayerNorm, and position embeddings omitted):
    embedding_params = vocab_size * embedding_dim                            # ~23M
    attention_params = embedding_dim * embedding_dim * 4 * num_layers        # ~28M (Q, K, V, output)
    feedforward_params = embedding_dim * 4 * embedding_dim * 2 * num_layers  # ~57M
    total_params = embedding_params + attention_params + feedforward_params  # ~108M
    # BERT-base ≈ 110M parameters once the omitted terms are included

📈 The Parameter Scale Evolution

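  • BERT (2018): 110M params (base model)
  • GPT-2 (2019): 1.5B params
  • GPT-3 (2020): 175B params
  • PaLM (2022): 540B params
  • GPT-4 (2023): ~1.7T params*

  * Unofficial estimate; OpenAI has not disclosed GPT-4's parameter count.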

📏 Scaling Laws

Research has revealed predictable relationships between model performance and scale:

🔢 Chinchilla Scaling (2022)

For optimal compute efficiency, training tokens should scale proportionally with parameters:

Rule of Thumb: ~20 tokens per parameter

7B parameter model → ~140B training tokens

70B parameter model → ~1.4T training tokens
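
The rule of thumb is a single multiplication. The helper below is a sketch that treats 20 tokens per parameter as a constant, which is an approximation rather than an exact law.

```python
TOKENS_PER_PARAM = 20  # rule of thumb from the text, not an exact constant

def compute_optimal_tokens(num_params: float) -> float:
    return TOKENS_PER_PARAM * num_params

for n in (7e9, 70e9):
    tokens = compute_optimal_tokens(n)
    print(f"{n / 1e9:.0f}B params -> {tokens / 1e9:.0f}B tokens")
# 7B params -> 140B tokens
# 70B params -> 1400B tokens
```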

⚡ Compute Requirements

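  • Training: C ≈ 6ND FLOPs (C = compute, N = parameters, D = training tokens)
  • Inference: roughly 2N FLOPs per generated token, so cost grows linearly with parameter count
  • Memory: ~4 bytes per parameter in FP32 or ~2 bytes in FP16 for the weights alone; training also needs memory for gradients and optimizer state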
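
A back-of-the-envelope sketch of these estimates for a 7B-parameter model trained on 140B tokens. The function names are illustrative, and the memory figure covers weights only.

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    # C ≈ 6 * N * D
    return 6 * num_params * num_tokens

def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    # Weights only; activations, gradients, and optimizer state are extra
    return num_params * bytes_per_param / 1e9

n_params, n_tokens = 7e9, 140e9
print(f"{training_flops(n_params, n_tokens):.2e} FLOPs")  # 5.88e+21 FLOPs
print(weight_memory_gb(n_params, 4), "GB in FP32")        # 28.0 GB in FP32
print(weight_memory_gb(n_params, 2), "GB in FP16")        # 14.0 GB in FP16
```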

🎯 Parameter Efficiency Techniques

LoRA (Low-Rank Adaptation)

Freeze the pretrained weights and fine-tune only small low-rank adapter matrices added alongside them.
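
A minimal sketch of the idea for a single linear layer: the frozen weight W is augmented with a low-rank product B·A scaled by alpha/rank, and only A and B would be trained. The sizes, the `alpha` value, and the helper names are hypothetical toy choices.

```python
import random

d_in, d_out, rank, alpha = 8, 8, 2, 4  # hypothetical toy sizes
random.seed(0)

# Frozen pretrained weight: [d_out, d_in]
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# Trainable adapters: A is [rank, d_in]; B is [d_out, rank] and starts at zero,
# so the adapted layer initially behaves exactly like the pretrained one
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(x):
    # W x + (alpha / rank) * B (A x)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

x = [1.0] * d_in
y = lora_forward(x)

print(4096 * 4096)                       # 16777216 weights in a full 4096x4096 layer
print(lora_param_count(4096, 4096, 8))   # 65536 trainable weights at rank 8
```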

Quantization

Reduce numerical precision from FP32 or FP16 to INT8/INT4 while preserving most of the model's accuracy.
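
One simple scheme is symmetric per-tensor INT8 quantization, sketched below on a handful of made-up weights. Practical methods typically use per-channel or per-group scales and calibration data; this shows only the core round-and-rescale step.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # map the largest magnitude to ±127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(max_error <= scale / 2 + 1e-12)  # True: rounding error is at most half a step
```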

Pruning

Remove less important weights to reduce model size
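
A common proxy for "less important" is small absolute value. The sketch below applies unstructured magnitude pruning to a made-up weight list; the function name and the 50% sparsity level are illustrative.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_to_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # hypothetical weights
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0]
```

Zeroed weights only save memory or compute when they are stored and executed in a sparse format, which this sketch does not attempt.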

Distillation

Train smaller models to mimic larger ones
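
"Mimic" is usually formalized as matching the teacher's output distribution. The sketch below computes a temperature-softened KL divergence between teacher and student logits, one common form of the distillation loss. The logits and the temperature of 2.0 are made up for illustration.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]  # hypothetical logits over three classes
student = [2.5, 1.5, 0.2]
print(distillation_loss(student, teacher))  # positive: distributions differ
print(distillation_loss(teacher, teacher))  # 0.0: identical distributions
```

In practice this term is typically combined with the ordinary cross-entropy loss on the ground-truth labels.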

🔗 How Tokens, Embeddings, and Parameters Work Together

  1. Input Processing: Text → Tokens (via tokenizer)
  2. Embedding Lookup: Token IDs → Dense vectors (via embedding table)
  3. Transformation: Embeddings → Contextualized representations (via transformer layers)
  4. Output Generation: Final representations → Predictions (via output head)
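
The four steps can be traced in a toy pipeline. Everything here is a stand-in chosen for illustration: the five-word vocabulary, the random embedding table, the running-average `contextualize` function (a placeholder for real transformer layers, not attention), and an output head that reuses the embedding table.

```python
import random
import re

random.seed(0)
vocab = ["<unk>", "hello", ",", "world", "!"]  # hypothetical vocabulary
token_to_id = {tok: i for i, tok in enumerate(vocab)}
dim = 4
embedding_table = [[random.gauss(0, 1) for _ in range(dim)] for _ in vocab]

def tokenize(text: str) -> list[int]:
    # Step 1: text -> token IDs (unknown tokens map to ID 0)
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [token_to_id.get(tok, 0) for tok in tokens]

def embed(ids: list[int]) -> list[list[float]]:
    # Step 2: token IDs -> dense vectors
    return [embedding_table[i] for i in ids]

def contextualize(vectors: list[list[float]]) -> list[list[float]]:
    # Step 3 (placeholder): each position becomes the mean of itself and all
    # earlier positions, a crude stand-in for causal transformer layers
    out = []
    for t in range(len(vectors)):
        prefix = vectors[: t + 1]
        out.append([sum(col) / len(prefix) for col in zip(*prefix)])
    return out

def output_logits(vector: list[float]) -> list[float]:
    # Step 4: score every vocabulary entry by dot product with its embedding
    return [sum(a * b for a, b in zip(vector, row)) for row in embedding_table]

ids = tokenize("Hello, world!")
hidden = contextualize(embed(ids))
logits = output_logits(hidden[-1])
next_id = max(range(len(vocab)), key=lambda i: logits[i])

print(ids)             # [1, 2, 3, 4]
print(vocab[next_id])  # arbitrary, since the weights are random and untrained
```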

Key Insight: Tokenization determines which units receive embeddings and how long each sequence is, and this in turn shapes how effectively the model's parameters can capture and manipulate semantic information.