3.2 Instruction Tuning & Alignment

Learning Objectives

  • Understand the multi-stage LLM training pipeline
  • Learn about pre-training data sources and composition
  • Explore instruction tuning and RLHF processes
  • Analyze data quality, scale, and ethical considerations

LLM Training Pipeline

Complete Training Process

Stage 1: Pre-training

Goal: Learn general language understanding from massive text data

  • Data: Hundreds of billions to trillions of tokens from diverse sources
  • Objective: Next-token prediction (autoregressive)
  • Duration: Weeks to months on thousands of GPUs
  • Output: Base model with broad knowledge
GPT-3 Example: 175B parameters, 96 layers, ~300B training tokens, with compute cost estimated by third parties at roughly $4.6M
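
To make the next-token prediction objective concrete, here is a minimal sketch in plain Python. The three-token vocabulary and the probabilities are made up for illustration; real pre-training computes the same average negative log-likelihood with a deep learning framework over vastly more tokens.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average negative log-likelihood of the tokens that actually came next.

    predicted_probs[t] is the model's distribution over the vocabulary at
    position t; target_ids[t] is the index of the true next token.
    """
    total = 0.0
    for probs, target in zip(predicted_probs, target_ids):
        total += -math.log(probs[target])
    return total / len(target_ids)

# Hypothetical vocabulary: 0 = "the", 1 = "cat", 2 = "sat"
probs = [
    [0.1, 0.8, 0.1],  # after "the", the toy model favors "cat"
    [0.2, 0.2, 0.6],  # after "the cat", the toy model favors "sat"
]
targets = [1, 2]

loss = next_token_loss(probs, targets)
perplexity = math.exp(loss)  # perplexity is exp(average loss)
print(f"loss={loss:.4f} perplexity={perplexity:.4f}")
```

Lower loss means the model assigns more probability to the text it sees, which is the only training signal used during pre-training.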
↓

Stage 2: Instruction Tuning (SFT)

Goal: Teach the model to follow instructions and behave helpfully

  • Data: High-quality instruction-response pairs
  • Size: 10K-100K examples (much smaller than pre-training)
  • Method: Supervised fine-tuning on human demonstrations
  • Output: Model that can follow instructions
Example: "Explain quantum physics" → Detailed, helpful explanation
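
A common (though not universal) practice in supervised fine-tuning is to compute the loss only on the response tokens, so the model learns to answer rather than to reproduce the prompt. The sketch below illustrates that idea; the prompt template and the whitespace "tokenizer" are illustrative assumptions, not a standard format.

```python
def build_sft_example(instruction, response, tokenize):
    """Concatenate prompt and response and mark which tokens are trained on."""
    # Hypothetical prompt template; real projects define their own.
    prompt_tokens = tokenize(f"### Instruction:\n{instruction}\n### Response:\n")
    response_tokens = tokenize(response)
    tokens = prompt_tokens + response_tokens
    # 0 = ignored by the loss (prompt), 1 = contributes to the loss (response)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

def toy_tokenize(text):
    """Stand-in tokenizer for illustration: split on whitespace."""
    return text.split()

tokens, loss_mask = build_sft_example(
    "Explain quantum physics",
    "Quantum physics studies matter at small scales.",
    toy_tokenize,
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

With a real subword tokenizer the same mask would be applied to token IDs before averaging the next-token loss.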
↓

Stage 3: RLHF (Reinforcement Learning from Human Feedback)

Goal: Align model outputs with human preferences and values

  • Reward Model: Train a model to predict human preferences
  • PPO Training: Use reinforcement learning to optimize for human-preferred outputs
  • Safety: Reduce harmful, biased, or unhelpful responses
  • Output: Aligned model ready for deployment
Key Innovation: Makes models more helpful, harmless, and honest
↓

Stage 4: Deployment & Monitoring

Goal: Serve the model to users while continuously improving

  • Infrastructure: Scalable serving with load balancing
  • Monitoring: Track performance, safety, and user satisfaction
  • Updates: Regular fine-tuning with new data and feedback
  • Safety: Content filtering and abuse detection

Pre-training Data Sources & Composition

Typical LLM Training Data Mix

  • Web Pages (CommonCrawl): 60%
  • Books & Literature: 16%
  • News Articles: 10%
  • Academic Papers: 8%
  • Code Repositories: 4%
  • Reference & Other: 2%
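
One way such a mix is applied in practice is by sampling each training document's source in proportion to its weight. The sketch below assumes the percentages listed above and uses hypothetical source names; actual pipelines often weight by tokens and may upsample high-quality sources.

```python
import random

# Mixture weights copied from the mix listed above (percent of training data).
DATA_MIX = {
    "web": 60,
    "books": 16,
    "news": 10,
    "academic": 8,
    "code": 4,
    "reference_other": 2,
}

def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)  # seeded for reproducibility
    names = list(DATA_MIX)
    weights = [DATA_MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(10_000)
web_share = draws.count("web") / len(draws)
print(f"web share in sample: {web_share:.2%}")
```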

Web Content

  • CommonCrawl: Petabyte-scale web scrapes
  • Wikipedia: High-quality encyclopedic content
  • Forums: Reddit, Stack Overflow discussions
  • Quality: Filtered for language, duplicates, toxic content
Scale: ~1T tokens, 50+ languages

Books & Literature

  • BookCorpus: 10K+ books
  • Project Gutenberg: Public domain literature
  • OpenLibrary: Diverse literary works
  • Benefit: Long-form reasoning, narrative understanding
Scale: 100B+ tokens, high quality

Academic Content

  • ArXiv: Scientific papers and preprints
  • PubMed: Medical literature
  • Academic journals: Peer-reviewed research
  • Value: Technical knowledge, formal reasoning
Scale: 50B+ tokens, expert level

Code & Programming

  • GitHub: Open source repositories
  • GitLab, Bitbucket: Additional code sources
  • Documentation: API docs, tutorials
  • Impact: Programming abilities, structured thinking
Scale: 25B+ tokens, 100+ languages

Data Quality Challenges

  • Noise: Web content contains errors, spam, low-quality text
  • Bias: Training data reflects societal biases and stereotypes
  • Privacy: Personal information may be inadvertently included
  • Copyright: Legal concerns around use of copyrighted content
  • Duplication: Same content appears multiple times, affecting training

RLHF: Reinforcement Learning from Human Feedback

RLHF Process Flow

1. Collect Comparisons: Humans rank model outputs
2. Train Reward Model: Predict human preferences
3. PPO Training: Optimize for high rewards
This iterative process aligns the model with human values and preferences

RLHF Component | Purpose | Data Requirements | Key Challenges
Human Annotations | Provide preference signals | 10K-100K comparisons | Consistency, cost, scalability
Reward Model | Score outputs by human preference | Same base model architecture | Reward hacking, generalization
PPO Policy | Optimize for reward while staying close to SFT model | Continuous interaction with reward model | Training stability, KL divergence control
Safety Filtering | Prevent harmful outputs | Adversarial prompts, red team data | Balancing helpfulness vs safety
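
Reward models are commonly trained on the comparison data with a pairwise (Bradley-Terry style) loss: the score of the human-preferred response should exceed the score of the rejected one. The sketch below shows that loss on scalar scores; the function name and example values are illustrative, and a real reward model would produce the scores from a neural network.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically simple form.

    The loss is small when the chosen response scores well above the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

good_ranking = pairwise_preference_loss(2.0, -1.0)   # model agrees with the human
bad_ranking = pairwise_preference_loss(-1.0, 2.0)    # model disagrees
tie = pairwise_preference_loss(0.5, 0.5)             # model is undecided

print(f"agree={good_ranking:.4f} disagree={bad_ranking:.4f} tie={tie:.4f}")
```

Minimizing this loss over many comparisons pushes the reward model to rank responses the way annotators did.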

RLHF Pseudocode

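import copy

class RLHFTrainer:
    """Pseudocode sketch. The model methods (generate, logprobs, score) and the
    PPO helper are placeholder interfaces, not a specific library's API."""

    def __init__(self, base_model, reward_model, kl_coeff=0.1):
        self.policy = copy.deepcopy(base_model)           # Trainable policy (starts from the SFT model)
        self.reference_model = copy.deepcopy(base_model)  # Frozen reference copy
        self.reward_model = reward_model
        self.kl_coeff = kl_coeff                          # Weight of the KL penalty (illustrative default)
        self.ppo_optimizer = PPO(self.policy)             # Placeholder PPO optimizer

    def train_step(self, prompts):
        # Generate responses from current policy
        responses = self.policy.generate(prompts)

        # Get rewards from reward model
        rewards = self.reward_model.score(prompts, responses)

        # Estimate the KL penalty (stay close to reference model)
        ref_logprobs = self.reference_model.logprobs(prompts, responses)
        policy_logprobs = self.policy.logprobs(prompts, responses)
        kl_penalty = (policy_logprobs - ref_logprobs).mean()

        # Combine rewards with KL penalty
        adjusted_rewards = rewards - self.kl_coeff * kl_penalty

        # PPO update: clear stale gradients, compute the loss, backpropagate, step
        self.ppo_optimizer.zero_grad()
        policy_loss = self.ppo_optimizer.compute_loss(
            prompts, responses, adjusted_rewards, policy_logprobs
        )
        policy_loss.backward()
        self.ppo_optimizer.step()

        return {
            "reward": rewards.mean(),
            "kl_div": kl_penalty,
            "policy_loss": policy_loss
        }

# Training loop (sft_model, reward_model, num_epochs, dataloader, and
# log_metrics are assumed to be defined elsewhere)
trainer = RLHFTrainer(sft_model, reward_model)
for epoch in range(num_epochs):
    for batch_prompts in dataloader:
        metrics = trainer.train_step(batch_prompts)
        log_metrics(metrics)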
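
The `compute_loss` call in the pseudocode above hides the core of PPO: a clipped surrogate objective that limits how far a single update can move the policy. The sketch below shows that objective for one sample using plain floats; the function name and the `clip_eps=0.2` default are illustrative assumptions, and real implementations operate on batched tensors with estimated advantages.

```python
import math

def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sample (to be maximized)."""
    # Probability ratio between the updated policy and the policy that
    # generated the sample.
    ratio = math.exp(new_logprob - old_logprob)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio outside
    # the [1 - clip_eps, 1 + clip_eps] range.
    return min(ratio * advantage, clipped_ratio * advantage)

capped = ppo_clipped_objective(0.5, 0.0, advantage=1.0)      # gain is capped by clipping
penalized = ppo_clipped_objective(0.5, 0.0, advantage=-1.0)  # bad action made more likely
unchanged = ppo_clipped_objective(-1.0, -1.0, advantage=2.0) # ratio of exactly 1

print(capped, penalized, unchanged)
```

The training loss is the negative of this objective averaged over a batch.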

Data Processing & Quality Control

Content Filtering

  • Language Detection: Filter non-target languages
  • Quality Scoring: Remove low-quality, spam content
  • Toxicity Detection: Filter harmful, offensive content
  • Privacy Scrubbing: Remove PII, sensitive data
Tools: Classifiers, regex patterns, blocklists
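
As a rough illustration of the regex and blocklist tools mentioned above, the sketch below masks email addresses and US-style phone numbers and rejects text containing blocklisted words. The patterns and the blocklist entries are simplified assumptions; production pipelines use far broader patterns and trained classifiers.

```python
import re

# Simplified patterns for illustration; they will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Hypothetical blocklist entries.
BLOCKLIST = {"spamword", "clickbait"}

def scrub_pii(text):
    """Replace emails and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)

def passes_blocklist(text, max_hits=0):
    """Return True if the text contains at most max_hits blocklisted words."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for word in words if word in BLOCKLIST)
    return hits <= max_hits

sample = "Contact jane.doe@example.com or 555-123-4567 for details."
scrubbed = scrub_pii(sample)
print(scrubbed)
print(passes_blocklist("Best clickbait ever"))
```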

Deduplication

  • Exact Matching: Remove identical documents
  • Near-Duplicate Detection: Fuzzy matching algorithms
  • Sentence-Level: Remove repeated sentences
  • Impact: Prevents memorization, improves generalization
Techniques: MinHash, LSH, Jaccard similarity
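
The sketch below shows how Jaccard similarity and MinHash relate, using only the standard library. It compares word shingles exactly, then approximates the same similarity from fixed-length signatures. The seeded MD5 hashing and the 128-hash signature size are illustrative choices; large-scale pipelines typically add LSH bucketing so that not every pair of documents has to be compared.

```python
import hashlib

def shingles(text, k=3):
    """Set of overlapping k-word shingles from a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def _hash(seed, item):
    """Seeded 64-bit hash built from MD5 (one 'hash function' per seed)."""
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over all items."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"
doc_c = "large language models learn from text data"

set_a, set_b, set_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (set_a, set_b, set_c))

exact_ab = jaccard(set_a, set_b)
approx_ab = estimated_jaccard(sig_a, sig_b)
print(f"exact={exact_ab:.3f} estimated={approx_ab:.3f}")
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates and only one copy kept.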

Tokenization

  • Subword Tokenization: BPE or unigram models, often via the SentencePiece library
  • Vocabulary Size: 32K-100K tokens
  • Special Tokens: <start>, <end>, <unk>
  • Efficiency: Balance compression vs interpretability
Goal: ~3-4 characters per token for English
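
To show how BPE builds a subword vocabulary, the toy sketch below repeatedly merges the most frequent adjacent symbol pair in a tiny corpus. It is a simplified illustration with hypothetical function names; real tokenizers work on bytes or pre-tokenized words with frequencies and perform tens of thousands of merges.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged_words = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_words.append(tuple(out))
    return merged_words

def train_bpe(words, num_merges):
    """Apply num_merges greedy merges, returning the merges and the final corpus."""
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

corpus = [tuple("lower"), tuple("lowest"), tuple("low")]
tokens_before = sum(len(word) for word in corpus)
merges, corpus_after = train_bpe(corpus, num_merges=2)
tokens_after = sum(len(word) for word in corpus_after)
print(merges, corpus_after, tokens_before, tokens_after)
```

Each merge shortens the tokenized corpus, which is how a larger vocabulary raises the average number of characters per token.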

Ethical Considerations

  • Consent: Use data with appropriate permissions
  • Bias Mitigation: Balance representation across groups
  • Copyright Respect: Avoid unauthorized copyrighted content
  • Transparency: Document data sources and processing
Standards: Data governance frameworks, ethical AI guidelines

Data Processing Pipeline Example

import re
from transformers import AutoTokenizer

class DataProcessor:
    def __init__(self, tokenizer_name="gpt2"):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
        self.min_chars = 100     # Minimum document length (characters)
        self.min_tokens = 100    # Minimum chunk length (tokens)
        self.max_length = 2048   # Maximum sequence length (tokens)

    def filter_quality(self, text):
        """Filter low-quality content"""
        # Basic length check
        if len(text) < self.min_chars:
            return False

        # Average characters per word; very low values suggest fragmented text.
        # The empty-words guard avoids dividing by zero on whitespace-only input.
        words = text.split()
        if not words or len(text) / len(words) < 3:
            return False

        # Check for excessive repetition
        lines = text.split('\n')
        if len(set(lines)) / len(lines) < 0.3:  # Too repetitive
            return False

        return True

    def clean_text(self, text):
        """Basic text cleaning"""
        # Collapse runs of spaces and tabs, but keep newlines so the
        # line-level filter below still sees individual lines
        text = re.sub(r'[ \t]+', ' ', text)
        text = re.sub(r'\n{3,}', '\n\n', text)

        # Remove very long lines (likely formatting artifacts)
        lines = text.split('\n')
        cleaned_lines = [line for line in lines if len(line) < 1000]

        return '\n'.join(cleaned_lines).strip()

    def tokenize_and_chunk(self, text):
        """Tokenize and create fixed-length chunks"""
        tokens = self.tokenizer.encode(text)

        # Create overlapping chunks
        chunks = []
        stride = self.max_length // 2  # 50% overlap

        for i in range(0, len(tokens), stride):
            chunk = tokens[i:i + self.max_length]
            if len(chunk) >= self.min_tokens:
                chunks.append(chunk)
            if i + self.max_length >= len(tokens):
                break  # Later windows would only repeat tokens already covered

        return chunks

    def process_dataset(self, raw_texts):
        """Process a dataset of raw texts"""
        processed_chunks = []

        for text in raw_texts:
            # Quality filtering
            if not self.filter_quality(text):
                continue

            # Text cleaning
            cleaned_text = self.clean_text(text)

            # Tokenization and chunking
            chunks = self.tokenize_and_chunk(cleaned_text)
            processed_chunks.extend(chunks)

        return processed_chunks

# Usage example (placeholder documents; replace with your own list of strings)
raw_documents = [
    "\n".join(f"Sentence {i} of an example document." for i in range(50)),
]
processor = DataProcessor()
training_data = processor.process_dataset(raw_documents)
print(f"Processed {len(training_data)} training chunks")

LLM Training Best Practices

Data Quality:

  • Rigorous filtering and deduplication
  • Balanced representation across domains
  • Consistent preprocessing and tokenization
  • Ethical sourcing and bias consideration

Training Process:

  • Clear objectives for each training stage
  • Careful scaling of data, model, and compute
  • Iterative refinement with human feedback
  • Safety and alignment throughout the process

Key Insight: The quality of training data is often more important than quantity. Modern LLMs succeed through careful curation of diverse, high-quality datasets combined with sophisticated training techniques like RLHF.