14.2 Human & AI Feedback Loops

While automated benchmarks are essential for scale, the ultimate measure of an AI's performance is its utility and safety in the real world. This is where feedback loops, involving both humans and other AIs, become critical. These loops are the primary mechanism for collecting the data needed to "align" a model with human values and preferences.

Reinforcement Learning from Human Feedback (RLHF)

RLHF is one of the most prominent techniques for fine-tuning language models to be helpful and harmless. It is a three-stage process that uses human-provided data to teach a model what constitutes a "good" response.

1. Supervised Fine-Tuning (SFT): A base pre-trained model is first fine-tuned on a small, high-quality dataset of prompt-response pairs created by human labelers. This teaches the model the basic style and format of a helpful assistant.
2. Reward Model Training: This is the core of RLHF.
   - Human labelers are shown a prompt and two or more different responses from the SFT model.
   - They rank the responses from best to worst based on criteria like helpfulness, truthfulness, and safety.
   - This comparison data (e.g., "Response A is better than Response B") is used to train a separate "reward model." The reward model's job is to take a prompt and a response and output a scalar score indicating how "good" the response is from a human perspective.
3. Reinforcement Learning: The SFT model is further fine-tuned using reinforcement learning.
   - The model generates a response to a prompt.
   - The reward model scores this response.
   - This score is used as the "reward" signal to update the language model's parameters using an RL algorithm like Proximal Policy Optimization (PPO).
This process encourages the model to generate responses that the reward model scores highly and, to the extent the reward model is an accurate proxy for human judgment, responses that humans would prefer.
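
The reward-model and reinforcement-learning stages above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not a production recipe: each response is reduced to a hand-made two-number feature vector, the reward model is linear and trained with a Bradley-Terry style pairwise loss, and the RL step uses plain REINFORCE with a baseline over a fixed menu of candidates rather than PPO on a full language model (no clipping, no KL penalty). All names (`RESPONSES`, `train_reward_model`, `train_policy`) and all numbers are hypothetical.

```python
import math
import random

# Hypothetical toy data: each candidate response is a hand-made feature vector
# (first number ~ "helpfulness", second ~ "harmfulness").
RESPONSES = {
    "helpful": [1.0, 0.0],
    "vague": [0.3, 0.2],
    "harmful": [0.0, 1.0],
}


def reward(weights, features):
    """Linear reward model: maps one response to a scalar score."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(preferences, steps=500, lr=0.1):
    """Fit weights on (preferred, rejected) pairs by gradient descent on
    the pairwise loss -log(sigmoid(r_preferred - r_rejected))."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        for better, worse in preferences:
            fb, fw = RESPONSES[better], RESPONSES[worse]
            margin = reward(weights, fb) - reward(weights, fw)
            # Gradient of the loss w.r.t. weights is -(1 - sigmoid(margin)) * (fb - fw).
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            weights = [w + lr * scale * (b - c) for w, b, c in zip(weights, fb, fw)]
    return weights


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(weights, steps=2000, lr=0.1, seed=0):
    """REINFORCE (a simpler stand-in for PPO) on a softmax policy that
    chooses among the fixed candidate responses."""
    rng = random.Random(seed)
    names = list(RESPONSES)
    logits = [0.0] * len(names)
    for _ in range(steps):
        probs = softmax(logits)
        # 1. The policy "generates" a response by sampling a candidate.
        i = rng.choices(range(len(names)), weights=probs)[0]
        # 2. The reward model scores it.
        r = reward(weights, RESPONSES[names[i]])
        # Baseline: expected reward under the current policy (reduces variance).
        baseline = sum(p * reward(weights, RESPONSES[n]) for p, n in zip(probs, names))
        advantage = r - baseline
        # 3. Policy-gradient update: d log pi(i) / d logit_j = 1[i == j] - probs[j].
        for j in range(len(names)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * advantage * (indicator - probs[j])
    return dict(zip(names, softmax(logits)))


if __name__ == "__main__":
    # Labeler comparisons in the form (preferred, rejected).
    prefs = [("helpful", "vague"), ("vague", "harmful"), ("helpful", "harmful")]
    w = train_reward_model(prefs)
    print("reward weights:", w)
    print("policy after RL:", train_policy(w))
```

In a real pipeline the feature vectors are replaced by a neural network over the prompt and response text, and the policy update is usually constrained so the model does not drift too far from the SFT model; the sketch leaves both out to keep the data flow visible.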

AI Feedback (RLAIF) and Judge Models

Collecting human feedback is slow and expensive. A recent innovation is to use a powerful AI model as a substitute for the human labeler. This is sometimes called Reinforcement Learning from AI Feedback (RLAIF).

Constitutional AI: An approach from Anthropic in which a model is given a written set of principles, or "constitution" (e.g., "be helpful," "don't be harmful"). The model critiques and revises its own responses against these principles, and it is also asked which of two responses better follows them. Those AI-generated comparisons supply the preference data for reward model training, so humans do not need to label every case directly.
Judge Models: A highly capable model (like GPT-4 or a specialized "evaluator" agent) is used to score or rank the outputs of other models. This is the principle behind benchmarks like MT-Bench. The judge model is given a rubric and asked to act as an impartial evaluator, providing a scalable way to assess model quality.
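
A judge-model loop along these lines might look like the following minimal sketch. It is not the MT-Bench implementation: the rubric wording, the "Score: N" reply format, and every function name are assumptions made for illustration, and `stub_judge` is a placeholder that stands in for a call to a real model so the example runs offline.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical rubric; the criteria mirror those given to human labelers.
RUBRIC = (
    "You are an impartial evaluator. Rate the response to the prompt on "
    "helpfulness, truthfulness, and safety. Reply with a line of the form "
    "'Score: N', where N is an integer from 1 to 10."
)


def build_judge_prompt(prompt: str, response: str) -> str:
    """Combine the rubric, the original prompt, and one candidate response."""
    return f"{RUBRIC}\n\n[Prompt]\n{prompt}\n\n[Response]\n{response}\n"


def parse_score(judge_output: str) -> Optional[int]:
    """Extract the integer score; return None if it is missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", judge_output)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def rank_responses(
    prompt: str, responses: List[str], judge: Callable[[str], str]
) -> List[Tuple[int, str]]:
    """Score every response with the judge and sort from best to worst."""
    scored = []
    for response in responses:
        score = parse_score(judge(build_judge_prompt(prompt, response)))
        if score is not None:  # skip replies the parser cannot read
            scored.append((score, response))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def to_preference_pairs(scored: List[Tuple[int, str]]) -> List[Tuple[str, str]]:
    """Turn judged scores into (preferred, rejected) pairs, skipping ties."""
    pairs = []
    for i, (hi_score, better) in enumerate(scored):
        for lo_score, worse in scored[i + 1:]:
            if hi_score > lo_score:
                pairs.append((better, worse))
    return pairs


def stub_judge(judge_prompt: str) -> str:
    """Placeholder for a real model call: longer responses get higher scores."""
    response = judge_prompt.split("[Response]\n", 1)[1].strip()
    return f"Score: {min(10, 1 + len(response.split()))}"


if __name__ == "__main__":
    ranked = rank_responses(
        "What is the capital of France?",
        ["No.", "Paris is the capital of France."],
        stub_judge,
    )
    print(ranked)
    print(to_preference_pairs(ranked))
```

The pairs returned by `to_preference_pairs` have the same shape as human comparison data ("Response A is better than Response B"), which is what lets AI feedback slot into the reward-model stage in place of human labels.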

Feedback loops, whether from humans or AI, are the key to moving beyond raw capability and creating AI systems that are aligned, safe, and genuinely useful.