Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a training loop that fine-tunes a pretrained language or vision model by optimizing it against reward signals derived from human preferences. Annotators rank multiple model outputs for the same prompt; a reward model is trained to reproduce these rankings and assigns scalar scores to new outputs (a pairwise-loss sketch follows below). The base model then undergoes reinforcement learning, often with Proximal Policy Optimization (PPO), to maximize the learned reward while a KL (Kullback-Leibler) divergence penalty keeps its output distribution close to that of the original reference model (see the second sketch below).

RLHF aligns generative AI with human values, tone, and safety guidelines, reducing toxic or nonsensical responses without hand-coded rules. It powers models such as GPT-4 and Gemini and has been credited with boosting helpfulness and factuality scores on benchmarks such as HELM and MT-Bench.

Key challenges include annotator bias, reward hacking, and high compute cost; common mitigations are diverse rater pools, iterative audits, and off-policy evaluation.
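A minimal sketch of how a reward model can learn from ranked outputs, assuming a pairwise Bradley-Terry style objective on preferred vs. rejected responses; the names `reward_model`, `chosen_ids`, and `rejected_ids` are illustrative placeholders, not any particular library's API.

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Score the human-preferred and rejected responses to the same prompt
    and push the preferred score above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # scalar score per example, shape (batch,)
    r_rejected = reward_model(rejected_ids)  # shape (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the preferred
    # response receives the higher reward-model score.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Averaging this loss over many ranked pairs is what lets the reward model generalize its scoring to outputs the annotators never saw.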
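A minimal sketch of the KL-penalized reward commonly optimized during the PPO step, assuming a sequence-level reward-model score and an approximate KL computed from log-probabilities on the sampled tokens; `beta` and the tensor names are illustrative assumptions rather than a fixed implementation.

```python
def kl_penalized_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    """Reward used in the RL step: the reward-model score for a sampled
    response minus a KL penalty toward the reference (pretrained/SFT) model.

    rm_score:        shape (batch,)          score from the reward model
    logprobs_*:      shape (batch, seq_len)  per-token log-probs of the sample
    beta:            strength of the KL penalty
    """
    # Approximate KL(policy || reference) on the sampled tokens as the sum of
    # log-prob gaps; exact KL would marginalize over the full vocabulary.
    approx_kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return rm_score - beta * approx_kl
```

Larger `beta` keeps the fine-tuned policy closer to the original model, trading some reward for stability and reduced reward hacking.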