Lesson 6.4.3: Reinforcement Learning from Human Feedback (RLHF)


Reinforcement Learning from Human Feedback (RLHF) is a technique used to align large language models (LLMs) with human preferences. Unlike traditional supervised learning, where models are trained on labeled input-output pairs, RLHF uses reinforcement learning (RL) to optimize model behavior based on feedback from humans or a learned reward model. In other words, RLHF is a feedback loop: the model generates outputs, another model (the reward model) scores those outputs, and the scores are fed back in to fine-tune the original model. RLHF is more of a framework than a prescription for one particular way to train a model, and it is generally employed to fine-tune and optimize existing models.

Motivation: Researchers wanted to do more with less human-labeled data, so they automated the reward calculation with a learned reward model, reducing the amount of direct human feedback needed to train the model.

How RLHF Works

  • Pretraining a Base Model:

    • A base LLM (e.g., GPT) is pretrained on a large corpus of text using unsupervised learning.
  • Supervised Fine-Tuning (SFT):

    • The model is fine-tuned on high-quality human-generated responses to improve its initial behavior.
  • Reward Model Training:

    • Humans rank multiple model-generated responses (e.g., "Response A is better than Response B").
    • A separate reward model is trained to predict these human preferences (see the reward-model sketch after this list).
  • RL Optimization (PPO):

    • The LLM is fine-tuned using Proximal Policy Optimization (PPO), an RL algorithm, to maximize the reward predicted by the reward model (see the PPO sketch after the reward-model example below).
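
The sketch below illustrates how a reward model can be trained on ranked responses using a pairwise (Bradley-Terry style) preference loss. The `RewardModel` class, its hidden size, and the random embeddings are illustrative assumptions; in practice the reward model usually reuses the SFT model's transformer backbone with a scalar scoring head.

```python
# Minimal reward-model training sketch on pairwise preferences (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: scores a pooled text embedding with a linear head."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # Returns one scalar reward per (prompt, response) pair in the batch.
        return self.score_head(pooled_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push r(chosen) above r(rejected)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage: random embeddings stand in for encoded (prompt, response) pairs.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

chosen_emb = torch.randn(4, 768)    # human-preferred responses
rejected_emb = torch.randn(4, 768)  # dispreferred responses

loss = preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
optimizer.step()
```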
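
Once the reward model is trained, the policy (the LLM) is updated with a PPO-style objective. The sketch below shows the clipped surrogate loss plus a KL penalty against the frozen reference (SFT) model; the function name, tensor shapes, and the KL coefficient are illustrative assumptions rather than a specific library's API.

```python
# PPO-style policy update used in RLHF (assumed, simplified per-token form).
import torch

def ppo_rlhf_loss(logprobs_new: torch.Tensor,   # log pi_new(token) under current policy
                  logprobs_old: torch.Tensor,   # log pi_old(token) from the sampling policy
                  logprobs_ref: torch.Tensor,   # log pi_ref(token) from the frozen SFT model
                  advantages: torch.Tensor,     # advantages derived from reward-model scores
                  clip_eps: float = 0.2,
                  kl_coef: float = 0.1) -> torch.Tensor:
    # Clipped surrogate objective from PPO.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # KL penalty keeps the fine-tuned policy close to the reference model.
    kl_penalty = (logprobs_new - logprobs_ref).mean()

    return policy_loss + kl_coef * kl_penalty

# Example call with dummy tensors standing in for one batch of sampled tokens.
loss = ppo_rlhf_loss(
    logprobs_new=torch.randn(8, requires_grad=True),
    logprobs_old=torch.randn(8),
    logprobs_ref=torch.randn(8),
    advantages=torch.randn(8),
)
loss.backward()
```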

Drawback

  • RLHF has the drawback of being unstable, and it requires training a separate reward model, which is usually initialized from a copy of the original LLM and is therefore large.

What if we could do without the reward model and without reinforcement learning, and use just a cross-entropy loss?

