RLHF uses direct human input to build a “reward model” and optimise an artificial intelligence agent’s performance through reinforcement learning.
RLHF Meaning
RLHF, or reinforcement learning from human feedback, is ideal for goals that are complex, ill-defined or hard to specify. For instance, no algorithmic approach can define “funny” mathematically, but humans can easily judge jokes generated by a large language model. That human feedback, distilled into a reward function, can then be used to improve the LLM’s joke writing.
Paul F. Christiano of OpenAI and other researchers from OpenAI and DeepMind described RLHF’s performance in training AI models to play Atari games and to perform simulated robotic locomotion in a 2017 publication. Building on this, RLHF-trained AI systems such as OpenAI Five and DeepMind’s AlphaStar defeated top human professional players in Dota 2 and StarCraft, respectively, in 2019.
Perhaps most notably, the 2017 paper observed that its technique, particularly its use of the proximal policy optimisation (PPO) algorithm for updating model weights, greatly lowered the cost of obtaining and distilling human feedback. This opened the door to integrating RLHF with NLP, which pushed both LLMs and RLHF to the forefront of AI research.
OpenAI released the original code detailing the use of RLHF on language models in 2019, and in early 2022 it released the RLHF-trained InstructGPT. This was a critical step in bridging the gap between GPT-3 and the GPT-3.5-turbo models that powered ChatGPT.
RLHF has been used to train cutting-edge LLMs from OpenAI, DeepMind, Google and Anthropic.
How reinforcement learning works
Reinforcement learning (RL) mimics human learning by motivating AI agents to learn through trial and error.
A reinforcement learning mathematical framework includes the following to implement that strategy:
State space
The state space contains all task-relevant information available to the AI agent when it makes a decision, including both known and unknown variables. The state changes with each decision the agent makes.
Action space
All the decisions an AI agent can make are contained in the action space. In a board game, the action space is discrete and well defined: it contains every legal move available to the AI player. In text generation, the action space is huge, encompassing an LLM’s entire token “vocabulary”.
Reward function
AI agents are motivated by reward. In a board game, winning provides a clear definition of success, but designing an effective reward function is difficult when “success” is ambiguous. Mathematically, positive (or negative) feedback must be quantified as a scalar reward signal.
Constraints
Reward functions can be supplemented with constraints: negative rewards for counterproductive actions. A chatbot may be penalised for spouting profanity, and a self-driving car model may be penalised for crashes or drifting out of its lane.
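As a toy illustration, a constrained reward might combine a base quality score with a penalty for rule violations. The function, word list and penalty value below are all hypothetical, a minimal sketch rather than a production design:

```python
# Toy sketch of a reward function with constraints; all values are hypothetical.
BANNED_WORDS = {"badword1", "badword2"}  # placeholder list of banned terms

def constrained_reward(response: str, quality_score: float) -> float:
    """Combine a base quality score with a negative reward for violating a constraint."""
    penalty = 0.0
    if any(word in response.lower() for word in BANNED_WORDS):
        penalty -= 5.0  # hypothetical penalty for banned language
    return quality_score + penalty

# A clean response keeps its quality score; a violating one is penalised.
print(constrained_reward("Here is a helpful answer.", quality_score=2.0))   # 2.0
print(constrained_reward("badword1 is all I can say.", quality_score=2.0))  # -3.0
```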
Policy
A policy is an AI agent’s strategy or “thought process”, the function that guides its behaviour. Mathematically, a policy (“π”) takes a state (“s”) and returns an action (“a”): π(s)→a.
RL algorithms optimise policies to maximise reward. In deep reinforcement learning, the policy is represented as a neural network whose weights are updated according to the reward function during training, so the AI agent learns from experience much as humans do.
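A minimal sketch of a policy represented as a neural network, assuming a small discrete action space and PyTorch; the layer sizes and names here are purely illustrative:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps a state vector to a probability distribution over actions: pi(s) -> a."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # Softmax turns raw scores into action probabilities.
        return torch.softmax(self.net(state), dim=-1)

# Sample an action from the policy for a single (random) state.
policy = Policy(state_dim=4, action_dim=2)
action_probs = policy(torch.randn(4))
action = torch.multinomial(action_probs, num_samples=1)
```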
Conventional RL has shown promising results in many domains, but it can struggle to define a reward function for complex tasks with a fuzzy notion of success. RLHF’s main benefit is that it uses direct human feedback, rather than a hand-crafted objective, to capture nuance and subjectivity.
LLM RLHF
RLHF’s most notable use is improving LLM relevance, accuracy, and ethics for chatbots.
LLMs, like all generative AI models, reproduce the probability distributions of their training data. Despite recent advances in using LLMs as chatbot engines or reasoning engines for general-purpose AI, these language models simply exploit patterns learnt from their training data to predict the next word(s) in a sequence initiated by a prompt. In other words, these models complete prompts rather than answer them.
Language models cannot comprehend user intent without clear instructions. Prompt engineering can help convey a user’s needs to an LLM, but it is impractical to require it for every chatbot interaction.
While standard methods have been used to train LLMs to produce grammatically coherent output, training them to produce “good” output is difficult. Truth, helpfulness, inventiveness, and code snippet executability are more context-dependent than word meanings and language structure.
Data scientists turned to reinforcement learning with human feedback to improve language models for human interaction. RLHF-enhanced InstructGPT models outperformed GPT-3 at following instructions, maintaining factual correctness and avoiding model hallucinations. After GPT-4 launched, OpenAI reported that RLHF quadrupled its accuracy on adversarial questions.
RLHF can also outperform bigger training datasets, enabling more data-efficient model development: OpenAI’s labellers preferred outputs from the 1.3B-parameter InstructGPT over those of the 175B-parameter GPT-3.
How does RLHF work?
RLHF training of an LLM typically proceeds in four steps:
Pre-training models
RLHF is usually used to optimise a pre-trained model, not to train one from scratch. InstructGPT, for example, used RLHF to improve the pre-existing GPT-3 model. OpenAI said in its InstructGPT release announcement that “one way of thinking about this process is that it ‘unlocks’ capabilities that GPT-3 already had, but were difficult to elicit through prompt engineering alone.”
Pre-training is the most resource-intensive RLHF phase. OpenAI reported that InstructGPT’s RLHF training required less than 2% of GPT-3’s pre-training computation and data.
Supervised fine-tuning
Before explicit reinforcement learning, supervised fine-tuning (SFT) primes the model to provide user-expected answers.
As mentioned, pre-training optimises LLMs for completion: they repeat linguistic patterns learnt during pre-training to predict the next words in a sequence started by the user’s prompt. If a user asks, “teach me how to make a resume,” the LLM might respond with “using Microsoft Word.” That is a valid way to complete the sentence, but it does not meet the user’s goal.
SFT therefore uses supervised learning to train models to respond appropriately to varied prompts. Human experts create labelled (prompt, response) examples that demonstrate how to answer questions, summarise text or translate.
High-quality demonstration data takes time and money to produce. Instead of creating new examples, DeepMind found suitable prompt/response pairs within its MassiveWeb dataset by “applying a filtering heuristic based on a common written dialogue format (‘interview transcript’ style)”.
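A minimal sketch of the SFT step, assuming a Hugging Face-style causal language model; the model name and the demonstration pair below are placeholders, not the setup any particular lab used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any pre-trained causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One human-written demonstration: a (prompt, response) pair.
prompt = "Teach me how to make a resume.\n"
response = "Start with your contact details, then summarise your work experience..."

# Standard next-token prediction loss over the concatenated prompt + response.
# (In practice the loss is often masked so that only response tokens count.)
batch = tokenizer(prompt + response, return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```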
Train reward models
A reward model is needed to translate human preference into a numerical reward signal for reinforcement learning. Since there is no simple mathematical or logical formula to define subjective human values, designing an effective reward model is crucial.
The reward model needs enough training data, including direct feedback from human evaluators, to learn how human preferences map onto rewards for different model responses. Once trained, it allows the rest of training to continue offline, without a human in the loop.
To be integrated with the other components of the RL algorithm, a reward model must take in a sequence of text and output a scalar reward value that numerically predicts how much a human user would reward (or penalise) that text.
While it may seem intuitive to have human evaluators rate each model response on a scale of one (worst) to ten (best), it’s impossible to get all human raters aligned on the relative value of a given score, let alone what constitutes a “good” or “bad” response in a vacuum. This can make direct scalar rating noisy and difficult to calibrate.
Instead, a rating system is usually built by comparing human feedback on different model outputs. For example, users compare two analogous text sequences, such as the outputs of two different language models responding to the same prompt, in head-to-head matchups; an Elo rating system then ranks each piece of generated text relative to the others.
The results of these rankings are then normalised into a scalar reward signal used to train the reward model.
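A minimal sketch of the pairwise objective commonly used to train reward models from such comparisons (a Bradley-Terry-style loss). It assumes the reward model already maps each text sequence to a scalar score; the scores and function name are illustrative:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_preferred: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    """Encourage the reward model to score the human-preferred response higher.

    Minimising -log(sigmoid(r_preferred - r_rejected)) pushes the preferred
    response's scalar reward above the rejected one's.
    """
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Example: scalar scores the reward model assigned to two responses to one prompt.
loss = pairwise_reward_loss(torch.tensor([1.8]), torch.tensor([0.4]))
```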
Policy optimisation
The final hurdle of RLHF is deciding how, and how much, to use the reward model to update the AI agent’s policy. Proximal policy optimisation (PPO) is one of the most successful algorithms for this step.
Most machine learning and neural network architectures use gradient descent to minimise a loss function; reinforcement learning algorithms instead use gradient ascent to maximise reward.
If the reward function is used to train the LLM without guardrails, the language model may dramatically change its weights, to the point of outputting gibberish, in order to “game” the reward model. PPO limits how much the AI agent’s policy can be updated in each training iteration, which makes training more stable.
After creating a copy of the initial model and freezing its trainable weights, the PPO algorithm defines a clipping range of [1-ε, 1+ε], where ε is a hyperparameter determining how far the updated policy may deviate from the old one. It then calculates a probability ratio comparing the probability of a given action under the old and new policies.
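A minimal sketch of PPO’s clipped objective, assuming log-probabilities of an action under the frozen and updated policies and an advantage estimate derived from the reward signal; the names and default ε are illustrative:

```python
import torch

def ppo_clipped_objective(logprob_new: torch.Tensor,
                          logprob_old: torch.Tensor,
                          advantage: torch.Tensor,
                          epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective: limits how far the new policy can move."""
    # Probability ratio between the updated policy and the frozen copy.
    ratio = torch.exp(logprob_new - logprob_old)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Take the more pessimistic of the clipped and unclipped terms,
    # then negate so that minimising this value maximises the reward objective.
    return -torch.min(ratio * advantage, clipped * advantage).mean()
```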
PPO offered a simpler and more cost-effective alternative to trust region policy optimisation (TRPO), which provides similar benefits but is more complicated and computationally expensive. Other policy optimisation frameworks, such as advantage actor-critic (A2C), are also viable.
Limitations of RLHF
RLHF models have shown promising results in training AI agents for complicated tasks from robotics and video games to NLP, but they are not without limitations.
The need to gather first-hand human preference data can be expensive and can limit the scalability of the RLHF process. Anthropic and Google have proposed reinforcement learning from AI feedback (RLAIF), which replaces some or all human feedback with evaluations of model responses by another LLM and has yielded results comparable to RLHF.
Human input is highly subjective, making it difficult, if not impossible, to establish firm consensus on what constitutes “high-quality” output. Human annotators often disagree on alleged facts and what “appropriate” model behaviour should mean. This prevents the realisation of a genuine “ground truth” against which model performance can be judged.
Wolf et al. argued in 2016 that toxic behaviour should be a fundamental expectation of human-bot interactions and suggested a method to assess the credibility of human input. In 2022, Meta AI published a paper on
If human feedback comes from a narrow demographic, the model may perform poorly when used by different groups or when prompted on subjects about which the human evaluators hold biases.