Deliberative alignment: Reasoning enables safer language models
An overview of OpenAI’s new alignment approach for its o-series models, which are taught safety specifications directly and learn how to reason over them.
OpenAI presents a training paradigm called deliberative alignment, which explicitly trains reasoning LLMs to reason over human-written, interpretable safety specifications before responding. OpenAI used deliberative alignment to train its o-series models to apply chain-of-thought (CoT) reasoning to user prompts, retrieve the relevant passages from OpenAI’s internal policies, and produce safer responses.
The method achieves highly precise adherence to OpenAI’s safety policies without requiring human-labeled completions or CoTs. OpenAI finds that o1 saturates performance on several challenging safety datasets and significantly outperforms GPT-4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks. In OpenAI’s view, this offers a promising new avenue for improving safety, and it is a positive example of how advances in capabilities can also be used to enhance safety.
Summary
Even with intensive safety training, contemporary LLMs still over-refuse harmless requests, comply with harmful prompts, and fall victim to jailbreak attacks. One cause of these failures is that models must respond immediately, without enough time to reason through complex or borderline safety scenarios. Another is that, rather than learning the underlying safety standards explicitly in natural language, LLMs must infer desired behavior indirectly from large collections of labeled examples. This forces models to reverse-engineer the ideal behavior from instances, leading to poor data efficiency and poorly calibrated decision boundaries.
Deliberative alignment addresses both problems. It is the first approach to teach a model the text of its safety specifications directly and to train it to deliberate over them at inference time. The result is responses that are safer and better calibrated to the situation at hand.
In contrast, earlier alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF), including Constitutional AI (CAI), only use safety specifications to generate training labels; the model never receives the specifications themselves. Another distinctive feature of deliberative alignment is its ability to perform complex reasoning about safety specifications at inference time. While self-refinement strategies such as Self-Refine also improve responses at inference time, they do not reason directly over learned safety specifications (which were never taught to the model) and instead constrain the model to predefined reasoning paths.
Approach
Deliberative alignment training combines process- and outcome-based supervision:
- First, OpenAI trains an o-style model purely for helpfulness, without any safety-relevant data.
- Next, it builds a dataset of (prompt, completion) pairs whose CoTs reference the specifications. To do this, it inserts the relevant safety specification text for each conversation into the system prompt, generates model completions, and then strips the system prompt from the data.
- Using this dataset, it applies incremental supervised fine-tuning (SFT), which gives the model a strong prior for safe reasoning. Through SFT, the model learns both the content of its safety specifications and how to reason over them to produce compliant responses.
- Finally, the model is trained with reinforcement learning (RL) to make better use of its CoT. This stage uses a judge reward model that is given access to the safety policies and provides an additional reward signal, as sketched below.
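The reward signal in the RL stage can be pictured roughly as follows. This is a minimal, illustrative sketch and not OpenAI’s implementation: the `judge_model` and `policy` objects and their `score`, `generate`, and `update` methods are hypothetical stand-ins for a spec-aware grader model and a generic RL training loop.

```python
# Illustrative sketch (hypothetical APIs): a "judge" reward model is shown the
# written safety specification together with the prompt and candidate answer,
# and grades compliance. The deployed prompt contains no spec text, so the
# policy model must recall and apply the spec on its own.

def spec_aware_reward(judge_model, safety_spec: str, prompt: str, completion: str) -> float:
    """Score a completion for compliance with the written safety policy (0 to 1)."""
    grading_prompt = (
        "Grade the assistant's answer for compliance with the policy below.\n"
        f"--- SAFETY POLICY ---\n{safety_spec}\n"
        f"--- USER PROMPT ---\n{prompt}\n"
        f"--- ASSISTANT ANSWER ---\n{completion}\n"
        "Return a compliance score between 0 and 1."
    )
    return judge_model.score(grading_prompt)  # hypothetical grader call


def rl_step(policy, judge_model, safety_spec: str, prompts: list[str]) -> None:
    """One schematic RL update: sample completions, reward them, update the policy."""
    rollouts = []
    for prompt in prompts:
        completion = policy.generate(prompt)          # no spec in the prompt
        reward = spec_aware_reward(judge_model, safety_spec, prompt, completion)
        rollouts.append((prompt, completion, reward))
    policy.update(rollouts)  # e.g., a policy-gradient style update
```

In this sketch the judge sees only the final answer, not the chain of thought, so the reward shapes outcomes while leaving the model free in how it reasons.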
OpenAI’s training process requires no human-labeled completions; instead, it automatically generates training data from safety specifications and safety-categorized prompts. By removing a major bottleneck of standard LLM safety training, namely its heavy reliance on human-labeled data, deliberative alignment’s synthetic data generation pipeline offers a scalable approach to alignment. A sketch of this data generation step follows.
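The following sketch illustrates the data generation described above, under stated assumptions: `spec_generator` and `judge_model` and their methods are hypothetical, and the exact prompting and filtering details are simplified. It is not OpenAI’s pipeline, only a schematic of the idea.

```python
# Illustrative sketch (hypothetical APIs): build (prompt, completion) pairs
# whose CoTs cite the safety specification. The spec is placed in the system
# prompt only while generating; it is stripped from the stored example, so at
# SFT time the model must internalize the spec rather than read it.

def build_sft_example(spec_generator, judge_model, safety_spec: str, prompt: str,
                      min_score: float = 0.8):
    system_prompt = f"Follow this safety policy when answering:\n{safety_spec}"

    # Generate a chain of thought plus final answer with the spec in context.
    cot, answer = spec_generator.generate_with_cot(system=system_prompt, user=prompt)

    # Keep only completions that a spec-aware judge rates as compliant.
    if judge_model.score_compliance(safety_spec, prompt, cot, answer) < min_score:
        return None

    # Drop the system prompt: the stored pair teaches the model to recall and
    # reason over the spec without being shown it at inference time.
    return {"prompt": prompt,
            "completion": {"chain_of_thought": cot, "answer": answer}}


def build_dataset(spec_generator, judge_model, safety_spec: str,
                  safety_categorized_prompts: list[str]):
    examples = (build_sft_example(spec_generator, judge_model, safety_spec, p)
                for p in safety_categorized_prompts)
    return [ex for ex in examples if ex is not None]
```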
Results
OpenAI evaluates o1’s safety against Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o on a range of internal and external safety benchmarks (such as content-policy refusals and jailbreaks). The o1 model achieves a Pareto improvement on both under- and over-refusals and saturates several of OpenAI’s hardest safety evaluations, meaning it is better at avoiding harmful outputs while remaining more permissive toward benign prompts. OpenAI also finds that safety training with deliberative alignment enables substantial generalization to out-of-distribution safety scenarios.
Conclusion
Advances in LLM capabilities, such as those demonstrated by o1 and o3, come with significant risks. As models become more intelligent and autonomous, the scale of harm they could cause through misalignment or misuse grows substantially, which underscores the urgent need for further AI safety research. OpenAI is investing heavily in this field to ensure that AI systems remain consistent with human values as they grow more capable, particularly in areas such as monitoring chains of thought for deception.
Deliberative alignment is the most recent development in OpenAI’s safety work, and its results are encouraging. The method is effective at increasing adherence to safety specifications and resilience to jailbreaks, and it makes it easier to specify where the safety boundary should lie. Its application to the o-series models is a promising example of how improvements in model capabilities can be harnessed to enhance AI safety.