With today's introduction of Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning, Microsoft Azure ushers in a new era for compact language models and redefines what is possible with small, efficient AI.
Reasoning models, the next step forward
Reasoning models are trained to leverage inference-time scaling to perform complex tasks that demand multi-step decomposition and internal reflection. They excel in mathematical reasoning and are becoming the backbone of agentic systems that handle intricate, multifaceted tasks. Such capabilities are typically found only in large frontier models. Phi reasoning models introduce a new class of small language models that balance size and performance by combining distillation, reinforcement learning, and high-quality data. They are small enough for low-latency environments yet reason on par with much larger models. This combination allows even resource-constrained devices to carry out complex reasoning tasks efficiently.
Phi-4-reasoning and Phi-4-reasoning-plus
Phi-4-reasoning is a 14-billion-parameter open-weight reasoning model that rivals much larger models on complex reasoning tasks. Trained via supervised fine-tuning of Phi-4 on carefully curated reasoning demonstrations from OpenAI o3-mini, Phi-4-reasoning generates detailed reasoning chains that make effective use of additional inference-time compute. The model demonstrates that meticulous data curation and high-quality synthetic datasets allow smaller models to compete with their bigger counterparts.
Building on Phi-4-reasoning's capabilities, Phi-4-reasoning-plus is further trained with reinforcement learning to use more inference-time compute, generating roughly 1.5 times as many tokens as Phi-4-reasoning to deliver higher accuracy.
Despite being significantly smaller, both models outperform OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most benchmarks, including mathematical reasoning and Ph.D.-level science questions. They also surpass the full DeepSeek-R1 model (with 671 billion parameters) on AIME 2025, the 2025 qualifier for the USA Math Olympiad.
Phi-4-reasoning models mark a significant improvement over Phi-4: they outperform larger models such as DeepSeek-R1-Distill-Llama-70B and approach the full DeepSeek-R1 across a wide range of reasoning and general capabilities, including math, coding, algorithmic problem solving, and planning. The technical report provides extensive quantitative evidence of these improvements across diverse reasoning tasks.
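To give a sense of how developers might try the model, here is a minimal sketch that queries Phi-4-reasoning through the Hugging Face transformers library. The repository id "microsoft/Phi-4-reasoning", the prompt, and the generation settings are illustrative assumptions; consult the published model card for the exact identifier and recommended chat format.

```python
# Minimal sketch (not an official example): querying Phi-4-reasoning with transformers.
# The repo id below is an assumption; verify it against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a chain of thought before the final answer,
# so allow a generous generation budget.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```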
Phi-4-mini-reasoning
Phi-4-mini-reasoning is designed to meet the demand for a compact reasoning model. This transformer-based language model is optimized for mathematical reasoning, delivering high-quality, step-by-step problem solving in environments with constrained compute or latency. Fine-tuned on synthetic data generated by the DeepSeek-R1 model, Phi-4-mini-reasoning balances efficiency with advanced reasoning ability. It is trained on more than one million diverse math problems spanning difficulty levels from middle school to Ph.D., making it well suited to educational applications, embedded tutoring, and lightweight deployment on edge or mobile systems.
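As a rough sketch of how the model could be used for step-by-step math tutoring, the snippet below runs Phi-4-mini-reasoning through the transformers pipeline API. The repo id "microsoft/Phi-4-mini-reasoning", the prompt, and the decoding settings are assumptions for illustration; check the model card for the exact identifier and recommended prompt format.

```python
# Minimal sketch: step-by-step math problem solving with Phi-4-mini-reasoning
# via the Hugging Face transformers pipeline API. Repo id is an assumption.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-reasoning",  # assumed Hugging Face repo id
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "A tank holds 240 liters and drains at 15 liters per minute. "
                   "How long until it is empty? Show your steps.",
    },
]

# The model reasons step by step before giving its final answer,
# so leave room for a longer completion than a typical chat reply.
outputs = generator(messages, max_new_tokens=1024, return_full_text=False)
print(outputs[0]["generated_text"])
```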
Phi reasoning models in action
Over the past year, Phi has continually pushed the boundaries of quality versus size, expanding the family with new features to meet diverse needs. The models can run locally on CPUs and GPUs across the full range of Windows 11 devices.
As Windows works to create a new kind of PC, Phi models have become an integral part of Copilot+ PCs through the NPU-optimized Phi Silica variant. This highly efficient, OS-managed version of Phi is designed to be preloaded in memory, delivering blazing-fast time to first token and power-efficient token throughput so it can run alongside other applications on the PC.
Phi Silica is exposed through developer APIs for easy integration into applications and is already used in several productivity experiences: Outlook relies on it for offline Copilot summary features, and Click to Do uses it to provide helpful text intelligence for any content on the screen. These small yet capable models have already been integrated and optimized across a variety of applications in the PC ecosystem. Phi-4-reasoning and Phi-4-mini-reasoning will soon be able to run on Copilot+ PC NPUs, taking advantage of the low-bit optimizations developed for Phi Silica.
Safety and Microsoft’s approach to responsible AI
Responsible AI is a core principle at Microsoft, guiding the development and deployment of AI systems, including the Phi models. Phi models are developed in accordance with Microsoft's AI principles: accountability, transparency, fairness, reliability and safety, privacy and security, and inclusiveness.
The Phi family of models employs a robust safety post-training strategy that combines Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These techniques draw on a variety of datasets, including publicly available datasets focused on helpfulness and harmlessness as well as a range of safety-related questions and answers.
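As a hedged illustration of one of these techniques, the sketch below shows what a DPO pass over helpfulness and harmlessness preference pairs can look like using the open-source trl library (trl 0.12 or later assumed). The base model id, the toy preference pairs, and the hyperparameters are placeholders for illustration only; this is not Microsoft's actual safety post-training pipeline or data.

```python
# Illustrative sketch of Direct Preference Optimization (DPO) with trl (>= 0.12 assumed).
# NOT the actual Phi safety post-training pipeline; model id, data, and
# hyperparameters below are placeholders.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model_id = "microsoft/Phi-4-mini-reasoning"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(base_model_id)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Toy preference pairs: each row has a prompt, a preferred ("chosen") response,
# and a dispreferred ("rejected") response, mirroring helpfulness/harmlessness data.
preference_pairs = Dataset.from_dict({
    "prompt": ["How can I bypass the license check in a paid app?"],
    "chosen": ["I can't help with circumventing software licensing. If cost is the "
               "issue, consider free or open-source alternatives."],
    "rejected": ["Sure, start by decompiling the binary and patching the check..."],
})

training_args = DPOConfig(
    output_dir="phi-dpo-safety-sketch",
    beta=0.1,                      # strength of the preference objective
    per_device_train_batch_size=1,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=preference_pairs,
    processing_class=tokenizer,    # older trl versions pass tokenizer= instead
)
trainer.train()
```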
Although the Phi family of models is designed to perform a wide range of tasks efficiently, it is important to recognize that all AI models can have limitations. The model cards below provide detailed information on responsible AI practices and guidelines to help you better understand these limitations and the measures being taken to address them.