Latest Multimodal Addition to Microsoft's Phi Small Language Models (SLMs), Trained on NVIDIA GPUs
Microsoft small language models (SLMs)
Large language models (LLMs) have impacted every industry and changed what is possible with technology. Because of their enormous size, however, they are impractical given the resource constraints many businesses face today.
Small language models (SLMs) are gaining popularity because they deliver capable models with a much smaller resource footprint. SLMs are a subset of language models built with simpler neural networks, often concentrating on particular domains. As models evolve to reflect how people perceive the world, they must adapt to accept multiple kinds of multimodal input.
Microsoft has announced two new additions to the latest generation of open SLMs in the Phi family:
- Phi-4-mini
- Phi-4-multimodal
Phi-4-multimodal is the first multimodal model to join the family, accepting text, audio, and image inputs.
These models are small enough to be deployed on-device. This release builds on the December 2024 research-only release of the 14B-parameter Phi-4 SLM and makes the two new, smaller models available for commercial use.
The new models are available through Azure AI Foundry, Microsoft's cloud AI platform for building, customizing, and managing AI agents and applications.
You can try every member of the Phi family in the NVIDIA API Catalog, the first sandbox environment to support all of Phi-4-multimodal's modalities and tool calling. To integrate the model into your applications right away, use the preview NIM microservice.
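For example, the preview NIM microservice is exposed through an OpenAI-compatible API. The snippet below is a minimal sketch of calling it from Python; the endpoint URL and model identifier are assumptions, so use the exact values shown in the catalog alongside your API key.

```python
# Minimal sketch: calling the Phi-4-multimodal preview NIM through an
# OpenAI-compatible endpoint. The base_url and model ID below are
# assumptions -- copy the exact values from build.nvidia.com.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed catalog endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

completion = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",     # assumed model ID
    messages=[{"role": "user", "content": "Summarize the benefits of SLMs."}],
    temperature=0.2,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```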
Why invest in SLMs?
SLMs make generative AI possible in environments with limited memory and compute. For instance, SLMs can be deployed directly on many consumer-grade devices, including smartphones. For use cases that must comply with regulatory standards, on-device deployment can help with privacy and compliance.
SLMs also offer advantages over LLMs of comparable quality, such as lower latency thanks to their inherently faster inference. SLMs tend to perform better on specialized tasks aligned with their training data. To improve generalization and flexibility across tasks, you can build effective agentic systems using native function calling or retrieval-augmented generation (RAG), as sketched below.
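To make the RAG pattern concrete, here is a minimal sketch in plain Python: a toy keyword retriever picks a grounding snippet, which is then prepended to the prompt. A production system would swap in an embedding model and a vector database; the documents and query here are illustrative only.

```python
# Toy RAG sketch: pick the document that best matches the query, then
# use it as grounding context for the SLM. Keyword overlap stands in
# for real embedding-based retrieval.

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q_words = set(query.lower().split())
    return max(docs, key=lambda d: len(q_words & set(d.lower().split())))

docs = [
    "Phi-4-mini has 3.8B parameters and a 128K token context window.",
    "Phi-4-multimodal accepts text, image, and audio inputs.",
]
query = "How many parameters does Phi-4-mini have?"
context = retrieve(query, docs)

# The retrieved snippet grounds the model's answer; `messages` can be
# sent with the same OpenAI-compatible client as in the earlier snippet.
messages = [
    {"role": "system", "content": f"Answer using only this context: {context}"},
    {"role": "user", "content": query},
]
print(messages)
```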
Phi-4-multimodal
Phi-4-multimodal reasons over text, image, and audio inputs and has 5.6B parameters, supporting use cases such as visual reasoning, OCR, translation, multimodal summarization, and automatic speech recognition (ASR). The model was trained on 512 NVIDIA A100-80GB GPUs over 21 days.
In the NVIDIA API Catalog, you can upload your image data and run visual Q&A with Phi-4-multimodal, as illustrated in Figure 1. You can also observe how adjusting settings such as temperature, sampling values, and token limits changes the output. To simplify integrating the model into your applications, you can generate sample code in Python, JavaScript, and Bash.
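As a rough sketch of what that generated Python looks like for visual Q&A, the snippet below sends a local image along with the sampling controls mentioned above. The inline data-URL convention and model ID are assumptions; the catalog's generated sample code shows the exact format for this model.

```python
# Visual Q&A sketch against an assumed Phi-4-multimodal endpoint. The
# image is base64-encoded into an inline data URL; verify the exact
# image-passing format in the catalog's generated sample code.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

with open("chart.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

completion = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",  # assumed model ID
    messages=[{
        "role": "user",
        "content": f'What trend does this chart show? <img src="data:image/png;base64,{b64}" />',
    }],
    temperature=0.2,  # the same controls exposed in the catalog UI
    top_p=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```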

A collection of prebuilt agents also demonstrates tool calling. Figure 2 shows a tool that retrieves real-time weather data.
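Under the hood, tool calling like this typically follows the OpenAI-style `tools` flow: the model returns structured arguments for a declared function instead of free text, and the application executes the function and feeds the result back. The sketch below is a hypothetical version of the weather tool in Figure 2; the function name and schema are illustrative, not the catalog's actual definitions.

```python
# Hypothetical sketch of the tool-calling flow behind the weather agent.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
    api_key="YOUR_NVIDIA_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, stubbed out here
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="microsoft/phi-4-multimodal-instruct",  # assumed model ID
    messages=[{"role": "user", "content": "What's the weather in Santa Clara?"}],
    tools=tools,
)

# If the model chose to call the tool, it returns the function name and
# JSON arguments; the application runs the function and replies.
msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)
```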

Phi-4-mini
Phi-4-mini is a dense, text-only, decoder-only Transformer model with 3.8B parameters that is optimized for chat and has a 128K-token long-form context window. The model was trained on 1,024 NVIDIA A100-80GB GPUs over 14 days.
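Because the model is this small, it can also run locally. Below is a minimal sketch using Hugging Face transformers; the checkpoint ID is an assumption, so verify the exact name on the Hugging Face Hub.

```python
# Minimal local-inference sketch for Phi-4-mini with transformers.
# The checkpoint ID is an assumption -- verify it on the Hugging Face Hub.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",  # assumed checkpoint ID
    device_map="auto",                      # uses a GPU if one is available
)

messages = [{"role": "user", "content": "Explain context windows in one sentence."}]
result = generator(messages, max_new_tokens=64)
print(result[0]["generated_text"][-1]["content"])  # the assistant's reply
```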
The training data for both models deliberately concentrates on high-quality educational content and code, giving the data a textbook-like quality. Benchmark data for speech, vision, and text is available in the model cards.
Promoting community models
NVIDIA is an active participant in the open-source ecosystem and has released several hundred projects under open-source licenses. NVIDIA is dedicated to advancing open models and community software such as Phi, which promotes AI transparency and lets users broadly share work on AI safety and resilience.
Using the NVIDIA NeMo platform, these open models can be customized on proprietary data to be highly effective and optimized for a variety of AI workloads in any industry.
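As a generic illustration of that kind of customization (not NeMo's own workflow; see the NeMo documentation for the supported recipes), the sketch below applies LoRA adapters with Hugging Face PEFT, so only a small set of low-rank weights is trained rather than all 3.8B parameters.

```python
# Generic LoRA customization sketch using Hugging Face PEFT as a
# stand-in illustration -- not NVIDIA NeMo's own recipe. The checkpoint
# ID and target module names are assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

lora = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    target_modules=["q_proj", "v_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only a small fraction of the weights
```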
NVIDIA and Microsoft have a long-standing partnership that includes research collaborations spanning generative AI to healthcare and life sciences, integrations and optimizations for PC developers using NVIDIA RTX GPUs, and other work advancing innovation on GPUs in Azure.
Start now
Visit build.nvidia.com/microsoft to test Phi-4 on the NVIDIA-accelerated platform with your data.
On the first multimodal sandbox for Phi-4-multimodal, you can experiment with text, image, and audio inputs, as well as sample tool calls, to see how the model will perform for you in production.