AI is no longer only about producing text or images. It now encompasses deep reasoning, complex problem-solving, and powerful adaptability for practical applications in business, finance, customer service, and healthcare.
NVIDIA's latest Llama Nemotron Ultra reasoning model is now available. It delivers the highest accuracy among open-source models across intelligence and coding benchmarks while increasing compute efficiency. The model, its weights, and its training data are available on Hugging Face, ready to power AI workflow automation, research assistants, and coding copilots.
NVIDIA Llama Nemotron Ultra excels at coding, math, and science
Llama Nemotron Ultra is redefining what AI can do in scientific reasoning, coding, and math benchmarks. Built for real-world industry demands, from copilots and knowledge assistants to automated workflows, the model offers the depth and adaptability that high-impact AI requires. It is post-trained for complex reasoning, human-aligned chat, retrieval-augmented generation (RAG), and tool use.
Llama Nemotron Ultra builds on Llama 3.1, using a combination of synthetic and commercial-grade data along with sophisticated training methods. Designed for agentic workflows, it delivers affordable, high-performance AI with robust reasoning capabilities. To support the broader development of reasoning models, NVIDIA has publicly released two high-quality datasets used in post-training.
These resources give the community a head start in creating models that are both cost-effective and high-performing. Their efficacy was demonstrated by the NVIDIA team that recently won first place in the Kaggle AI Mathematical Olympiad, a competitive reasoning benchmark. The data, techniques, and insights from that effort were then applied to Llama Nemotron Ultra. The following sections examine three benchmarks in detail.
GPQA Diamond benchmark
As shown in Figures 1, 2, and 3, the Llama Nemotron Ultra reasoning model outperforms other open models on a scientific reasoning benchmark. The GPQA Diamond benchmark consists of 198 carefully constructed questions in biology, physics, and chemistry, written by PhD-level experts.
These graduate-level questions demand deep understanding and multistep reasoning, going far beyond simple recall or shallow inference. While humans with PhDs typically achieve about 65% accuracy on this difficult subset, Llama Nemotron Ultra reached 76%, setting a new bar and establishing itself as the top open model for scientific reasoning. This result is reflected on the Vellum and Artificial Analysis leaderboards.



LiveCodeBench benchmark
As shown in Figures 4 and 5, Llama Nemotron Ultra also delivers exceptional performance on LiveCodeBench, a reliable benchmark designed to evaluate real-world coding skills, in addition to its strong results on advanced science benchmarks. LiveCodeBench covers broad coding activities, including code generation, debugging, self-repair, test output prediction, and execution.
Every problem in LiveCodeBench is date-stamped to guarantee a fair, out-of-distribution evaluation. By prioritizing genuine problem-solving over mere code output, it tests true generalization. This result appears on both the LiveCodeBench GitHub leaderboard and the Artificial Analysis leaderboard.
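The date-stamping idea is simple to illustrate: given a model's training cutoff, only problems released afterward are used for evaluation, so the model cannot have seen them during training. A minimal sketch (the cutoff date and problem records below are hypothetical):

```python
# Sketch: LiveCodeBench date-stamps each problem so evaluation can be
# restricted to problems released after a model's training cutoff,
# giving a contamination-free, out-of-distribution test set.
from datetime import date

problems = [
    {"id": "p1", "released": date(2023, 9, 1)},
    {"id": "p2", "released": date(2024, 5, 15)},
    {"id": "p3", "released": date(2024, 8, 2)},
]

training_cutoff = date(2024, 1, 1)  # hypothetical model cutoff

# Keep only problems the model could not have seen during training.
held_out = [p for p in problems if p["released"] > training_cutoff]
print([p["id"] for p in held_out])  # → ['p2', 'p3']
```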


AIME benchmark
On the AIME benchmark, which is frequently used to assess mathematical reasoning skills, Llama Nemotron Ultra outperforms other open models. See the live LLM leaderboard.

Open datasets and tools
Llama Nemotron's open design is among its most important accomplishments. NVIDIA AI released not only the model itself but also two key, commercially viable datasets that shaped its reasoning abilities. These datasets are now trending at the top of Hugging Face Datasets.
The OpenCodeReasoning dataset comprises over 735K Python samples drawn from 28K distinct problems on well-known competitive programming platforms. Built specifically for supervised fine-tuning (SFT), it lets enterprise developers instill sophisticated reasoning skills in their own models. By using OpenCodeReasoning, organizations can improve the problem-solving ability of their AI systems, yielding smarter and more resilient code solutions.
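To use such a dataset for SFT, each record is typically reshaped into a prompt/completion pair. A minimal sketch is below; the field names `"input"` and `"output"` are assumptions about the record schema (check the dataset card on Hugging Face for the actual columns), and the sample record is invented for illustration:

```python
# Sketch: shaping a competitive-programming record into an SFT example.
# Field names "input"/"output" are ASSUMED, not taken from the dataset card.
def to_sft_example(record):
    """Wrap the problem statement in an instruction and pair it with the
    reference solution, the usual prompt/completion format for SFT."""
    prompt = (
        "Solve the following competitive programming problem in Python.\n\n"
        + record["input"]
    )
    return {"prompt": prompt, "completion": record["output"]}

# Invented sample record, standing in for a real dataset row.
sample = {
    "input": "Given an integer n, print the sum 1 + 2 + ... + n.",
    "output": "n = int(input())\nprint(n * (n + 1) // 2)",
}
pair = to_sft_example(sample)
print(pair["prompt"].splitlines()[0])
# → Solve the following competitive programming problem in Python.
```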
The Llama-Nemotron-Post-Training dataset was created synthetically using open and publicly accessible models, including the DeepSeek-R1 models, the Nemotron family, the Qwen family, and Llama. Designed to improve performance on key reasoning tasks, it is well suited to strengthening general reasoning, math, coding, and instruction-following skills. It gives developers a practical tool for tuning models to understand and respond to intricate, multistep instructions, helping them build more capable and coherent AI systems.
By making these datasets freely available on Hugging Face, NVIDIA aims to democratize the training of reasoning models. Now that startups, research labs, and businesses can access the same resources as NVIDIA's internal teams, the broader adoption of agentic AI, which can reason, plan, and act on its own within complex workflows, is accelerating.
Enterprise-ready features: Speed, accuracy, and flexibility
A commercially viable model, Llama Nemotron Ultra can be applied to a range of agentic AI use cases, including task-oriented assistants, autonomous research agents, customer support chatbots, and coding copilots. Its outstanding performance on scientific reasoning and coding benchmarks makes it a strong foundation for real-world applications that demand accuracy, flexibility, and multistep problem-solving.
In the open reasoning-model class, Llama Nemotron Ultra provides the highest throughput together with the best accuracy, and its throughput (efficiency) translates directly into cost savings. To make the model practical to run in a data center setting, a Neural Architecture Search (NAS) technique significantly lowered its memory footprint while maintaining performance, enabling larger workloads on fewer GPUs.
The model then went through a thorough post-training pipeline that combined supervised fine-tuning with reinforcement learning (RL) to enhance its capabilities, making it strong at both reasoning and non-reasoning tasks. With the model's reasoning "On" and "Off" toggle, businesses can invoke reasoning only when necessary, lowering overhead for simpler, non-agentic tasks.
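In practice, the toggle is applied per request. A minimal sketch of building the chat messages for both modes is below; the exact system-prompt strings ("detailed thinking on"/"detailed thinking off") are taken as an assumption here, so verify them against the published model card before relying on them:

```python
# Sketch: per-request reasoning toggle for Llama Nemotron Ultra.
# ASSUMPTION: reasoning is switched via the system prompt strings
# "detailed thinking on" / "detailed thinking off"; confirm with the
# model card on Hugging Face.
def build_messages(user_prompt, reasoning):
    """Return a chat-completions message list with the reasoning mode
    selected through the system prompt."""
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

# Agentic, multistep task: turn reasoning on.
agent_msgs = build_messages("Plan a three-step data-cleaning pipeline.", True)
# Simple lookup: turn reasoning off to cut latency and token overhead.
chat_msgs = build_messages("What is the capital of France?", False)
print(agent_msgs[0]["content"], "|", chat_msgs[0]["content"])
# → detailed thinking on | detailed thinking off
```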