NVIDIA Mission Control software boosts the AI performance of DGX systems, improving GPU utilization and efficiency for training and inference.
Modern organizations need AI factories that quickly convert data into insights that are scalable, precise, and dependable, much as industrial factories turn raw materials into finished goods.
Planning this new infrastructure is far more complex than building steam-powered factories ever was. State-of-the-art models require supercomputing-scale resources, and any interruption risks lowering GPU utilization and setting back weeks of work.
At the NVIDIA GTC global AI conference, NVIDIA announced NVIDIA Mission Control, the only unified operations and orchestration software platform that automates the complex management of AI data centers and workloads, allowing developers and enterprises to manage and run AI factories at lightning speed.
NVIDIA Mission Control software improves every facet of AI factory operations. From configuring deployments to verifying infrastructure to running developer workloads, its features help businesses get frontier models up and running more quickly.
It is designed to move NVIDIA Blackwell-based systems smoothly from pretraining to post-training, and to scale test-time compute quickly and efficiently. By dynamically reallocating cluster resources as priorities change, the software lets businesses switch between training and inference workloads on their Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems.
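The idea of reallocating a fixed pool of GPUs as priorities shift can be illustrated with a minimal sketch. This is not Mission Control's scheduler or API. The function name `reallocate`, the workload names, and the priority weights are all hypothetical, and the proportional split with largest-remainder rounding is just one simple policy chosen for illustration.

```python
def reallocate(total_gpus, priorities):
    """Split total_gpus across workloads in proportion to priority weights.

    Hypothetical policy: proportional shares with largest-remainder rounding,
    so the integer shares always add up to total_gpus.
    """
    total_weight = sum(priorities.values())
    if total_weight <= 0:
        raise ValueError("at least one workload needs a positive priority")

    # Ideal (possibly fractional) share for each workload.
    exact = {name: total_gpus * w / total_weight for name, w in priorities.items()}
    # Round down first, then hand out whatever is left over.
    shares = {name: int(x) for name, x in exact.items()}
    leftover = total_gpus - sum(shares.values())

    # Workloads with the largest fractional part get the remaining GPUs.
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Daytime: training dominates. Later: inference demand rises.
daytime = reallocate(72, {"training": 3, "inference": 1})
evening = reallocate(72, {"training": 1, "inference": 2})
```

Calling the function again with new weights models a priority change: the same 72 GPUs are divided differently, and no GPU is left unassigned.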
NVIDIA Run:ai technology is also incorporated into Mission Control to improve infrastructure utilization by up to five times by streamlining operations and job orchestration for development, training, and inference.
Compared with conventional methods that depend on manual intervention, Mission Control’s autonomous recovery capabilities, backed by rapid checkpointing and automated tiered restart features, can deliver up to 10 times faster job recovery, improving the efficiency of AI training and inference and keeping AI applications running.
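To make the combination of checkpointing and tiered restarts concrete, here is a small, self-contained sketch of the general pattern. It does not reproduce Mission Control's implementation. The tier names in `RESTART_TIERS`, the JSON checkpoint format, the simulated faults, and every function name are assumptions made for illustration only.

```python
import json
import os
import tempfile


class SimulatedFault(Exception):
    """Stands in for a hardware or software failure during a training run."""


# Hypothetical escalation ladder: each repeated failure tries a heavier fix.
RESTART_TIERS = ["restart process", "swap node", "requeue job"]


def save_checkpoint(path, state):
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a truncated checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Start from step 0 when no checkpoint exists yet.
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, pending_faults):
    # Resume from the last saved step rather than from scratch.
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step in pending_faults:
            pending_faults.discard(step)  # each simulated fault fires once
            raise SimulatedFault(f"fault at step {step}")
        state = {"step": step, "loss": 1.0 / step}  # placeholder for real work
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state


def run_with_recovery(path, total_steps, checkpoint_every, fault_steps):
    pending_faults = set(fault_steps)
    actions = []  # recovery actions taken, in order
    while True:
        try:
            final = train(path, total_steps, checkpoint_every, pending_faults)
            return final, actions
        except SimulatedFault:
            if len(actions) == len(RESTART_TIERS):
                raise  # every tier exhausted; surface the failure
            actions.append(RESTART_TIERS[len(actions)])


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "checkpoint.json")
        return run_with_recovery(path, total_steps=20, checkpoint_every=5,
                                 fault_steps={7, 12})
```

In `demo()`, the run fails at steps 7 and 12, resumes from the checkpoints written at steps 5 and 10, and finishes all 20 steps without anyone stepping in. The more often checkpoints are written, the less work is repeated after each fault, which is why fast checkpointing matters for recovery time.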
Built on decades of NVIDIA supercomputing expertise, Mission Control reduces the time spent managing AI infrastructure and lets businesses run models with ease. It automates the lifecycle of AI factory infrastructure for all NVIDIA Blackwell-based NVIDIA DGX systems and for NVIDIA Grace Blackwell systems from Supermicro, Lenovo, Dell Technologies, and Hewlett Packard Enterprise (HPE), making cutting-edge AI infrastructure more widely available to industries worldwide.
Businesses can use Mission Control software with the NVIDIA Instant AI Factory service preconfigured in Equinix AI-ready data centers across 45 markets worldwide to significantly streamline and expedite deployments of NVIDIA DGX GB300 and DGX B300 systems.
Advanced Software Provides Enterprises Uninterrupted Infrastructure Oversight
To ensure continuous operations, Mission Control software automates end-to-end infrastructure management, including provisioning, monitoring, and error diagnostics. Additionally, it keeps an eye on all tiers of the infrastructure and application stack to anticipate and pinpoint the causes of inefficiencies and outages, saving money, time, and energy.
Other advantages of NVIDIA Mission Control software include:
- Simplified cluster setup and provisioning through new automation and standardized application programming interfaces (APIs), with integrated inventory management and visualizations that further speed deployment.
- Smooth workload orchestration that streamlines Slurm and Kubernetes workflows.
- Energy-optimized power profiles, with developer-selectable parameters, that balance power consumption and tune GPU performance for different workload types.
- Autonomous job recovery that detects, isolates, and recovers from inefficiencies without human involvement, maximizing developer productivity and infrastructure resilience.
- Customizable dashboards that track key performance metrics and provide access to vital cluster telemetry data.
- On-demand health checks that verify hardware and cluster performance throughout the infrastructure lifecycle.
- Integration with building management systems for better coordination and more control over cooling and power events, including rapid leak detection.
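As a rough illustration of what an on-demand health check over cluster telemetry might involve, the sketch below flags nodes whose readings fall outside set limits. It is a generic example, not the checks Mission Control actually runs. The telemetry fields, the thresholds, and the names `NodeTelemetry` and `health_check` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeTelemetry:
    # Hypothetical telemetry fields; a real cluster exposes many more.
    name: str
    gpu_temp_c: float
    ecc_errors: int
    interconnect_up: bool


def health_check(nodes, max_temp_c=85.0, max_ecc_errors=0):
    """Return a mapping of node name to a list of detected problems.

    An empty list means the node passed every check. The default
    thresholds are illustrative assumptions, not vendor guidance.
    """
    report = {}
    for node in nodes:
        problems = []
        if node.gpu_temp_c > max_temp_c:
            problems.append("GPU temperature above threshold")
        if node.ecc_errors > max_ecc_errors:
            problems.append("ECC error count above threshold")
        if not node.interconnect_up:
            problems.append("interconnect link down")
        report[node.name] = problems
    return report


cluster = [
    NodeTelemetry("node-01", gpu_temp_c=62.0, ecc_errors=0, interconnect_up=True),
    NodeTelemetry("node-02", gpu_temp_c=91.5, ecc_errors=0, interconnect_up=True),
    NodeTelemetry("node-03", gpu_temp_c=70.0, ecc_errors=3, interconnect_up=False),
]
report = health_check(cluster)
unhealthy = sorted(name for name, problems in report.items() if problems)
```

A report like this could feed a dashboard or decide which nodes to isolate before a job is scheduled on them.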
Base Command Manager Offers Free Kickstart for AI Cluster Management
To help businesses manage their infrastructure, NVIDIA Base Command Manager software is expected to soon be free for up to eight accelerators per system, regardless of cluster size, with NVIDIA Enterprise Support available for purchase separately.
NVIDIA Mission Control Software
Bringing the World’s Most Advanced AI Factory Expertise to Every Business
Backed by top-tier operations expertise and tools, NVIDIA Mission Control software supports every facet of AI factory operations, including infrastructure, facilities, and developer workloads. It provides full-stack intelligence that brings top-notch infrastructure resilience and instant agility to inference and training workloads, powering NVIDIA Blackwell data centers for the latest AI frontiers. By bringing hyperscale-grade AI performance to every enterprise, Mission Control speeds up AI experimentation.
Experience the Benefits of NVIDIA Mission Control
Instant Agility
Utilize superior cluster control, workload flexibility, and seamless orchestration to add agility to mission-critical tasks.
Hyperscale-Grade Efficiency
Get expert AI factory operations that automate processes, close important skills gaps, and run data centers intelligently around the clock.
Gold-Standard Infrastructure Resiliency
With proactive monitoring, quick failure detection, and ten times faster recovery times for training and inference runs, redefine infrastructure resilience.
Accelerated AI Experimentation
Optimize compute cycles and workload usage to increase developer productivity and set a new benchmark for enterprise AI at scale.
Availability
NVIDIA Mission Control is now available for DGX GB200 and B200 systems. NVIDIA GB200 NVL72 systems with Mission Control are coming from Dell, HPE, Lenovo, and Supermicro.
NVIDIA Mission Control is expected to be available later this year for the latest NVIDIA DGX GB300, B300, and GB300 NVL72 systems from prominent global vendors.