Saturday, July 6, 2024

Monitoring NVIDIA GPUs on Compute Engine with Ops Agent

Businesses that use AI and ML for applications such as product recommendations, scientific computing, and gaming often turn to NVIDIA GPUs on Google Cloud for the computational power they need. To understand their workloads and streamline ML development, they need to monitor GPU performance metrics. To help, Ops Agent now collects metrics from NVIDIA GPUs on Compute Engine VMs.

Ops Agent is the telemetry agent that Google recommends for monitoring Compute Engine VM instances. It now gives you better insight into your NVIDIA GPUs and accelerated workloads, with key metrics from the NVIDIA Management Library (NVML) and advanced profiling data from NVIDIA Data Center GPU Manager (DCGM).

Ops Agent allows you to:

  • Use pre-built dashboards with GPU metrics to see how your GPU fleet is doing.
  • Identify underused GPUs and consolidate workloads to reduce costs.
  • Plan scaling by watching trends to decide when to add GPU capacity or upgrade existing GPUs.
  • Identify which ML models running on your GPUs use the most memory and processing resources.
  • Use the DCGM profiling metrics to locate GPU performance problems and bottlenecks.
  • Keep an eye on your GPU metrics.
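
As a rough illustration of the "identify underused GPUs" item above, the sketch below flags GPUs whose average utilization falls under a cutoff. The GPU identifiers, the sample readings, and the 20% threshold are all hypothetical; in practice the readings would come from the utilization metrics that Ops Agent sends to Cloud Monitoring.

```python
from statistics import mean


def find_underused_gpus(samples, threshold_percent=20.0):
    """Return IDs of GPUs whose mean utilization is below the threshold.

    samples maps a GPU identifier to a list of utilization readings (0-100).
    GPUs with no readings are skipped rather than reported.
    """
    underused = []
    for gpu_id, readings in samples.items():
        if readings and mean(readings) < threshold_percent:
            underused.append(gpu_id)
    return sorted(underused)


# Hypothetical readings for three GPUs on two VMs.
fleet = {
    "vm-a/gpu0": [5, 8, 12, 6],
    "vm-a/gpu1": [85, 90, 78, 92],
    "vm-b/gpu0": [15, 30, 10, 20],
}
print(find_underused_gpus(fleet))
```

A mean over a fixed window is the simplest possible rule; a real policy might also consider peak usage before consolidating workloads.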

Get key GPU metrics right away

If you use NVIDIA GPUs, you are probably familiar with the nvidia-smi command, which shows a summary of all GPU devices and the processes running on them. Ops Agent can collect these key metrics without additional configuration by using the same underlying API, NVML. This includes metrics for:

  • GPU utilization
  • GPU memory usage
  • Maximum GPU memory usage per process
  • Lifetime GPU utilization per process
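
To get a feel for the kind of numbers behind the metrics listed above, here is a minimal sketch that parses the CSV output of an nvidia-smi query. It is not how Ops Agent itself works (Ops Agent reads NVML directly); the function name and the sample values are made up for illustration, and the sketch does not handle fields that nvidia-smi reports as unavailable.

```python
import csv
import io

# The CSV parsed below matches the output of this nvidia-smi query:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits


def parse_gpu_csv(text):
    """Parse rows of: index, utilization %, memory used MiB, memory total MiB."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, util, used, total = (field.strip() for field in row)
        gpus.append({
            "index": int(index),
            "utilization_percent": float(util),
            "memory_used_mib": float(used),
            "memory_total_mib": float(total),
        })
    return gpus


# Made-up sample output for two GPUs.
SAMPLE_OUTPUT = "0, 35, 1024, 16384\n1, 0, 3, 16384\n"

for gpu in parse_gpu_csv(SAMPLE_OUTPUT):
    share = 100 * gpu["memory_used_mib"] / gpu["memory_total_mib"]
    print(f"GPU {gpu['index']}: {gpu['utilization_percent']:.0f}% busy, "
          f"{share:.1f}% memory used")
```

On a GPU VM, the sample string could be replaced with the captured output of the nvidia-smi command shown in the comment.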

Collect advanced GPU metrics with DCGM

DCGM is a suite of tools from NVIDIA for managing and monitoring NVIDIA GPUs at scale. It provides an API for advanced metrics that profile individual hardware components, such as streaming multiprocessors, interconnects like NVLink, and more. We have curated a set of these advanced metrics for the Ops Agent DCGM integration.

Monitor the health of your GPUs

You can easily query and visualize the GPU metrics collected by Ops Agent using the other services in Google Cloud’s operations suite. Use the Metrics Explorer query builder or PromQL to write queries, build custom charts, and add them to dashboards. The NVIDIA GPU Monitoring dashboard gives you a single view of your entire GPU fleet, using GPU metrics collected from both GKE GPU nodes and Compute Engine GPU VMs. For information on adding this dashboard to your project, see the documentation. The DCGM dashboard, which provides a focused view of the GPU profiling metrics, is added to your project automatically by the Cloud Monitoring DCGM integration once DCGM metrics collection starts.
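
When writing PromQL against Cloud Monitoring, metric types need to be converted into PromQL-style names. The sketch below applies the conversion rule as we understand it (first "/" becomes ":", other special characters become "_") and builds a simple averaging query. The metric type `agent.googleapis.com/gpu/utilization` and the `instance_id` label are assumptions for illustration; confirm both in Metrics Explorer for your project.

```python
import re


def _sanitize(part):
    """Replace every character that is not a letter, digit, or underscore."""
    return re.sub(r"[^A-Za-z0-9_]", "_", part)


def to_promql_name(metric_type):
    """Convert a Cloud Monitoring metric type into a PromQL metric name.

    Assumed rule: the first "/" becomes ":", and every other special
    character (including "." and later "/") becomes "_".
    """
    domain, sep, path = metric_type.partition("/")
    if not sep:
        raise ValueError(f"expected 'domain/path', got {metric_type!r}")
    return f"{_sanitize(domain)}:{_sanitize(path)}"


def average_by_label(metric_type, label):
    """Build a PromQL expression that averages the metric per label value."""
    return f"avg by ({label}) ({to_promql_name(metric_type)})"


# Assumed metric type and label; verify them before relying on the query.
print(average_by_label("agent.googleapis.com/gpu/utilization", "instance_id"))
```

The resulting string can be pasted into the PromQL tab of Metrics Explorer and then saved as a chart on a custom dashboard.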

A single agent for VM logging, tracing, and monitoring

Ops Agent is a robust, unified telemetry agent with an easy-to-use configuration interface that allows you to do more than simply see what your GPUs are doing:

  • Automatically collect host metrics such as CPU, memory, and process metrics.
  • Automatically collect system logs, such as syslog, from Windows and Linux VMs.
  • Collect data from your workloads using Prometheus and the OpenTelemetry Protocol (OTLP).
  • Use the logging files receiver to upload log files from your machine learning workloads to Cloud Logging.
  • Use metrics processors to filter out unneeded NVML and DCGM metrics and keep only the ones you need, and adjust the collection interval of those metrics in the configuration file.

And with only one agent to manage, you can focus more on using your GPU VMs.
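
To make the collection interval and filtering options from the list above more concrete, here is a sketch that renders an Ops Agent metrics configuration as YAML text. The key names (`hostmetrics`, `dcgm`, `exclude_metrics`, `metrics_pattern`, `collection_interval`) reflect our reading of the Ops Agent configuration format and should be treated as assumptions to check against the Ops Agent documentation; the function name and the example pattern are hypothetical.

```python
def render_ops_agent_config(interval_seconds, excluded_patterns):
    """Render a YAML fragment for the Ops Agent metrics configuration.

    Key names are assumptions based on the Ops Agent config format;
    verify them against the official documentation before deploying.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if excluded_patterns:
        items = "\n".join(f"        - {p}" for p in excluded_patterns)
        pattern_block = f"      metrics_pattern:\n{items}"
    else:
        pattern_block = "      metrics_pattern: []"
    return (
        "metrics:\n"
        "  receivers:\n"
        "    hostmetrics:\n"  # assumed to carry the NVML GPU metrics
        "      type: hostmetrics\n"
        f"      collection_interval: {interval_seconds}s\n"
        "    dcgm:\n"
        "      type: dcgm\n"
        f"      collection_interval: {interval_seconds}s\n"
        "  processors:\n"
        "    metrics_filter:\n"
        "      type: exclude_metrics\n"
        f"{pattern_block}\n"
        "  service:\n"
        "    pipelines:\n"
        "      default_pipeline:\n"
        "        receivers: [hostmetrics]\n"
        "        processors: [metrics_filter]\n"
        "      dcgm:\n"
        "        receivers: [dcgm]\n"
    )


# Hypothetical example: collect every 30 seconds, drop per-process GPU metrics.
config_text = render_ops_agent_config(30, ["agent.googleapis.com/gpu/processes/*"])
print(config_text)
```

Generating the file from a function like this is only a convenience for managing many VMs; the same YAML can be written by hand in the agent's configuration file.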

Get started

Want to try Ops Agent? If you create a VM in the Google Cloud console, we have made it easier to install Ops Agent during VM creation. This lets you try Ops Agent’s default configuration before deciding how to manage your VMs and Ops Agents at scale.

agarapuramesh