NVIDIA SMI Telegraf Input Plugin


The NVIDIA System Management Interface (SMI) is a command-line utility for managing and monitoring NVIDIA Graphics Processing Unit (GPU) devices. A GPU is a processor designed for parallel workloads and is frequently used for graphics and video rendering. NVIDIA SMI lets users query and modify GPU device state, and it can report query results as plain text or Extensible Markup Language (XML), either to standard output or to a file. It is meant to work with Tesla, GRID, Quadro, and Titan X products, but there is also limited support for other NVIDIA GPUs.

Why use a Telegraf plugin for NVIDIA SMI?

This plugin pulls GPU stats such as memory usage, GPU utilization, and temperature from the NVIDIA SMI binary. This lets you monitor your GPU devices closely and detect problems quickly. You can send the data to InfluxDB and use its built-in tools to analyze it over time. You can also set up alerts on metric changes, for example when the temperature of a device crosses an established threshold.
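
To make the kind of data involved more concrete, here is a minimal Python sketch that extracts a few of these stats from an XML report shaped like the one nvidia-smi prints with the -q -x flags. The sample document is hand-written and heavily trimmed, its values are illustrative only, and the helper names leading_int and parse_gpus are hypothetical. It is not the plugin's own parsing code.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample shaped like `nvidia-smi -q -x` output.
# Values are illustrative only, not captured from a real device.
SAMPLE_XML = """<nvidia_smi_log>
  <driver_version>430.50</driver_version>
  <gpu id="00000000:01:00.0">
    <product_name>GeForce GTX 1070 Ti</product_name>
    <fb_memory_usage>
      <total>8192 MiB</total>
      <used>1024 MiB</used>
    </fb_memory_usage>
    <utilization>
      <gpu_util>37 %</gpu_util>
    </utilization>
    <temperature>
      <gpu_temp>61 C</gpu_temp>
    </temperature>
  </gpu>
</nvidia_smi_log>"""


def leading_int(text):
    """Return the integer before the unit suffix ('61 C' -> 61), or None if absent or 'N/A'."""
    parts = (text or "").split()
    if not parts:
        return None
    try:
        return int(parts[0])
    except ValueError:
        return None


def parse_gpus(xml_text):
    """Return one dict per <gpu> element, keyed by the plugin's field names."""
    root = ET.fromstring(xml_text)
    gpus = []
    for gpu in root.findall("gpu"):
        gpus.append({
            "name": gpu.findtext("product_name"),
            "memory_total": leading_int(gpu.findtext("fb_memory_usage/total")),
            "memory_used": leading_int(gpu.findtext("fb_memory_usage/used")),
            "utilization_gpu": leading_int(gpu.findtext("utilization/gpu_util")),
            "temperature_gpu": leading_int(gpu.findtext("temperature/gpu_temp")),
        })
    return gpus


if __name__ == "__main__":
    for gpu in parse_gpus(SAMPLE_XML):
        print(gpu)
```

Values that nvidia-smi reports as N/A come back as None here, so downstream code can skip them rather than record a misleading zero.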

How to monitor NVIDIA SMI using the Telegraf plugin

To configure this plugin, set bin_path to the location of your NVIDIA SMI binary, or leave it at the default, /usr/bin/nvidia-smi. If no binary is found at that path, the plugin tries to locate nvidia-smi on PATH (using Go's exec.LookPath) and returns an error if that also fails. You can optionally set a timeout for GPU polling, for example timeout = "5s".
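
The lookup order described above can be sketched in a few lines. The plugin itself is written in Go, so this Python version is only an analogue: shutil.which stands in for exec.LookPath, and the function name resolve_smi_binary and its name parameter are hypothetical.

```python
import os
import shutil

# Default location named in the plugin's configuration.
DEFAULT_BIN_PATH = "/usr/bin/nvidia-smi"


def resolve_smi_binary(bin_path=DEFAULT_BIN_PATH, name="nvidia-smi"):
    """Return the configured path if a file exists there, else fall back to a PATH lookup."""
    if os.path.isfile(bin_path):
        return bin_path
    # Rough Python counterpart to Go's exec.LookPath.
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(f"{name} not found at {bin_path} or on PATH")
    return found


if __name__ == "__main__":
    try:
        print("using", resolve_smi_binary())
    except FileNotFoundError as err:
        print("lookup failed:", err)
```

Checking the explicit path first means a configured location always wins over whatever happens to be on PATH, which matters on hosts with more than one driver installation.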

Key NVIDIA SMI metrics to use for monitoring

Some of the important NVIDIA SMI metrics that you should proactively monitor include:

  • measurement: nvidia_smi
    • tags
      • name (type of GPU e.g. GeForce GTX 1070 Ti)
      • compute_mode (The compute mode of the GPU e.g. Default)
      • index (The port index where the GPU is connected to the motherboard e.g. 1)
      • pstate (Performance state of the GPU e.g. P0)
      • uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
    • fields
      • fan_speed (integer, percentage)
      • fbc_stats_session_count (integer)
      • fbc_stats_average_fps (integer)
      • fbc_stats_average_latency (integer)
      • memory_free (integer, MiB)
      • memory_used (integer, MiB)
      • memory_total (integer, MiB)
      • power_draw (float, W)
      • temperature_gpu (integer, degrees C)
      • utilization_gpu (integer, percentage)
      • utilization_memory (integer, percentage)
      • utilization_encoder (integer, percentage)
      • utilization_decoder (integer, percentage)
      • pcie_link_gen_current (integer)
      • pcie_link_width_current (integer)
      • encoder_stats_session_count (integer)
      • encoder_stats_average_fps (integer)
      • encoder_stats_average_latency (integer)
      • clocks_current_graphics (integer, MHz)
      • clocks_current_sm (integer, MHz)
      • clocks_current_memory (integer, MHz)
      • clocks_current_video (integer, MHz)
      • driver_version (string)
      • cuda_version (string)
For more information, please check out the documentation.
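
As a rough illustration of how these fields might feed a threshold check, the Python sketch below flags a sample whose temperature or memory use is too high. It uses the field names listed above. The function name check_gpu and both default thresholds are hypothetical placeholders, not recommended limits, and in practice this kind of rule would more likely live in your alerting tool.

```python
def check_gpu(fields, max_temp_c=85, max_mem_pct=90.0):
    """Return a list of warning strings for one nvidia_smi sample (a dict of fields)."""
    warnings = []

    temp = fields.get("temperature_gpu")
    if temp is not None and temp > max_temp_c:
        warnings.append(f"temperature_gpu {temp} C exceeds {max_temp_c} C")

    used = fields.get("memory_used")
    total = fields.get("memory_total")
    # Skip the memory check when either value is missing or total is zero.
    if used is not None and total:
        pct = 100.0 * used / total
        if pct > max_mem_pct:
            warnings.append(f"memory use {pct:.1f}% exceeds {max_mem_pct:.1f}%")

    return warnings


if __name__ == "__main__":
    # Illustrative values only.
    sample = {"temperature_gpu": 91, "memory_used": 8000, "memory_total": 8192}
    for warning in check_gpu(sample):
        print(warning)
```

Deriving a percentage from memory_used and memory_total keeps one threshold meaningful across GPUs with different amounts of memory.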
