NVIDIA SMI Telegraf Input Plugin

The NVIDIA System Management Interface (SMI) is a command-line utility for managing NVIDIA Graphics Processing Unit (GPU) devices. A GPU is a processor designed for parallel processing and is frequently used in graphics and video rendering. NVIDIA SMI lets users query and modify the state of a GPU device. It can report query results as plain text or Extensible Markup Language (XML), written to a file or to another output. It’s meant to work with Tesla, GRID, Quadro, and Titan X products, but there is also limited support for other NVIDIA GPUs.

Why use a Telegraf plugin for NVIDIA SMI?

This plugin pulls GPU stats such as memory usage, GPU usage, and temperature from the NVIDIA SMI binary. This lets you carefully monitor your GPU device and quickly detect any problems that occur. You can send this data to InfluxDB to use its built-in tools to analyze this data over time. You can also set up alerts to detect changes in metrics, such as if the temperature of a device crosses an established threshold.

How to monitor NVIDIA SMI using the Telegraf plugin

To configure this plugin, set the path to your NVIDIA SMI binary or leave it at the default, /usr/bin/nvidia-smi. If the binary isn’t found at that path, the plugin tries to locate it on PATH (using exec.LookPath), and returns an error if that also fails. You can optionally set a timeout for GPU polling, for example timeout = "5s".

Key NVIDIA SMI metrics to use for monitoring

Some of the important NVIDIA SMI metrics that you should proactively monitor include:

measurement: nvidia_smi
- tags
  - name (type of GPU e.g. GeForce GTX 1070 Ti)
  - compute_mode (The compute mode of the GPU e.g. Default)
  - index (The port index where the GPU is connected to the motherboard e.g. 1)
  - pstate (Overclocking state for the GPU e.g. P0)
  - uuid (A unique identifier for the GPU e.g. GPU-f9ba66fc-a7f5-94c5-da19-019ef2f9c665)
- fields
  - fan_speed (integer, percentage)
  - fbc_stats_session_count (integer)
  - fbc_stats_average_fps (integer)
  - fbc_stats_average_latency (integer)
  - memory_free (integer, MiB)
  - memory_used (integer, MiB)
  - memory_total (integer, MiB)
  - power_draw (float, W)
  - temperature_gpu (integer, degrees C)
  - utilization_gpu (integer, percentage)
  - utilization_memory (integer, percentage)
  - utilization_encoder (integer, percentage)
  - utilization_decoder (integer, percentage)
  - pcie_link_gen_current (integer)
  - pcie_link_width_current (integer)
  - encoder_stats_session_count (integer)
  - encoder_stats_average_fps (integer)
  - encoder_stats_average_latency (integer)
  - clocks_current_graphics (integer, MHz)
  - clocks_current_sm (integer, MHz)
  - clocks_current_memory (integer, MHz)
  - clocks_current_video (integer, MHz)
  - driver_version (string)
  - cuda_version (string)
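
As a rough illustration of how these fields might be used once collected, the Go sketch below models a few of them and derives two simple checks: memory used as a share of total memory, and a temperature threshold test. The GPUSample type, its method names, and the sample values are all hypothetical and exist only for this example; they are not part of Telegraf or InfluxDB.

```go
package main

import "fmt"

// GPUSample is a hypothetical struct holding a small subset of the
// nvidia_smi tags and fields listed above.
type GPUSample struct {
	Name           string  // "name" tag
	MemoryUsedMiB  int     // memory_used field
	MemoryTotalMiB int     // memory_total field
	TemperatureC   int     // temperature_gpu field
	PowerDrawW     float64 // power_draw field
}

// MemoryUsedPercent returns memory_used as a percentage of memory_total.
// It returns 0 when the total is unknown, to avoid dividing by zero.
func (s GPUSample) MemoryUsedPercent() float64 {
	if s.MemoryTotalMiB == 0 {
		return 0
	}
	return 100 * float64(s.MemoryUsedMiB) / float64(s.MemoryTotalMiB)
}

// OverTemperature reports whether temperature_gpu is strictly above limitC.
func (s GPUSample) OverTemperature(limitC int) bool {
	return s.TemperatureC > limitC
}

func main() {
	// Illustrative values only, not real measurements.
	s := GPUSample{
		Name:           "GeForce GTX 1070 Ti",
		MemoryUsedMiB:  2048,
		MemoryTotalMiB: 8192,
		TemperatureC:   84,
		PowerDrawW:     120.5,
	}
	fmt.Printf("%s: %.1f%% memory used, over 80 C: %t\n",
		s.Name, s.MemoryUsedPercent(), s.OverTemperature(80))
}
```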

For more information, please check out the documentation.
