NVIDIA SMI Telegraf Input Plugin
The NVIDIA System Management Interface (SMI) is a command line utility that helps with managing NVIDIA Graphics Processing Unit (GPU) devices. A GPU is a kind of computing technology designed for parallel processing that is frequently used in graphics and video rendering. The NVIDIA SMI lets users query and modify a GPU device state. It can report information returned from queries as plain text or Extensible Markup Language (XML) to a file or some other output. It’s meant to work with Tesla, GRID, Quadro, and Titan X products, but there is also limited support on other NVIDIA GPUs.
Why use a Telegraf plugin for NVIDIA SMI?
This plugin pulls GPU stats such as memory usage, GPU usage, and temperature from the NVIDIA SMI binary. This lets you monitor your GPU devices closely and quickly detect any problems that occur. You can send this data to InfluxDB and use its built-in tools to analyze it over time. You can also set up alerts on changes in metrics, for example when the temperature of a device crosses an established threshold.
How to monitor NVIDIA SMI using the Telegraf plugin
To configure this plugin, you can set the path to your NVIDIA SMI binary or leave it at the default, /usr/bin/nvidia-smi. If the binary isn’t found at that path, the plugin will try to locate it on PATH (using exec.LookPath), and if that also fails it will return an error. You can optionally set a timeout for GPU polling, for example timeout = "5s" for a five-second limit.
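As a rough illustration, the two settings described above might appear in a Telegraf configuration file as shown below. This is a minimal sketch rather than the authoritative sample configuration: the table name inputs.nvidia_smi and the option name bin_path do not appear in the text above and are assumed to follow Telegraf's usual plugin naming, so check them against the sample configuration shipped with your Telegraf version.

```toml
# Sketch only: the section name and "bin_path" are assumptions, verify them
# against the sample configuration for your Telegraf version.
[[inputs.nvidia_smi]]
  # Optional: path to the nvidia-smi binary. This is the default location;
  # if nothing is found here, the plugin falls back to searching PATH.
  bin_path = "/usr/bin/nvidia-smi"

  # Optional: how long to wait for GPU polling before giving up.
  timeout = "5s"
```

Both options can be left out if the binary sits at the default location or is reachable on PATH.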
Key NVIDIA SMI metrics to use for monitoring
Some of the important NVIDIA SMI metrics that you should proactively monitor include:
name (type of GPU, e.g. GeForce GTX 1070 Ti)
compute_mode (the compute mode of the GPU)
index (the port index where the GPU is connected to the motherboard)
pstate (overclocking state for the GPU)
uuid (a unique identifier for the GPU)
temperature_gpu (integer, degrees C)