Nvidia SMI and Librato Integration

Powerful performance with an easy integration, powered by Telegraf, the open source data connector built by InfluxData.

Note: This is not the recommended configuration for real-time queries at scale. For query and compression optimization, high-speed ingest, and high availability, consider Nvidia SMI and InfluxDB.

Powerful Performance, Limitless Scale

Collect, organize, and act on massive volumes of high-velocity data with InfluxDB, the #1 time series platform built to scale with Telegraf. Any data is more valuable when you think of it as time series data.

Input and output integration overview

The Nvidia SMI Plugin enables the retrieval of detailed statistics about NVIDIA GPUs attached to the host system, providing essential insights for performance monitoring.

The Librato plugin for Telegraf is designed to facilitate seamless integration with the Librato Metrics API, allowing for efficient metric reporting and monitoring.

Integration details

Nvidia SMI

The Nvidia SMI plugin gathers metrics on the performance and status of NVIDIA GPUs attached to the host machine. It runs the nvidia-smi command-line tool and reports GPU memory utilization, temperature, fan speed, and other performance metrics for each GPU. This data supports real-time monitoring of GPU health in environments where GPU performance directly affects the workload, such as machine learning, 3D rendering, and high-performance computing. Two options make the plugin adaptable: bin_path sets the location of the nvidia-smi binary, which differs between Linux and Windows systems, and timeout bounds how long each GPU poll may take. These per-GPU statistics support proactive management and performance tuning of infrastructure that relies on NVIDIA hardware.

Librato

The Librato plugin enables Telegraf to send metrics to the Librato Metrics API. To authenticate, provide an api_user and api_token, both available from the Librato account settings. The source_tag option can attach contextual information from Point Tags to the metrics; however, the plugin does not currently send the associated Point Tags themselves. Any point value that cannot be converted to a float64 is skipped, so only numeric metrics reach Librato. The plugin also supports secret-store options for managing authentication credentials securely.
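
The float64 rule can be illustrated with a short Python sketch. It is not the plugin's implementation (Telegraf is written in Go). The function names, the sample field names, and the choice of which Python types count as convertible are assumptions made for illustration only.

```python
def to_gauge_value(value):
    """Return value as a float, or None if this sketch treats it as non-convertible.

    Assumption: numbers and booleans convert; everything else (for example
    strings) does not.
    """
    if isinstance(value, bool):
        # bool is checked first because bool is a subclass of int in Python.
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    return None


def filter_fields(fields):
    """Keep only the fields whose values convert to a float; skip the rest."""
    kept = {}
    for name, value in fields.items():
        converted = to_gauge_value(value)
        if converted is None:
            continue  # skipped, mirroring the rule described above
        kept[name] = converted
    return kept


if __name__ == "__main__":
    # Hypothetical fields from a single metric point.
    sample = {"temperature_gpu": 61, "name": "example-gpu", "utilization_gpu": 12.5}
    print(filter_fields(sample))
```

In this sketch the string-valued field is dropped and the two numeric fields are kept as floats.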

Configuration

Nvidia SMI

[[inputs.nvidia_smi]]
  ## Optional: path to the nvidia-smi binary; defaults to "/usr/bin/nvidia-smi".
  ## The plugin first looks for the binary at the explicitly specified (or default) path.
  ## If it is not found there, the plugin searches PATH (exec.LookPath); if it is still
  ## not found, an error is returned.
  # bin_path = "/usr/bin/nvidia-smi"

  ## Optional: timeout for GPU polling
  # timeout = "5s"
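
The lookup order described in the bin_path comments (configured path first, then PATH, then an error) can be sketched in Python as follows. This is an illustration of the order only, not the plugin's Go code. The function name locate_nvidia_smi and the search_path parameter are hypothetical, and shutil.which stands in for Go's exec.LookPath.

```python
import os
import shutil


def locate_nvidia_smi(bin_path="/usr/bin/nvidia-smi", search_path=None):
    """Return a path to nvidia-smi, or raise FileNotFoundError.

    search_path is a hypothetical parameter: None means "use the PATH
    environment variable"; a string overrides it (useful for testing).
    """
    # Step 1: try the explicitly configured (or default) path.
    if os.path.isfile(bin_path):
        return bin_path
    # Step 2: fall back to searching PATH for the executable.
    found = shutil.which("nvidia-smi", path=search_path)
    if found is not None:
        return found
    # Step 3: nothing found, so report an error.
    raise FileNotFoundError(
        f"nvidia-smi not found at {bin_path!r} or on the search path"
    )


if __name__ == "__main__":
    try:
        print(locate_nvidia_smi())
    except FileNotFoundError as err:
        print(f"lookup failed: {err}")
```

On a host without NVIDIA drivers installed, the sketch ends in the error branch, which corresponds to the final case in the comments.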

Librato

[[outputs.librato]]
  ## Librato API Docs
  ## http://dev.librato.com/v1/metrics-authentication
  ## Librato API user
  api_user = "user@example.com" # required; placeholder address
  ## Librato API token
  api_token = "my-secret-token" # required.
  ## Debug
  # debug = false
  ## Connection timeout.
  # timeout = "5s"
  ## Output source Template (same as graphite buckets)
  ## see https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md#graphite
  ## This template is used in librato's source (not metric's name)
  template = "host"

Input and output integration examples

Nvidia SMI

  1. Real-Time GPU Monitoring for ML Training: Continuously monitor GPU utilization and memory usage during machine learning model training. Data scientists can then confirm that their GPUs are neither overutilized nor underutilized, adjust resource allocation, and spot performance bottlenecks in real time.

  2. Automated Alerts for Overheating GPUs: Implement a system using the Nvidia SMI plugin to track GPU temperatures and set alerts for instances where temperatures exceed safe thresholds. This proactive monitoring can prevent hardware damage and improve system reliability by alerting administrators to potential cooling issues before they result in failure.

  3. Performance Baselines for GPU Resources: Establish baseline performance metrics for your GPU resources. By regularly collecting data and analyzing trends in GPU usage, organizations can identify anomalies and optimize their workloads accordingly, leading to enhanced operational efficiency.

  4. Dockerized GPU Usage Insights: In a containerized environment, use the plugin to monitor GPU performance from within a Docker container. This allows developers to track GPU performance of their applications in production, facilitating troubleshooting and performance optimization within isolated environments.
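
As a rough illustration of the overheating alert in example 2, the following Python sketch checks a batch of per-GPU readings against a temperature threshold. The reading layout, the field names index and temperature_gpu, and the 85 °C default are assumptions chosen for the example. A real deployment would take its threshold from the hardware vendor's guidance and would usually run the check in an alerting system rather than in a script.

```python
def find_overheating(readings, threshold_c=85.0):
    """Return (gpu_index, temperature) pairs whose temperature exceeds threshold_c.

    readings is a list of dicts, one per GPU (hypothetical layout).
    """
    alerts = []
    for reading in readings:
        temp = reading.get("temperature_gpu")
        if temp is None:
            continue  # this reading carries no temperature, so it cannot alert
        if temp > threshold_c:
            alerts.append((reading["index"], temp))
    return alerts


if __name__ == "__main__":
    # Hypothetical readings for two GPUs.
    sample = [
        {"index": "0", "temperature_gpu": 61},
        {"index": "1", "temperature_gpu": 91},
    ]
    for gpu_index, temp in find_overheating(sample):
        print(f"GPU {gpu_index} is at {temp} C, above the threshold")
```

The same comparison could be applied to fan speed or memory usage by swapping the field name and threshold.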

Librato

  1. Real-time Application Monitoring: Utilize Librato to collect performance metrics from a web application in real-time. This setup involves sending response times, error rates, and user interactions to Librato, allowing developers to monitor the application’s health and performance metrics closely. By analyzing these metrics, teams can quickly identify and address performance bottlenecks or application failures before they impact end users.

  2. Infrastructure Metrics Aggregation: Leverage this plugin to gather and send metrics from various infrastructure components, such as servers or containers, to Librato for centralized monitoring. Sending CPU, memory, and disk I/O metrics gives system administrators a comprehensive view of infrastructure performance, which helps with capacity planning and resource optimization.

  3. Custom Metrics for Business Operations: Feed business-specific metrics, such as sales transactions or user sign-ups, to the Librato service using this plugin. By tracking these custom metrics, businesses can gain insights into their operational performance and make data-driven decisions to enhance their strategies, marketing efforts, or product development initiatives.

  4. Anomaly Detection in Metrics: Implement monitoring tools that utilize machine learning for anomaly detection. By continuously sending real-time metrics to Librato, teams can analyze trends and automatically flag unusual behavior, such as sudden spikes in latency or unusual traffic patterns, enabling timely intervention and troubleshooting.
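
To make the anomaly detection idea in example 4 concrete, here is a deliberately simple Python sketch. It uses a z-score check against recent history, not machine learning, and it is independent of Librato's own tooling. The function name, the threshold of 3 standard deviations, and the sample latencies are all assumptions for illustration.

```python
import statistics


def is_anomalous(history, value, z_threshold=3.0):
    """Return True if value lies more than z_threshold sample standard
    deviations from the mean of history."""
    if len(history) < 2:
        return False  # too little history to estimate spread
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # A perfectly flat history: any different value is a deviation.
        return value != mean
    return abs(value - mean) / spread > z_threshold


if __name__ == "__main__":
    # Hypothetical request latencies in milliseconds.
    recent_latencies = [100, 102, 98, 101, 99]
    print(is_anomalous(recent_latencies, 250))
    print(is_anomalous(recent_latencies, 101))
```

A z-score check like this is easy to reason about but is sensitive to outliers already present in the history; it serves here only as a starting point.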

Feedback

Thank you for being part of our community! If you have any general feedback or found any bugs on these pages, we welcome and encourage your input. Please submit your feedback in the InfluxDB community Slack.
