Red Hat Uses InfluxDB to Collect gNMI Data for Internal Network Monitoring


Red Hat is a global leader in open source enterprise IT solutions, with a portfolio of products that includes hybrid cloud infrastructure, middleware, cloud-native applications, and automation solutions.

The business challenge

Managing Red Hat’s enterprise IT infrastructure is a massive undertaking that involves monitoring the backbone that supports over 40,000 employees in forty different countries. Red Hat’s internal network monitoring team monitors over sixty of the company’s 105 global office locations. In total, that amounts to more than 1,600 devices and more than 14,000 interfaces.

Monitoring at Red Hat revolves around performance metrics and visualizations. The network monitoring team wants to know how the network infrastructure is performing, and they visualize that data to better understand that performance.

To achieve this worldwide infrastructure observability, the network monitoring team sought to build a monitoring solution that functioned as a single source of truth for the network. To do so, they needed to collect data from across the globe, including device availability (e.g., ping, HTTP, DNS), query speed, HTTP response times and status codes, external link utilization, latency, and more.

The technical challenge

With so many different interfaces and devices across the globe, the Red Hat team needed to collect data from a wide range of data sources using the most efficient protocols available. That data had to support their observability goals: visualizing network performance, generating alerts, creating network maps, and monitoring network bandwidth.

One of the biggest challenges is that SNMP remains very common in network monitoring. However, SNMP has several key limitations, so the team is moving to the gRPC Network Management Interface (gNMI), which originated at Google, wherever possible. gNMI supports more granular data collection intervals, and can collect data types and metrics that SNMP cannot.

However, not every device in Red Hat’s environment supports gNMI, so how did the company bring everything together for its single source of truth?

The solution

Red Hat runs an enterprise instance of InfluxDB, which is a critical piece of their network monitoring architecture. Red Hat uses Telegraf with the appropriate SNMP or gNMI plugin to collect data directly from network devices. They collect data via gNMI whenever possible, but some devices only support SNMP or are still in the process of adding gNMI support, so the data from these devices comes in via SNMP.
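The "gNMI whenever possible, SNMP otherwise" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Device record, its supports_gnmi flag, and the device names are hypothetical and are not taken from Red Hat's inventory or tooling. The strings "gnmi" and "snmp" simply mirror the names of the two Telegraf input plugins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    supports_gnmi: bool  # hypothetical inventory flag, assumed for this sketch


def choose_input(device: Device) -> str:
    """Prefer gNMI when the device supports it; otherwise fall back to SNMP."""
    return "gnmi" if device.supports_gnmi else "snmp"


def group_by_input(devices: list[Device]) -> dict[str, list[str]]:
    """Group device names by the collection protocol chosen for each one."""
    groups: dict[str, list[str]] = {"gnmi": [], "snmp": []}
    for device in devices:
        groups[choose_input(device)].append(device.name)
    return groups


# Hypothetical two-device inventory.
inventory = [
    Device("core-sw-01", supports_gnmi=True),
    Device("branch-rtr-07", supports_gnmi=False),
]
plan = group_by_input(inventory)
```

A grouping like this could then feed whatever automation renders the per-protocol Telegraf configuration, so a device moves from SNMP to gNMI by changing one inventory flag.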

Telegraf enriches data when necessary before passing it to Kapacitor for analysis. If Kapacitor detects an issue, the system sends an alert to the appropriate personnel. Red Hat stores the analyzed SNMP and gNMI data in different measurements in InfluxDB and writes custom queries for each. Using the Flux language, they can combine different measurements at the query level, while keeping the data separate in the storage tier.
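As a rough illustration of combining measurements at query time, the Python helper below builds a Flux script that reads one field from each measurement and merges the resulting streams with Flux's union() function. The bucket name, the measurement names, and the field names are all assumptions made for this sketch; the source does not say what Red Hat calls them or which Flux functions their queries use.

```python
def build_union_query(bucket: str, sources: dict[str, str], start: str = "-1h") -> str:
    """Build a Flux script that unions one field from each measurement.

    `sources` maps a measurement name to the field to read from it.
    """
    if len(sources) < 2:
        raise ValueError("union() needs at least two input streams")

    blocks = []
    names = []
    for index, (measurement, field) in enumerate(sources.items()):
        name = f"s{index}"  # Flux variable holding this measurement's stream
        names.append(name)
        blocks.append(
            f'{name} = from(bucket: "{bucket}")\n'
            f"  |> range(start: {start})\n"
            f'  |> filter(fn: (r) => r._measurement == "{measurement}" '
            f'and r._field == "{field}")\n'
        )
    # The data stays in separate measurements; it is merged only in the query.
    blocks.append(f"union(tables: [{', '.join(names)}])\n")
    return "\n".join(blocks)


# Hypothetical bucket, measurement, and field names.
query = build_union_query(
    "network",
    {"snmp_interface": "ifHCInOctets", "gnmi_interface": "in_octets"},
)
print(query)
```

The generated script could be passed to any client that accepts Flux. Because the two protocols often name equivalent counters differently, the helper takes a field name per measurement rather than a single shared one.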

Figure: InfluxDB diagram

Dashboards generated from this data include a variety of information, like historic SLI/SLO data and real-time data visualizations.

Figure: Dashboards generated from this data

Results

The architecture diagram below shows the different components and data flows that comprise Red Hat’s network monitoring solution. They rely on Ansible to coordinate network automation for device management and to configure Telegraf, Kapacitor, and InfluxDB instances.

Figure: Architecture diagram of the components and data flows that make up Red Hat’s network monitoring solution

Thanks to this high degree of automation, the solution requires relatively little manual intervention, allowing support engineers to focus on critical issues rather than on managing individual devices and components. InfluxDB helps provide those engineers with the broadest possible range of data to feed the single source of truth and facilitate automation, which improves real-time monitoring capabilities.

For more details about Red Hat’s solution, read the full case study.