Modernizing Network Monitoring with InfluxDB and Telegraf

This article was originally published in The New Stack.

As the technology landscape continues to change at a rapid pace, enterprise companies are in a rush to catch up and modernize their legacy IT and network infrastructure to capture the benefits of newly developed tools and best practices.

By adopting modern DevOps techniques, they can reduce their operational costs, increase the reliability of their services and improve the speed and agility of their IT teams.

Background

Network to Code is a vendor-agnostic network automation solution provider that helps enterprises bring modern DevOps practices to their organization. Network to Code was founded on the idea that companies would benefit tremendously by adopting the type of techniques used by software engineers and applying those techniques to their IT and network infrastructure.

Network to Code helps companies transform the way their networks are deployed, managed, and consumed. They provide both custom solutions and training, so enterprise teams can get up and running quickly while adopting an entirely new approach to managing their infrastructure.

To do this, Network to Code relies on a number of open source tools they can customize for their customers’ use cases. Some of the most commonly used tools are Ansible, Puppet, Terraform, Telegraf, InfluxDB, Prometheus and Grafana.

Collecting telemetry data without reinventing the wheel

One of the biggest problems Network to Code faces when working with new customers is simply getting visibility into the company’s infrastructure. This is because many hardware vendors offer only vendor-specific monitoring tools. Many of these tools are also outdated and can’t scale to provide the fine-grained telemetry data needed for real-time monitoring and insights that provide business value.

In the past, it wasn’t uncommon for network monitoring tools to only provide metrics at intervals of a minute or even longer. To improve the reliability of their networks and respond quickly to problems, network teams need to know what is happening in seconds instead of minutes. Network to Code needed a way to efficiently collect data from all these different hardware sources without having to create a custom solution for every client.

In addition to collecting telemetry data, Network to Code also needed a way to store and query it. All the data in the world is useless if you can’t analyze it quickly and efficiently. Ideally, this datastore would be accessible from whatever analysis tools clients were already using.

Network to Code found a solution to telemetry data collection and storage with Telegraf and InfluxDB.

Collecting telemetry data with Telegraf

Telegraf is an open source, plugin-driven server agent used for collecting metrics and events from numerous sources. Network to Code’s data collection pipeline looks like this:

Network to Code's Telegraf pipeline

Some of the primary benefits Network to Code saw from choosing Telegraf:

It's easy to deploy

Telegraf is deployed as a single binary with no external dependencies. This is important because Network to Code needs to be able to run their metrics collector on their clients’ hardware, where they can’t guarantee what the environment will be. Telegraf makes the deployment process less time-intensive, which allows new clients to be onboarded faster.

Over 250 input plugins

Another advantage of Telegraf for Network to Code is the open source community, which has created over 250 input plugins. This means that Network to Code doesn’t have to reinvent the wheel because plugins already exist to collect data for almost every tool or network protocol their customers are using.

Example Telegraf configuration used by Network to Code for gNMI protocol

Data processing built-in

In addition to input and output plugins, Telegraf provides “processor” plugins, which transform or enrich data before it is stored. These plugins let Network to Code use tools they are comfortable with, like Python and regular expressions, to rename fields, normalize values or enrich data in flight, without adding extra steps to their pipeline.

Example Regex processor plugin
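As a rough illustration of what a regex-based processing step does to a metric in flight, here is a small Python sketch. It is not Telegraf's processor implementation. The rule keys (`key`, `pattern`, `replacement`, `result_key`), the tag names and the device names are assumptions made for this example.

```python
import re


def apply_tag_rules(tags: dict, rules: list) -> dict:
    """Return a copy of `tags` with each matching regex rule applied.

    A rule rewrites the tag named by "key". If "result_key" is given,
    the rewritten value is stored under that name instead, which
    enriches the metric with a new tag and leaves the original intact.
    """
    out = dict(tags)
    for rule in rules:
        value = out.get(rule["key"])
        if value is None:
            continue  # this metric does not carry the tag
        pattern = re.compile(rule["pattern"])
        if pattern.search(value):
            target = rule.get("result_key", rule["key"])
            out[target] = pattern.sub(rule["replacement"], value)
    return out


# Hypothetical rules: normalize interface names across vendors and
# derive a short device name from a fully qualified hostname.
rules = [
    {
        "key": "interface",
        "pattern": r"^(?:GigabitEthernet|Gi)(\d.*)$",
        "replacement": r"ge-\1",
    },
    {
        "key": "source",
        "pattern": r"^([^.]+)\..*$",
        "replacement": r"\1",
        "result_key": "device",
    },
]

tags = {"interface": "GigabitEthernet0/1", "source": "edge-rtr-1.example.net"}
print(apply_tag_rules(tags, rules))
```

Normalizing names at collection time means dashboards and alerts can be written once against a single naming scheme rather than once per vendor.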

Scalable

A final benefit of using Telegraf is the scalability. While Telegraf could be deployed as a single instance that handles all incoming hardware metrics, Network to Code chooses to deploy a Telegraf instance for each individual piece of customer hardware. This makes their entire setup more reliable and scalable, while also allowing them to update configuration on individual machines without affecting metric collection anywhere else.

Storing and analyzing telemetry data with InfluxDB

Network to Code also needed a database to store their telemetry data once it had been collected and transformed by Telegraf. A natural choice was InfluxDB because it was designed for the exact type of time-series metrics data they would be storing. Some of the major benefits Network to Code saw from using InfluxDB:

Query performance

InfluxDB is a time-series database, meaning it was designed from the ground up for working with time-series data. Common queries like grabbing all metrics within a time range are faster and more efficient than they would be in a standard relational database.
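One intuition for why time-range queries can be cheap: if samples are kept ordered by timestamp, the start and end of a range can be located by binary search rather than by scanning every row. The Python sketch below illustrates only that general idea. It is not a description of InfluxDB's storage engine, and the function name and sample data are hypothetical.

```python
from bisect import bisect_left, bisect_right


def range_query(timestamps, values, start, end):
    """Return (timestamp, value) pairs with start <= timestamp <= end.

    Assumes `timestamps` is sorted ascending and aligned with `values`.
    Two binary searches find the slice boundaries in O(log n).
    """
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return list(zip(timestamps[lo:hi], values[lo:hi]))


# Hypothetical samples taken every 10 seconds.
timestamps = [0, 10, 20, 30, 40, 50]
values = [5.0, 5.5, 7.1, 6.8, 6.2, 5.9]

print(range_query(timestamps, values, 15, 40))
```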

Data visualization

Easy data visualization is another bonus of using InfluxDB for Network to Code. Many popular dashboarding tools like Grafana integrate directly with InfluxDB for creating data visualizations. InfluxDB also provides direct access to your data via a REST API if you want more flexibility for adding custom charts and dashboards using the programming language and charting library of your choice.

Example dashboard pulling data from InfluxDB
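For teams that want to pull data into their own charts, a query over the REST API might be prepared as in the sketch below. It assumes the InfluxDB 2.x HTTP API (`POST /api/v2/query` with a Flux script as the body). The server address, organization, token, bucket and measurement names are placeholders rather than values from Network to Code's deployment. The request is only built here, not sent.

```python
import urllib.parse
import urllib.request


def build_flux_query_request(base_url, org, token, flux):
    """Build (but do not send) a Flux query request for InfluxDB 2.x."""
    url = f"{base_url.rstrip('/')}/api/v2/query?org={urllib.parse.quote(org)}"
    return urllib.request.Request(
        url,
        data=flux.encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/vnd.flux",  # body is a raw Flux script
            "Accept": "application/csv",  # results come back as annotated CSV
        },
        method="POST",
    )


# Placeholder bucket and measurement names, for illustration only.
flux = """
from(bucket: "telemetry")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "interface")
"""

req = build_flux_query_request(
    "http://localhost:8086", "example-org", "my-token", flux
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would execute the query against a running server. The CSV response can then be fed to whichever charting library the team prefers.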

Reduced storage costs

Another advantage of being a specialized time-series database is that InfluxDB can make assumptions about the structure of incoming data. This lets InfluxDB use a variety of compression algorithms that reduce the size of the stored data, which can result in significant cost savings when storing petabytes of telemetry data. For Network to Code, this means customers can retain more historical data for less money and make better-informed decisions.
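To see why structural assumptions help, consider timestamps collected at a fixed interval: storing the gaps between samples, and then collapsing repeated gaps, takes far less space than storing every full timestamp. The Python sketch below shows that general technique (delta encoding followed by run-length encoding). It is an illustration only and should not be read as InfluxDB's actual codec. The function names and sample values are hypothetical.

```python
def encode_timestamps(timestamps):
    """Delta-encode sorted timestamps, then run-length encode the deltas.

    Returns (first_timestamp, [(delta, repeat_count), ...]).
    """
    if not timestamps:
        return None, []
    runs = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        delta = curr - prev
        if runs and runs[-1][0] == delta:
            runs[-1][1] += 1  # same gap as before: extend the current run
        else:
            runs.append([delta, 1])
    return timestamps[0], [tuple(run) for run in runs]


def decode_timestamps(first, runs):
    """Rebuild the original timestamps from the encoded form."""
    if first is None:
        return []
    out = [first]
    for delta, count in runs:
        for _ in range(count):
            out.append(out[-1] + delta)
    return out


# Hypothetical samples: a 10-second interval with one late arrival.
timestamps = [1000, 1010, 1020, 1030, 1040, 1055]
first, runs = encode_timestamps(timestamps)
print(first, runs)
```

Six timestamps collapse to one starting value and two runs here. With thousands of evenly spaced samples, the encoded form stays about the same size.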

Automated workflows

InfluxDB provides an integrated task and alert system which can be used to automate many monitoring tasks. Queries can be scheduled to run at defined time intervals and if the result is outside a certain threshold, alerts can be sent to on-call engineers to take action.
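The logic of such a scheduled check can be sketched in a few lines of Python. This is a conceptual illustration, not InfluxDB's task or alerting API. The function name, the status labels, the thresholds and the sample utilization figures are all assumptions made for the example.

```python
from statistics import mean


def check_threshold(samples, warn, crit):
    """Classify the average of recent samples against two thresholds.

    Returns a (status, average) tuple where status is "ok", "warn" or "crit".
    """
    avg = mean(samples)
    if avg >= crit:
        return "crit", avg
    if avg >= warn:
        return "warn", avg
    return "ok", avg


# Hypothetical link utilization (percent) returned by a scheduled query.
recent_utilization = [62.0, 71.5, 93.0, 97.5]

status, avg = check_threshold(recent_utilization, warn=80.0, crit=90.0)
if status != "ok":
    # A real deployment would notify the on-call engineer here.
    print(f"ALERT [{status}]: average link utilization {avg:.1f}%")
```

Averaging over a window, rather than alerting on a single sample, is one simple way to avoid paging engineers for momentary spikes.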

Conclusion

By using Telegraf and InfluxDB, Network to Code is able to quickly and efficiently set up telemetry monitoring solutions for their clients. This allows them to move faster and start adding real value for their clients once telemetry monitoring is in place and producing business insights. This data can be used to respond to outages faster, cut costs by identifying over-provisioned hardware and give network engineering teams the confidence to deploy, knowing they have full observability of their infrastructure.