How to Use Time-Stamped Data to Reduce Network Downtime
By Caitlin Croft / Feb 10, 2023 / Monitoring, Community
This article was originally published in The New Stack and is reposted here with permission.
Telecommunication organizations need to ensure they have the necessary resources and technology to maintain service uptime SLAs.
Increased regulations and emerging technologies have forced telecommunications companies to evolve quickly in recent years. These organizations’ engineers and site reliability engineering (SRE) teams must use technology to improve performance, reliability and service uptime. Learn how WideOpenWest uses a time series platform to monitor its entire service delivery network.
Trends in the telecommunications industry
Telecommunication companies face challenges that vary depending on where each company is in its life cycle. Across the industry, businesses must modernize their infrastructure while also maintaining legacy systems. At the same time, new regulations at both the local and federal levels increase competition within the industry, and new businesses challenge the status quo set by current industry leaders.
In recent years, the surge in people working from home has increased demand for reliable internet connections that can handle greater bandwidth needs. The growing popularity of smartphones and other devices means more devices require network connectivity, all without a reduction in network speeds. Latency issues or poor uptime lead to unhappy customers, who then become flight risks. Add to this more frequent security breaches, which require all businesses to monitor their networks to detect potential breaches faster.
Challenges to modernizing networks
Founded in 1996 in Denver, Colorado, WideOpenWest (WOW) provides internet, video and voice services in various markets across the United States. Over the years, WOW acquired various telecommunication organizations, and as its network expanded, it needed a better network monitoring tool to address a growing list of challenges. For instance, WOW engineers wanted to be able to analyze an individual customer’s cable modem, determine the health of a node and understand the overall state of the network. However, several roadblocks prevented the company from doing so. WideOpenWest already used multiple monitoring platforms internally, and the cost of purchasing hardware to help monitor individual nodes was too high. It already had a basic process in place to collect telemetry data from specific modems, but there was no single source of truth to tie everything together.
Using time series data to reduce network latency
A few years ago, WideOpenWest decided to replace its legacy time series database, and after considering other solutions, it chose InfluxDB, the purpose-built time series database. It now has a four-node cluster of InfluxDB Enterprise in production and a two-node cluster running on OpenStack for testing. The team uses Ansible to automate cluster setup and installation.
The primary motivations for using InfluxDB are to improve overall observability of the entire network and to implement better alerting. The WOW engineers use Telegraf for data collection whenever possible because it integrates easily with all the other systems. Some legacy hardware requires them to use Filebeat, custom scripts and vendor APIs.
They make extensive use of Simple Network Management Protocol (SNMP) polling and traps in the data collection process because SNMP remains an industry standard, despite its age. Specifically, they use SNMP to collect metrics from cable modems and Telegraf to collect time-stamped data from their virtual machines and containers. Using InfluxDB provided the team with the necessary flexibility to work around restrictions from vendor-managed systems, and they now collect data from all desired sources.
Next they stream the data to Kafka to better control data input and output. Kafka also allows them to easily consume or move data into different regions or systems, if necessary. From the Kafka cluster, they use Telegraf to send data to their InfluxDB Enterprise cluster.
WOW’s team aggregates various metrics from the fiber-to-the-node network, such as:
- Telemetry metrics, like usage and uptime, from over 650,000 cable modems on a five-minute polling cycle.
- Status of all television channels upstream and downstream, including audio and visual signal strength and outages.
- Average signal, port and power levels.
- Signal-to-noise ratio (SNR) — used to ensure the highest level of wireless functionality.
- Modulation error ratio (MER) — another measurement used to understand signal quality that factors in the amount of interference occurring on the transmission channel.
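As a rough illustration of how per-modem readings like these can roll up into a view of node health, the Python sketch below averages SNR by node and flags nodes whose average falls under a threshold. The node names, the readings and the 30 dB threshold are hypothetical values chosen for the example, not figures from WOW. The decibel conversion is the standard 10 · log10(signal power / noise power).

```python
import math
from collections import defaultdict
from statistics import mean


def snr_db(signal_power_mw: float, noise_power_mw: float) -> float:
    """Convert a signal-to-noise power ratio to decibels."""
    return 10 * math.log10(signal_power_mw / noise_power_mw)


def average_snr_by_node(readings):
    """Group (node, snr_db) pairs by node and return the mean SNR per node."""
    by_node = defaultdict(list)
    for node, value in readings:
        by_node[node].append(value)
    return {node: mean(values) for node, values in by_node.items()}


def degraded_nodes(readings, threshold_db: float = 30.0):
    """Return the sorted names of nodes whose average SNR is below the threshold.

    The 30 dB default is an assumption for illustration only.
    """
    averages = average_snr_by_node(readings)
    return sorted(node for node, avg in averages.items() if avg < threshold_db)


# Hypothetical readings from one polling cycle: (node, SNR in dB).
readings = [("N1", 36.0), ("N1", 34.0), ("N2", 28.0), ("N2", 31.0)]
print(average_snr_by_node(readings))
print(degraded_nodes(readings))
```

In a production setup this kind of aggregation would more likely run as a query or task inside the time series database than in a standalone script. The sketch only shows the shape of the calculation.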
The WOW team uses all this data to drive real-time analytics, create visualizations, trigger alerts and support troubleshooting. Once the data is in InfluxDB, they use Grafana for all their visualizations. They also leverage InfluxDB’s alerting frameworks to send alerts via ServiceNow, Slack and email. Adopting InfluxDB allowed the WOW team to implement an Infrastructure-as-Code (IaC) system, so instead of spending time manually managing their infrastructure, they can write config files to simplify processes.
WideOpenWest’s next big project is to implement a full CI/CD pipeline with automated code promotions. With this, they hope to improve automated testing. WOW also wants to streamline all monitoring across the organization and increase the level of infrastructure monitoring.