Monitoring Network Outages at the Edge and in the Cloud

This article was originally published in The New Stack and is reposted here with permission.

Gathering data to explore a problem with power outages creating connectivity issues and ultimately draining a laptop battery.

Monitoring locations that have intermittent power and/or connectivity outages can be challenging. In this article, I’ll show how to use InfluxDB, an open source time series database, InfluxDB Cloud and Edge Data Replication to store data locally and send it to a central location whenever possible. I’ll also use Telegraf, an open source data collection agent, to retrieve some of the data needed for this purpose.

Connectivity issues

In this case, I’m sharing a small closet in a remote location with a laptop running Windows. This machine allows my friend to print and scan documents, and I use it to gather air quality data. The setup works fine most of the time; however, occasional power and internet outages limit my friend’s ability to work with documents and interrupt my air quality data collection.

There could be multiple reasons for the failures. The first obvious problem could be with the computer shutting down. The most likely root cause here is power outages, but other factors could be at play too, such as other hardware or software issues. Additionally, there might be problems with the network equipment. The original setup included a local switch and router provided by the internet service provider (ISP).

I suspected that power outages caused the initial problem, which created the connectivity issues and ultimately resulted in the laptop running out of battery. However, I needed more data to validate my hypothesis.

I went about updating my setup for two reasons: 1) to avoid or minimize data loss in the event of a power outage, and 2) to gather more data to validate my hypothesis on the root cause. To that end, I decided to gather data on the laptop’s battery and its connectivity to the switch and the router.

Previous setup

Originally, the laptop ran Telegraf and sent data directly to InfluxDB Cloud. I changed that setup to make it more resilient and to troubleshoot these connectivity issues. By default, Telegraf sends data when there is access to the internet and buffers the data in memory when the connection fails. However, if the network goes down during a power outage that lasts long enough to drain the laptop’s battery, the laptop shuts down and I lose all of the buffered data.

Here’s the previous setup:

[Figure: the previous setup, with Telegraf on the laptop sending data directly to InfluxDB Cloud]

The new setup

To remedy this situation, the laptop now runs a local instance of InfluxDB 2.0 (OSS). Telegraf, which used to output metrics directly to InfluxDB Cloud, now sends all of its data to the local instance of InfluxDB. I configured the local InfluxDB instance for Edge Data Replication, which allows me to set a bucket to automatically send the data in that bucket to an instance of InfluxDB Cloud, whenever there is connectivity.

This setup is more resilient to power outages. Telegraf now sends data to the local InfluxDB OSS instance, so it immediately persists on the laptop. OSS then sends that data to my InfluxDB Cloud account, where I can easily query the data over the internet and build alerts on top of it.
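
The idea of persisting locally first and forwarding later can be illustrated with a small sketch. This is not how InfluxDB or Edge Data Replication is implemented; it is a minimal store-and-forward queue built on SQLite, and the class name, table name and `send` callback are all hypothetical.

```python
import sqlite3
from typing import Callable, List, Tuple


class StoreAndForward:
    """Persist points locally first, then forward them when the uplink works."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # A file path would survive a power loss; ":memory:" is only for demos.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, line TEXT NOT NULL)"
        )
        self.db.commit()

    def write(self, line: str) -> None:
        """Store one line durably before attempting any network transfer."""
        self.db.execute("INSERT INTO queue (line) VALUES (?)", (line,))
        self.db.commit()

    def pending(self) -> int:
        """Number of lines still waiting to be forwarded."""
        return self.db.execute("SELECT COUNT(*) FROM queue").fetchone()[0]

    def flush(self, send: Callable[[List[str]], bool]) -> int:
        """Try to forward all queued lines; delete them only on success."""
        rows: List[Tuple[int, str]] = self.db.execute(
            "SELECT id, line FROM queue ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        if not send([line for _, line in rows]):
            return 0  # uplink is down: keep everything for the next attempt
        self.db.execute("DELETE FROM queue WHERE id <= ?", (rows[-1][0],))
        self.db.commit()
        return len(rows)
```

The key design choice is that a point is removed from the local queue only after the remote side confirms the write, so an outage at any moment leaves the data on disk.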

Here’s the new setup, including edge data replication:

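[Figure: the new setup, with Telegraf writing to the local InfluxDB OSS instance, which replicates to InfluxDB Cloud]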

The diagram also includes additional data to help troubleshoot the original issue.

Configuring the new setup

The laptop automatically logs in when powered on and starts multiple applications, including Telegraf.

I set up my InfluxDB OSS instance and the InfluxDB CLI following instructions on the InfluxDB download page and installed them to C:\Program Files\InfluxData\influxdb.

Then I created a new shortcut in the Startup folder (such as C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp or C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup) to run cmd.exe /c "c:\Program Files\InfluxData\influxdb\influxdb2_windows_amd64\influxd.exe" and set the Run option to Minimized, which will start InfluxDB the next time the computer boots.

To get started setting up InfluxDB, I manually ran the shortcut. Using the InfluxDB CLI I installed earlier, I ran the influx setup command to set up the local instance. I provided my local username, password, organization name and bucket name.

I already had an InfluxDB Cloud account so I created a new bucket there for all of my replicated data. I also created an API token in my cloud account that has write access to this new bucket. I’ll need the token later when setting up replication.

Next, I need to set up Edge Data Replication (EDR) so all the data in my local bucket replicates to the bucket I created in my cloud account.

At this point I need to create a “remote” that tells my local InfluxDB instance about the instance of InfluxDB Cloud where I want it to write data. For this remote, I need to provide a URL, the organization ID of my cloud account (a 16-digit hex number) and the API token I created earlier. The command to do this is:

influx remote create --name cloud --remote-url https://eu-central-1-1.aws.cloud2.influxdata.com --remote-api-token "...." --remote-org-id ....

In this case, my cloud account is on AWS in the EU central region. Obviously, this will differ for InfluxDB Cloud accounts in other regions or on different cloud providers.

IMPORTANT NOTE: The remote-org-id must be the organization ID of the InfluxDB Cloud account, NOT the local organization ID from the influx setup step.

Executing the remote command outputs information about the newly created remote. It’s important to copy the Remote ID (16-digit hex number) because we’ll need it in the next step.

At this point, we have everything we need to create a replication. Here, I specify a single bucket on my local InfluxDB OSS instance to replicate to my InfluxDB Cloud account. The command is:

influx replication create --name cloud-mybucket --remote-id .... --local-bucket-id .... --remote-bucket-id ....

The **remote-id** value must be the ID from the output of the **influx remote create** command above. The values for **local-bucket-id** and **remote-bucket-id** need to be the IDs (16-digit hex numbers) of the buckets in InfluxDB OSS and InfluxDB Cloud, respectively. You can find the bucket ID in the InfluxDB UI. Just go to the Load Data section and click on the Buckets tab.
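
Because three different 16-digit hex IDs are involved, it is easy to paste a bucket name or the wrong ID. As a rough sketch, a small helper could check the format of each ID before assembling the command. The function names are hypothetical; the flags are the ones shown in the command above.

```python
import re
from typing import List

# InfluxDB IDs are described above as 16-digit hex numbers.
ID_PATTERN = re.compile(r"[0-9a-f]{16}")


def check_id(name: str, value: str) -> str:
    """Return value if it looks like a 16-digit hex ID, otherwise raise."""
    if not ID_PATTERN.fullmatch(value.lower()):
        raise ValueError(f"{name} must be a 16-digit hex ID, got {value!r}")
    return value


def replication_create_args(
    name: str, remote_id: str, local_bucket_id: str, remote_bucket_id: str
) -> List[str]:
    """Build the argument list for `influx replication create`."""
    return [
        "influx", "replication", "create",
        "--name", name,
        "--remote-id", check_id("remote-id", remote_id),
        "--local-bucket-id", check_id("local-bucket-id", local_bucket_id),
        "--remote-bucket-id", check_id("remote-bucket-id", remote_bucket_id),
    ]
```

The resulting list could be passed to `subprocess.run(args, check=True)`. The check only validates the format, so it cannot tell a local organization ID from a cloud one.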

Data replication in action

Having a more resilient setup allows me to gather data about outages and, hopefully, to get more data about the underlying problems.

So, in addition to the air quality data, I started gathering networking data too. I added the Ping Input plugin to my Telegraf configuration to periodically check connectivity to:

  • The switch that connects the closet with other parts of the building (the switch has its own IP address)
  • The local IP address for the router that provides internet connection
  • The remote IP address of the Internet Service Provider

I updated the Ping Input Plugin configuration to the following:

[[inputs.ping]]
  urls = ["(switch IP)", "(ISP local ip)", "(ISP remote IP)"]
  count = 3
  timeout = 2.0
  deadline = 10
  interval = "300s"

This checks connectivity to all three IP addresses every five minutes and stores the results in the local instance of InfluxDB. EDR replicates that data to the cloud.
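
Pinging three hops in order makes it possible to narrow down where a failure sits. The sketch below shows one way to reason about the results. The decision rules and the key and label names are my own assumptions for illustration; reachability could be derived, for example, from the ping plugin's `percent_packet_loss` field being below 100.

```python
from typing import Dict


def locate_fault(reachable: Dict[str, bool]) -> str:
    """Guess where connectivity breaks from three ping results.

    Expected keys (hypothetical names): "switch", "router", "isp_remote".
    """
    if not reachable["switch"]:
        # The first hop is down, so nothing beyond the closet can be judged.
        return "switch_or_power"
    if not reachable["router"]:
        # Switch answers, but the ISP-provided router does not.
        return "isp_router"
    if not reachable["isp_remote"]:
        # Local network is fine; the problem is upstream at the ISP.
        return "isp_upstream"
    return "ok"
```

For example, `locate_fault({"switch": True, "router": True, "isp_remote": False})` returns `"isp_upstream"`.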

I also started gathering data about the power and battery state of the laptop. I can tell if the laptop is plugged into AC power, if the battery is draining and how much battery is left. Telegraf doesn’t provide a built-in way to retrieve this data on Windows; however, all I had to do was to write a PowerShell script that queries Windows Management Instrumentation (WMI) and reports this data to Telegraf.

This approach is quite generic, and you can use it to monitor any custom metric or data that you can query from PowerShell. Use the Exec Telegraf Input plugin to send those metrics to InfluxDB.

We can retrieve data that states whether the laptop is currently on AC power from the BatteryStatus class, using the PowerOnline property.

We can get data about the battery from the Win32_Battery WMI class and the EstimatedChargeRemaining property, which returns the percent of remaining battery life.

To be sure the script supports laptops with multiple batteries, I wrote it to iterate over all rows.

The PowerShell script for this is as follows:

# Determine if at least one battery reports that it is on AC power
$power = 0
Foreach ($row in @(Get-CimInstance -class "BatteryStatus" -namespace root\wmi)) {
  if ($row.PowerOnline -eq $true) {
    $power = 1
  }
}

# Calculate sum of percentage of all batteries and divide by number of batteries
$battery = 0
$count = 0

Foreach ($row in @(Get-CimInstance -classname Win32_Battery -property EstimatedChargeRemaining)) {
  $battery = $battery + $row.EstimatedChargeRemaining
  $count = $count + 1
}

# Avoid dividing by zero on machines that report no battery
if ($count -gt 0) {
  $battery = $battery / $count
}

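# Write results as very minimal implementation of line protocol
# (invariant culture keeps "." as the decimal separator regardless of the Windows locale)
Write-Output ([string]::Format([cultureinfo]::InvariantCulture, "batterystatus online={0}i,battery={1}", $power, $battery))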

You can call the script manually via PowerShell:

PS> PowerShell .\batterystatus.ps1
batterystatus online=1i,battery=100

NOTE: PowerShell execution policies may not allow the file to run. If this is the case, use the Unblock-File PowerShell command to allow a single file to run.
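
To make the output format explicit, here is the same averaging and line protocol formatting as a minimal Python sketch. It does not query WMI; the function name and inputs are hypothetical, and the caller is assumed to supply the per-battery values.

```python
from typing import Iterable, Optional


def battery_line(power_online: Iterable[bool], charges: Iterable[float]) -> Optional[str]:
    """Format battery state as one line of InfluxDB line protocol.

    power_online: one flag per battery, True if it reports AC power.
    charges: remaining charge per battery, in percent.
    Returns None when no battery is reported, so nothing is written.
    """
    charges = list(charges)
    if not charges:
        return None
    # "online" is an integer field, hence the trailing "i".
    online = 1 if any(power_online) else 0
    # "battery" is a float field: the average across all batteries.
    average = sum(charges) / len(charges)
    return f"batterystatus online={online}i,battery={average:g}"


if __name__ == "__main__":
    print(battery_line([True], [100]))
```

The measurement name comes first, followed by a space and a comma-separated list of fields. With no timestamp in the line, the time at which the point is collected is used.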

To add this script to Telegraf, we need to add an inputs.exec statement to its configuration, such as:

[[inputs.exec]]
  interval = "60s"
  commands = ["powershell C:/trex/batterystatus-telegraf.ps1"]
  data_format = "influx"

This runs the script every minute, stores the results locally and replicates them to InfluxDB Cloud.

Results

After several weeks of gathering data and a few outages, I had enough data to draw conclusions:

  • A power outage in the closet caused four out of five incidents detected with this setup.
  • In all of the power outages, the switch also stopped working, which was the root cause of the internet connectivity issues.
  • Network issues at the ISP level caused one outage, where both the local and remote IP addresses for the ISP did not respond to ICMP ping.
  • Two of the four power-related outages lasted long enough to completely drain the laptop battery, causing it to shut down.
  • There were no cases where the laptop stopped gathering data without running out of power first, which rules out any issues with the computer or operating system.
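
One way to check the last point is to look for gaps in the battery series and inspect the last reading before each gap. The sketch below assumes samples arrive every 60 seconds, as configured for the battery script, and treats anything longer than twice that interval as a gap. The function name and the tolerance are assumptions, not part of my actual analysis.

```python
from typing import List, Tuple


def find_gaps(
    samples: List[Tuple[int, float]], interval: int = 60, tolerance: float = 2.0
) -> List[Tuple[int, int, float]]:
    """Find periods with no data in a series of (unix_time, battery_percent).

    Returns (last_seen, next_seen, battery_before_gap) for every pause
    longer than interval * tolerance seconds.
    """
    ordered = sorted(samples)
    gaps = []
    for (t0, b0), (t1, _) in zip(ordered, ordered[1:]):
        if t1 - t0 > interval * tolerance:
            gaps.append((t0, t1, b0))
    return gaps
```

A gap that starts with the battery near zero is consistent with the laptop running out of power, while a gap that starts at a high charge level would point to a different cause.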

Having this data made it much easier to determine next steps. The local switch and printer now run behind an uninterruptible power supply (UPS), which resolved most of the issues and made my friend’s life easier.

As for the laptop shutting down, unfortunately, there’s no good solution. This laptop’s BIOS/UEFI settings do not provide a setting to automatically restart once AC power is restored.

As for me, I learned how to set up EDR and how to monitor Windows devices using WMI. My air quality data is also more complete as data replication prevents data loss in cases where the laptop shuts down.