TL;DR InfluxDB Tech Tips – Visualizing Uptime with Flux deadman() Function in InfluxDB Dashboards

A common DevOps use case is alerting when hosts stop reporting metrics, known as a deadman alert. This can be done with the monitor.deadman() Flux function. You can create a deadman (or threshold) check in the Alerts section of the InfluxDB UI, or write a custom task that alerts. See the InfluxDB’s Checks and Notifications system post for more details. It’s also possible to use the monitor.deadman() function directly in a dashboard cell.

What the deadman() function does

The monitor.deadman() function keeps the most recent row for each group (here, each host) and adds a dead column. The column is set to false if that row’s timestamp is at or after the threshold time passed to the function, and to true if it is earlier than the threshold, which indicates a deadman.

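Before going over the Flux query, it helps to explain why the deadman() function is needed. A query that only looks at the check window (say, the past hour) simply omits any host that has stopped writing, so there is no row left to mark as offline. Querying a longer range and letting deadman() flag the stale hosts keeps them in the returned table.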
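To see this behavior in isolation, here is a minimal sketch that runs monitor.deadman() against a few hand-written rows instead of a real bucket. The host names, timestamps, values and the fixed threshold are all made up for illustration.

```flux
import "array"
import "influxdata/influxdb/monitor"

// Hypothetical sample data: host-a reported recently, host-b went quiet earlier.
data =
    array.from(
        rows: [
            {_time: 2021-06-01T10:00:00Z, host: "host-a", _value: 3600},
            {_time: 2021-06-01T11:45:00Z, host: "host-a", _value: 9900},
            {_time: 2021-06-01T09:30:00Z, host: "host-b", _value: 500},
        ],
    )

data
    |> group(columns: ["host"])
    // Fixed threshold (an assumption) so the result does not depend on when the query runs.
    |> monitor.deadman(t: 2021-06-01T11:00:00Z)
```

With this input, the result should contain one row per host: the 11:45 row for host-a with dead set to false, and the 09:30 row for host-b with dead set to true. Note that host-b still appears in the output, which is the point of the function.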

Writing a Flux query to use in a dashboard cell

Let’s say we want to create a dashboard cell that displays a list of hosts and the current uptime or “offline” if it’s not returning metrics.

First we import two packages:

import "influxdata/influxdb/monitor"
import "experimental"

The monitor package provides the monitor.deadman() function, and the experimental package provides the experimental.subDuration() function, which we use to compute the timestamp in the past that serves as the deadman threshold.
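As a small illustration of the threshold calculation, the sketch below subtracts one hour from a fixed reference time. The reference time is an assumption chosen to make the result reproducible; the actual query uses now() instead.

```flux
import "array"
import "experimental"

// Fixed reference time (hypothetical) used in place of now().
reference = 2021-06-01T12:00:00Z

// One hour before the reference time: 2021-06-01T11:00:00Z.
threshold = experimental.subDuration(d: 1h, from: reference)

// Wrap the values in a one-row table so they can be viewed as a query result.
array.from(rows: [{reference: reference, threshold: threshold}])
```

Newer Flux releases also offer date.sub() in the date package for the same calculation; check the documentation for the Flux version you are running before switching.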

Next, we bring back tables with the latest point for each host within the past 7 days:

from(bucket: "bucket_name")
    |> range(start: -7d)
    |> filter(fn: (r) => r["_measurement"] == "system")
    |> filter(fn: (r) => r["_field"] == "uptime")
    |> group(columns: ["host"])
    |> last()

The returned data includes the latest record for each host, with the uptime in the _value column and its timestamp in the _time column:

[Image: Visualizing uptime using Flux query and scripting language]

To check whether a host is a deadman, that is, whether it hasn’t written any data within the past hour, we call the monitor.deadman() function:

|> monitor.deadman(t: experimental.subDuration(d: 1h, from: now()))

Notice the additional dead column:

[Image: Flux - monitor.deadman function]

Hosts whose dead column is false are actively sending in data. Hosts whose dead column is true are deadmen; i.e., they wrote no data in the past hour.

To make this easier to read in a dashboard cell, we add a map() that displays “Offline” for dead hosts and the current uptime for all others:

|> map(fn: (r) => ({ r with _value: if r.dead == true then "Offline" else string(v: r._value) }))

To display a more understandable uptime duration, we can convert seconds to days or hours:

|> map(fn: (r) => ({ r with _value: if r.dead == true then "Offline" else if
                                           r._value > 86400 then
                                                string(v: r._value / 86400) + " days" else
                                                string(v: r._value / 3600) + " hours"
                   }))
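The conversion can be previewed without a bucket. The sketch below applies the same conditional to a few hypothetical rows shaped like the output of monitor.deadman(); the host names and values are made up, and the uptime is assumed to be an integer number of seconds.

```flux
import "array"

// Hypothetical rows: uptime in seconds (integer) plus the dead flag.
array.from(
    rows: [
        {host: "host-a", _value: 200000, dead: false},
        {host: "host-b", _value: 7200, dead: false},
        {host: "host-c", _value: 500, dead: true},
    ],
)
    |> map(
        fn: (r) =>
            ({
                host: r.host,
                // Same branching as in the post: dead hosts first, then days, then hours.
                uptime:
                    if r.dead then
                        "Offline"
                    else if r._value > 86400 then
                        string(v: r._value / 86400) + " days"
                    else
                        string(v: r._value / 3600) + " hours",
            }),
    )
```

Because _value is an integer here, the division truncates: 200000 seconds is shown as "2 days" and 7200 seconds as "2 hours". An uptime just over one day is shown as "1 days", and anything under an hour as "0 hours", which may or may not be acceptable for your dashboard.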

The dashboard cell looks like:

[Image: Displaying uptime duration]

The complete query is:

import "influxdata/influxdb/monitor"
import "experimental"

from(bucket: "bucket_name")
    |> range(start: -7d)
    |> filter(fn: (r) => r["_measurement"] == "system")
    |> filter(fn: (r) => r["_field"] == "uptime")
    |> monitor.deadman(t: experimental.subDuration(d: 1h, from: now()))
    |> map(fn: (r) => ({ r with uptime: if r.dead == true then "Offline" else if 
                                           r._value > 86400 then 
                                                string(v: r._value / 86400) + " days" else
                                                string(v: r._value / 3600) + " hours"
                       }))
    |> group()
    |> keep(columns: ["uptime", "host"])

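Note that last() is not required because monitor.deadman() already returns the latest point for each host. The group() call with no arguments ungroups the data so that it ends up in a single table for display. To make the cell sortable, another column can be added that holds the original numeric uptime, using 0 instead of “Offline” for dead hosts. This map() has to run before keep(), and the new column has to be listed in keep(), or it will be dropped: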

|> map(fn: (r) => ({ r with uptime_sort: if r.dead == true then 0 else r._value }))
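One possible way to wire the sort column into the complete query is sketched below. It folds both columns into a single map() and sorts on the numeric column after ungrouping. The descending order and the decision to keep uptime_sort in the output are assumptions made for illustration; bucket_name is the same placeholder used above.

```flux
import "influxdata/influxdb/monitor"
import "experimental"

from(bucket: "bucket_name")
    |> range(start: -7d)
    |> filter(fn: (r) => r["_measurement"] == "system")
    |> filter(fn: (r) => r["_field"] == "uptime")
    |> monitor.deadman(t: experimental.subDuration(d: 1h, from: now()))
    |> map(
        fn: (r) =>
            ({r with
                // Human-readable column for display.
                uptime:
                    if r.dead then
                        "Offline"
                    else if r._value > 86400 then
                        string(v: r._value / 86400) + " days"
                    else
                        string(v: r._value / 3600) + " hours",
                // Numeric twin of the display column; offline hosts get 0.
                uptime_sort: if r.dead then 0 else r._value,
            }),
    )
    // Ungroup first so the sort runs across all hosts in one table.
    |> group()
    |> sort(columns: ["uptime_sort"], desc: true)
    |> keep(columns: ["host", "uptime", "uptime_sort"])
```

Sorting after group() matters: while the data is still split into one table per host, sort() would only order rows within each single-row table. With desc: true, the longest-running hosts come first and offline hosts sink to the bottom.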

Conclusion on using the Flux deadman() function in InfluxDB Dashboards

I hope this post helps you realize that the Flux monitor.deadman() function is not just for alerts and tasks but can be used in dashboards as well.

If you are new to Flux, or are migrating your InfluxQL queries to Flux and need help, please ask on our community site or Slack channel. If you’re developing a cool IoT application on top of InfluxDB, we’d love to hear about it, so make sure to share your story! Additionally, please share your thoughts, concerns or questions in the comments section. We’d love to get your feedback and help you with any problems you run into!