Three Ways to Keep Cardinality Under Control When Using Telegraf

This article shows how we kept cardinality under control with a few tweaks to the Telegraf configuration. If you’re not yet familiar with it, Telegraf is the native, open source, plugin-driven metrics collection agent for InfluxDB.

As you may know, cardinality is the number of unique combinations of measurements, tag sets, and field keys in a time-series database, and having high cardinality can be a challenge. No worries, though: here are three ways (based on my previous SRE experience) to keep cardinality under control.
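
To give a rough sense of scale: if a measurement is collected from 100 hosts (a host tag with 100 values) and each host reports 30 fields, that is already 100 × 30 = 3,000 unique combinations to index, and every additional tag key multiplies that number.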

In short, they come down to choosing the collection interval based on our needs, collecting only the specific metrics we need, and keeping them only for as long as we need them.

Let’s take a closer look at these three tweaks and how they helped.

Data collection interval

When we needed to re-architect the monitoring system, we asked ourselves: what’s our SLA, and for how long do we want to store the data? Telegraf’s default interval is fine, but was it too short for us? Did we need data from more than 100 servers arriving every 10 seconds? We didn’t, and based on our policies, receiving data every 60 seconds was enough. So that was one of the first things we tweaked.

Here’s our Telegraf configuration for this tweak.

[agent]
  # collect metrics once per minute instead of the 10-second default
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  # flush to the output once per minute as well
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""

  debug = false
  quiet = false
  logfile = ""

  hostname = ""
  omit_hostname = false
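
With both the collection interval and the flush interval set to 60 seconds instead of the 10-second default, each server produces one data point per field per minute instead of six, roughly a sixfold cut in write volume before any other tweak.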

Adjusting the interval is an excellent way to keep cardinality under control, but it isn’t the only one. Let’s look at the team’s (and my own) favorite way, and that is…

Filtering metrics

Filtering metrics can be another excellent way to keep cardinality under control; sometimes, depending on our needs and scenario, we don’t need all the metrics that an input plugin can offer. Take a look at all these fields collected when we have the Mem Telegraf Input Plugin running.

active (integer, Darwin, FreeBSD, Linux, OpenBSD)
available (integer)
available_percent (float)
buffered (integer, FreeBSD, Linux)
cached (integer, FreeBSD, Linux, OpenBSD)
commit_limit (integer, Linux)
committed_as (integer, Linux)
dirty (integer, Linux)
free (integer, Darwin, FreeBSD, Linux, OpenBSD)
high_free (integer, Linux)
high_total (integer, Linux)
huge_pages_free (integer, Linux)
huge_page_size (integer, Linux)
huge_pages_total (integer, Linux)
inactive (integer, Darwin, FreeBSD, Linux, OpenBSD)
laundry (integer, FreeBSD)
low_free (integer, Linux)
low_total (integer, Linux)
mapped (integer, Linux)
page_tables (integer, Linux)
shared (integer, Linux)
slab (integer, Linux)
sreclaimable (integer, Linux)
sunreclaim (integer, Linux)
swap_cached (integer, Linux)
swap_free (integer, Linux)
swap_total (integer, Linux)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer, Linux)
vmalloc_total (integer, Linux)
vmalloc_used (integer, Linux)
wired (integer, Darwin, FreeBSD, OpenBSD)
write_back (integer, Linux)
write_back_tmp (integer, Linux)

It’s a lot, right? Ask yourself: do you need all these values? We didn’t, because the only data that mattered to us was used_percent and available.

For the other input plugins, we asked ourselves the same question, then filtered and collected only the data that mattered to us and that we actively monitored. Our configuration looks like this:

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  fieldpass = ["usage_guest", "usage_system", "usage_idle"]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
  fieldpass = ["used, "free"]

[[inputs.diskio]]
  fieldpass = ["read_bytes", "write_bytes", "read_time", "read_write", "read", "writes"]

[[inputs.mem]]
  fieldpass = ["used_percent", "available"]

[[inputs.net]]
  fieldpass = ["bytes_sent", "bytes_recv", "err_in", "err_out", "drop_in", "drop_out"]

[[inputs.processes]]
  fieldpass = ["running", "zombie", "sleeping", "total"]

[[inputs.swap]]
  fieldpass = ["total", "used", "free"]

[[inputs.system]]
  fieldpass = ["load1", "load15", "load5", "uptime"]

This configuration reduces the number of collected fields from 109 to 30, a reduction of more than 72% in the data collected ((109 - 30) / 109 ≈ 72.5%).
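
If you want to double-check which fields a plugin actually emits with your filters in place, one handy option is Telegraf’s test mode (telegraf --config telegraf.conf --test), which runs the inputs once and prints the collected metrics to stdout without writing anything to the output.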

The other policy that we applied, and that helped control cardinality, was…

Retention policy and downsampling

Customizing the database retention policy in InfluxDB is an excellent way to control cardinality. Ask yourself (one more time) if you need to have three years of data. Do you see yourself querying disk usage of a server that ran three years ago? Even one year ago?

In our case, we set the lifetime of the data in accordance with PCI compliance and our internal policies. We need to retain information about our systems for one year, and for immediate analysis we need the data from the last 14 days. So we architected our monitoring system so that our main buckets hold the data for the previous 14 days, and every 14 days we downsample the older data into another bucket with a one-year retention policy.
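
To make that concrete, here is a minimal sketch of what such a downsampling task could look like in Flux on InfluxDB 2.x. The bucket names (telegraf-14d and telegraf-1y), the organization name, and the hourly mean aggregation are placeholders for this illustration, not our exact production setup:

// Hypothetical downsampling task: runs every 14 days and writes an hourly
// mean of the raw data into the long-retention bucket.
option task = {name: "downsample-telegraf", every: 14d}

from(bucket: "telegraf-14d")  // short-retention bucket (14 days of raw data)
    |> range(start: -task.every)  // only the data collected since the last run
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)  // downsample to hourly means
    |> to(bucket: "telegraf-1y", org: "example-org")  // long-retention bucket (one year)

The short-retention bucket expires its raw data on its own after 14 days, so only the downsampled series stick around for the full year.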

To conclude...

As you can see, with a few tweaks to the Telegraf configuration, such as the interval and the quantity of data that we collect, plus a retention policy and downsampling suited to how long the data actually needs to be available, we were able to keep cardinality under control. What do you think of these tweaks, and what other tricks do you use to keep cardinality under control? Let us know in the comments below.