Three Ways to Keep Cardinality Under Control When Using Telegraf

This article shows how we kept cardinality under control with a few tweaks to the Telegraf configuration. If you’re not yet familiar with it, Telegraf is the native, open source, plugin-driven metrics collection agent for InfluxDB.

As you may know, cardinality is the number of unique combinations of measurements, tag sets, and field keys in a time-series database, and having high cardinality can be a challenge. No worries, though: here are three ways (based on my previous SRE experience) to keep cardinality under control.
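
To give a rough sense of scale: if a measurement is collected from 100 hosts (a host tag with 100 values) and each host reports 30 fields, that is already 100 × 30 = 3,000 unique combinations to index, and every additional tag key multiplies that number.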

In short, they come down to choosing the collection interval based on our needs, collecting only the specific metrics we need, and keeping them only for as long as we need them.

Let’s take a closer look at these three tweaks and how they helped.

Data collection interval

When we needed to re-architect the monitoring system, we asked ourselves: what’s our SLA, and for how long do we want to store the data? Telegraf’s default interval is fine, but was it too short for us? Did we need data from more than 100 servers arriving every 10 seconds? We didn’t, and based on our policies, receiving data every 60 seconds was enough. So that was one of the first things we tweaked.

Here’s our Telegraf configuration for this tweak.

[agent]
  # collect metrics once per minute instead of the 10-second default
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  # flush to the output once per minute as well
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""

  debug = false
  quiet = false
  logfile = ""

  hostname = ""
  omit_hostname = false
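
With both the collection interval and the flush interval set to 60 seconds instead of the 10-second default, each server produces one data point per field per minute instead of six, roughly a sixfold cut in write volume before any other tweak.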

Adjusting the interval is an excellent way to keep cardinality under control, but it isn’t the only one. Let’s look at the team’s (and my own) favorite way, and that is…

Filtering metrics

Filtering metrics can be another excellent way to keep cardinality under control; sometimes, depending on our needs and scenario, we don’t need all the metrics that an input plugin can offer. Take a look at all these fields collected when we have the Mem Telegraf Input Plugin running.

active (integer, Darwin, FreeBSD, Linux, OpenBSD)
available (integer)
available_percent (float)
buffered (integer, FreeBSD, Linux)
cached (integer, FreeBSD, Linux, OpenBSD)
commit_limit (integer, Linux)
committed_as (integer, Linux)
dirty (integer, Linux)
free (integer, Darwin, FreeBSD, Linux, OpenBSD)
high_free (integer, Linux)
high_total (integer, Linux)
huge_pages_free (integer, Linux)
huge_page_size (integer, Linux)
huge_pages_total (integer, Linux)
inactive (integer, Darwin, FreeBSD, Linux, OpenBSD)
laundry (integer, FreeBSD)
low_free (integer, Linux)
low_total (integer, Linux)
mapped (integer, Linux)
page_tables (integer, Linux)
shared (integer, Linux)
slab (integer, Linux)
sreclaimable (integer, Linux)
sunreclaim (integer, Linux)
swap_cached (integer, Linux)
swap_free (integer, Linux)
swap_total (integer, Linux)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer, Linux)
vmalloc_total (integer, Linux)
vmalloc_used (integer, Linux)
wired (integer, Darwin, FreeBSD, OpenBSD)
write_back (integer, Linux)
write_back_tmp (integer, Linux)

It’s a lot, right? Ask yourself: do you need all these values? We didn’t, because the only data that mattered to us was used_percent and available.

For the other input plugins, we asked ourselves the same question, then filtered and collected only the data that mattered to us and that we actively monitored. Our configuration looks like this:

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  fieldpass = ["usage_guest", "usage_system", "usage_idle"]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
  fieldpass = ["used, "free"]

[[inputs.diskio]]
  fieldpass = ["read_bytes", "write_bytes", "read_time", "read_write", "read", "writes"]

[[inputs.mem]]
  fieldpass = ["used_percent", "available"]

[[inputs.net]]
  fieldpass = ["bytes_sent", "bytes_recv", "err_in", "err_out", "drop_in", "drop_out"]

[[inputs.processes]]
  fieldpass = ["running", "zombie", "sleeping", "total"]

[[inputs.swap]]
  fieldpass = ["total", "used", "free"]

[[inputs.system]]
  fieldpass = ["load1", "load15", "load5", "uptime"]

This configuration reduces the number of collected fields from 109 to 30, a reduction of more than 72% in the data collected ((109 - 30) / 109 ≈ 72.5%).
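
If you want to double-check which fields a plugin actually emits with your filters in place, one handy option is Telegraf’s test mode (telegraf --config telegraf.conf --test), which runs the inputs once and prints the collected metrics to stdout without writing anything to the output.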

The other policy that we applied, and that helped control cardinality, was…

Retention policy and downsampling

Customizing the database retention policy in InfluxDB is an excellent way to control cardinality. Ask yourself (one more time) if you need to have three years of data. Do you see yourself querying disk usage of a server that ran three years ago? Even one year ago?

In our case, we set the lifetime of the data in accordance with PCI compliance and our internal policies. We need to retain information about our systems for one year, and for immediate analysis we need the data from the last 14 days. So we architected our monitoring system so that our main buckets hold the data for the previous 14 days, and every 14 days we downsample the older data into another bucket with a one-year retention policy.
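
To make that concrete, here is a minimal sketch of what such a downsampling task could look like in Flux on InfluxDB 2.x. The bucket names (telegraf-14d and telegraf-1y), the organization name, and the hourly mean aggregation are placeholders for this illustration, not our exact production setup:

// Hypothetical downsampling task: runs every 14 days and writes an hourly
// mean of the raw data into the long-retention bucket.
option task = {name: "downsample-telegraf", every: 14d}

from(bucket: "telegraf-14d")  // short-retention bucket (14 days of raw data)
    |> range(start: -task.every)  // only the data collected since the last run
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)  // downsample to hourly means
    |> to(bucket: "telegraf-1y", org: "example-org")  // long-retention bucket (one year)

The short-retention bucket expires its raw data on its own after 14 days, so only the downsampled series stick around for the full year.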

To conclude...

As you can see, with a few tweaks to the Telegraf configuration, such as the interval and the quantity of data that we collect, plus a retention policy and downsampling suited to how long the data actually needs to be available, we were able to keep cardinality under control. What do you think of these tweaks, and what other tricks do you use to keep cardinality under control? Let us know in the comments below.