Three Ways to Keep Cardinality Under Control When Using Telegraf
By Ignacio Van Droogenbroeck / May 20, 2021 / InfluxDB, Community, Telegraf, Developer
This article will show how we kept cardinality under control with a few tweaks to the Telegraf configuration. If you’re not yet familiar with it, Telegraf is the native, open-source, plugin-driven metrics collection agent for InfluxDB.
As you may know, cardinality in a time-series database is the number of unique combinations of measurements, tag sets, and field keys, and high cardinality can be a challenge. No worries, though: here are three ways (based on my previous SRE experience) to keep it under control.
One of these is selecting the collection interval based on our needs, the specific metrics we need, and how long we need to keep them.
Let’s take a closer look at these three tweaks and how they helped.
Data collection interval
When we needed to re-architect the monitoring system, we asked ourselves: what’s our SLA, and for how long do we want to store the data? Telegraf’s default interval of 10 seconds is fine for many setups, but was it too short for us? Did we need data from more than 100 servers arriving every 10 seconds? We didn’t; based on our policies, receiving data every 60 seconds was enough. The arithmetic is simple: at a 10-second interval, each field on each server produces 8,640 points per day, while at 60 seconds it produces 1,440, a sixfold reduction. So that was one of the first things we tweaked.
Here’s our Telegraf configuration for this tweak.
```toml
[agent]
  interval = "60s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "60s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
```
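Not every input has to run at the global interval, either. Telegraf lets each input plugin override the agent-level interval, so slow-moving metrics can be collected even less often. Here’s a sketch; the 5-minute value is just an illustration, not what we ran in production:

```toml
# Slow-moving metrics, such as disk usage, rarely need minute-level resolution.
# The per-plugin "interval" option overrides the [agent] interval for this input only.
[[inputs.disk]]
  interval = "300s"
```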
Adjusting the interval is an excellent way to keep cardinality under control, but it isn’t the only one. Let’s look at the team’s (and my own) favorite way, which is…
Filtering metrics
Filtering metrics is another excellent way to keep cardinality under control; sometimes, depending on our needs and scenario, we don’t need all the metrics an input plugin can offer. Take a look at all the fields collected when the Mem Telegraf Input Plugin is running.
```
active (integer, Darwin, FreeBSD, Linux, OpenBSD)
available (integer)
available_percent (float)
buffered (integer, FreeBSD, Linux)
cached (integer, FreeBSD, Linux, OpenBSD)
commit_limit (integer, Linux)
committed_as (integer, Linux)
dirty (integer, Linux)
free (integer, Darwin, FreeBSD, Linux, OpenBSD)
high_free (integer, Linux)
high_total (integer, Linux)
huge_pages_free (integer, Linux)
huge_page_size (integer, Linux)
huge_pages_total (integer, Linux)
inactive (integer, Darwin, FreeBSD, Linux, OpenBSD)
laundry (integer, FreeBSD)
low_free (integer, Linux)
low_total (integer, Linux)
mapped (integer, Linux)
page_tables (integer, Linux)
shared (integer, Linux)
slab (integer, Linux)
sreclaimable (integer, Linux)
sunreclaim (integer, Linux)
swap_cached (integer, Linux)
swap_free (integer, Linux)
swap_total (integer, Linux)
total (integer)
used (integer)
used_percent (float)
vmalloc_chunk (integer, Linux)
vmalloc_total (integer, Linux)
vmalloc_used (integer, Linux)
wired (integer, Darwin, FreeBSD, OpenBSD)
write_back (integer, Linux)
write_back_tmp (integer, Linux)
```
It’s a lot, right? Ask yourself: do you need all these values? We didn’t; the only fields that mattered to us were used_percent and available.
We asked ourselves the same question for the other input plugins, then filtered and collected only the data that mattered to us and that we actively monitored. Our configuration looks like this:
```toml
[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false
  fieldpass = ["usage_guest", "usage_system", "usage_idle"]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
  fieldpass = ["used", "free"]

[[inputs.diskio]]
  fieldpass = ["read_bytes", "write_bytes", "read_time", "write_time", "reads", "writes"]

[[inputs.mem]]
  fieldpass = ["used_percent", "available"]

[[inputs.net]]
  fieldpass = ["bytes_sent", "bytes_recv", "err_in", "err_out", "drop_in", "drop_out"]

[[inputs.processes]]
  fieldpass = ["running", "zombie", "sleeping", "total"]

[[inputs.swap]]
  fieldpass = ["total", "used", "free"]

[[inputs.system]]
  fieldpass = ["load1", "load15", "load5", "uptime"]
```
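Fields aren’t the only thing worth filtering: series cardinality is driven mainly by tags, so limiting which tagged series a plugin emits can cut cardinality directly. As a sketch (the interface name here is an assumption; pick whatever you actually monitor):

```toml
# The net input's "interfaces" option restricts collection to the listed
# interfaces, so loopback and virtual interfaces don't each create their
# own set of series.
[[inputs.net]]
  interfaces = ["eth0"]
  fieldpass = ["bytes_sent", "bytes_recv"]
```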
This configuration reduces the number of collected fields from 109 to 30, a reduction of more than 72% in the data collected.
The other policy we applied that helped control cardinality was…
Retention policy and downsampling
Customizing the database retention policy in InfluxDB is an excellent way to control cardinality. Ask yourself (one more time) if you need to have three years of data. Do you see yourself querying disk usage of a server that ran three years ago? Even one year ago?
In our case, we set the lifetime of the data in accordance with PCI compliance and our internal policies: we need to retain information about our systems for one year, and for immediate analysis we need the last 14 days. So we architected our monitoring system to keep the previous 14 days of data in our main buckets, and every 14 days we downsample the older data into another bucket with a retention policy of one year.
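A downsampling setup along these lines can be expressed as an InfluxDB task written in Flux. This is only a sketch: the bucket names, organization, measurement filter, and the 5-minute aggregation window are all assumptions you would adapt to your own setup.

```flux
// Runs every hour; aggregates the last hour of raw data into 5-minute means
// and writes the result to a bucket with a one-year retention policy.
option task = {name: "downsample-system-metrics", every: 1h}

from(bucket: "telegraf-14d")
    |> range(start: -task.every)
    |> filter(fn: (r) => r._measurement == "mem" or r._measurement == "cpu")
    |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
    |> to(bucket: "telegraf-1y", org: "my-org")
```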
As you can see, with a few tweaks to the Telegraf configuration, such as the collection interval and the amount of data we collect, plus retention policies and downsampling, we were able to keep cardinality under control. What do you think of these tweaks, and what other tricks do you use to keep cardinality under control? Let us know in the comments below.