Time Series Data, Cardinality, and InfluxDB
By Jason Myers / Mar 15, 2023 / InfluxDB
What is cardinality?
In the world of databases, cardinality refers to the number of unique sets of data stored in a database. If we drill down a little further, we can think of cardinality as the total number of unique values possible within a table column or database equivalent.
When thinking about time series data, we can ask some specific questions about cardinality. What does cardinality look like in practice? When does cardinality become a problem? How do we prevent cardinality issues?
This article looks at these questions and more. We’ll examine the root of the cardinality problem and, ultimately, discuss how InfluxDB’s IOx database engine solves it for time series data.
Cardinality in InfluxDB
To understand how cardinality issues occur, we first need to understand the InfluxDB data model, called line protocol.
There are four components to line protocol:
- Measurement: This is equivalent to a table in a relational database.
- Tags: These are metadata consisting of key-value pairs that contextualize your data. InfluxDB indexes tags, and tag values are always strings.
- Fields: Fields are key-value pairs for the actual data points you’re collecting. Fields can be integers, floats, strings, or Booleans.
- Timestamp: This is the time associated with the data point.
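To make the data model concrete, here is a minimal Python sketch that assembles a single line of line protocol from a measurement, a tag set, a field set, and a trailing nanosecond timestamp. The helper names (escape_key, format_field_value, to_line_protocol) and the sample values are hypothetical, and the escaping is deliberately simplified, so treat it as an illustration of the layout rather than a replacement for an official client library.

```python
def escape_key(text: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field_value(value) -> str:
    """Render a field value with line protocol's type markers (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # string fields are double-quoted


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line shaped like: measurement,tag_set field_set timestamp."""
    if not fields:
        raise ValueError("a point needs at least one field")
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    # Tags are sorted by key here so identical tag sets always render identically.
    tag_set = "".join(
        f",{escape_key(k)}={escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_set = ",".join(
        f"{escape_key(k)}={format_field_value(v)}" for k, v in fields.items()
    )
    return f"{name}{tag_set} {field_set} {timestamp_ns}"


line = to_line_protocol(
    "cpu",
    {"location": "CA1", "host": "server01"},
    {"usage": 64.2, "cores": 8, "healthy": True},
    1678838400000000000,
)
print(line)
# cpu,host=server01,location=CA1 usage=64.2,cores=8i,healthy=true 1678838400000000000
```

Note that the tag values are written as bare strings, while each field value carries its own type: a plain float, an integer with an i suffix, or a Boolean.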
Each unique combination of measurement and tag set creates a series in InfluxDB. If you have a lot of these unique combinations, then you have high cardinality. Now, it’s important to remember that high cardinality is not, by itself, an issue.
You may have a lot of measurement/tag combinations because your tag values contain unbounded data.
Let’s look at an example. In a traditional network monitoring use case, let’s say we’re monitoring some server racks. We want to identify where the servers are geographically, so we may end up with a tag along the lines of location=CA1. We also need to track each individual server box, including the unique IP address associated with each unit. If we have a tag key ip, the tag values could be addresses such as 22.214.171.124, resulting in key-value pairs such as ip=126.96.36.199. Each unique tag value potentially creates a new measurement/tag set combination, increasing the cardinality of the data set.
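The effect of an unbounded tag can be sketched in a few lines of Python. The snippet below counts series as unique measurement/tag set combinations, which follows the definition used in this article. The measurement name net, the 10.0.0.x addresses, and the function names are made up for illustration.

```python
def series_key(measurement: str, tags: dict) -> tuple:
    """A series is identified by its measurement plus its full tag set."""
    return (measurement, tuple(sorted(tags.items())))


def series_cardinality(points) -> int:
    """Count the unique measurement/tag set combinations in a batch of points."""
    return len({series_key(measurement, tags) for measurement, tags in points})


# One location tag shared by every server: a bounded tag.
bounded = [("net", {"location": "CA1"}) for _ in range(50)]

# Add a per-server ip tag: every new address creates a new series.
unbounded = [("net", {"location": "CA1", "ip": f"10.0.0.{i}"}) for i in range(1, 51)]

# Writing the same series again does not raise cardinality.
repeated = unbounded + unbounded

print(series_cardinality(bounded))    # 1
print(series_cardinality(unbounded))  # 50
print(series_cardinality(repeated))   # 50
```

The number of points written is the same in the first two cases. Only the number of distinct tag values changes, and that is what drives cardinality.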
So why does this matter? InfluxDB’s TSM engine stores series keys in an in-memory index that routinely persists to disk. So, your hardware determines your cardinality threshold because once that disk fills up, it negatively affects database performance.
Schema and cardinality
In order to avoid cardinality issues, it’s necessary to consider your data schema.
Now, InfluxDB is a schema-on-write database. This means that you don’t have to create a structured schema before putting data into InfluxDB. If your data follows line protocol, InfluxDB creates a schema automatically. If you introduce a new device/sensor/source into your process that brings with it a new tag or field key, InfluxDB automatically adjusts the schema to incorporate the new data.
It’s somewhat ironic, then, that a database with so much flexibility in terms of data shape forces you to think about cardinality. Still, this is very different from relational and other databases, which require you to define a schema before you can write data into them. If the shape of your data changes, updating the schema in those databases can be a major ordeal.
But if you want to adjust your schema in InfluxDB, what does that look like? Well, this typically comes down to which values you assign as tags and which values you assign as fields. As mentioned above, tag values are always strings. If you have data that generates unbounded tag values, that contributes to high cardinality. Some use cases, like tracing and logs, tend to produce this type of high-cardinality data.
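As a rough illustration of the tag-versus-field decision, the Python sketch below moves an unbounded value out of the tag set and into the field set, then compares the resulting series counts. The point layout (plain dictionaries), the trace_id key, and the helper names are assumptions made for this example only.

```python
def count_series(points) -> int:
    """Count unique measurement/tag set combinations."""
    return len(
        {(p["measurement"], tuple(sorted(p["tags"].items()))) for p in points}
    )


def demote_tag_to_field(point: dict, key: str) -> dict:
    """Return a copy of the point with one tag stored as a string field instead."""
    tags = dict(point["tags"])
    fields = dict(point["fields"])
    if key in tags:
        fields[key] = tags.pop(key)
    return {"measurement": point["measurement"], "tags": tags, "fields": fields}


# Hypothetical trace data: every request carries its own trace_id.
traces = [
    {
        "measurement": "requests",
        "tags": {"service": "checkout", "trace_id": f"t-{i}"},
        "fields": {"duration_ms": 10.0 + i},
    }
    for i in range(1000)
]

as_tag = count_series(traces)
as_field = count_series([demote_tag_to_field(p, "trace_id") for p in traces])

print(as_tag)    # 1000
print(as_field)  # 1
```

The trade-off is that InfluxDB indexes tags but not fields, so filtering on a value that has been moved to a field is likely to be slower. That is why the move does not always make sense.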
At the same time, it might not make sense for you to change a tag value to a field value. What then?
Solving the cardinality problem
InfluxDB has always handled metrics well, but use cases like traces could be hit or miss because of the cardinality issue. InfluxDB’s TSM engine stored each series as a column on disk. In order to open up InfluxDB to the full range of time series use cases, we rebuilt the core database engine.
We built the IOx-powered version of InfluxDB as a columnar datastore, using a variety of open source tools (Apache Arrow, Apache Parquet, and more). The result is a database that can ingest high-volume, high-cardinality time series data without impacting performance. Instead of storing each series in a column, the IOx engine stores each tag and field as a column. This significantly reduces the total number of columns, which improves performance.
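A back-of-the-envelope comparison shows why this change matters. The Python sketch below counts columns under the two layouts as this article describes them: one column per series versus one column per tag key or field key. It is a simplified model for a single measurement that ignores the timestamp column, not a description of either engine's actual on-disk format, and the host names and function names are invented.

```python
def series_columns(points) -> int:
    """Series-per-column layout: one column for every unique measurement/tag set."""
    return len({(m, tuple(sorted(tags.items()))) for m, tags, _ in points})


def key_columns(points) -> int:
    """Tag-and-field-per-column layout: one column per distinct tag or field key."""
    tag_keys, field_keys = set(), set()
    for _, tags, fields in points:
        tag_keys.update(tags)
        field_keys.update(fields)
    return len(tag_keys) + len(field_keys)


# 10,000 hypothetical hosts reporting the same two fields.
points = [
    ("cpu", {"host": f"server{i:05d}", "location": "CA1"}, {"usage": 50.0, "temp": 40.0})
    for i in range(10_000)
]

per_series = series_columns(points)  # grows with every new host
per_key = key_columns(points)        # host, location, usage, temp

print(per_series)  # 10000
print(per_key)     # 4
```

In this model, adding another host adds a column to the first layout but only another value in the existing host column of the second.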
With this new technology powering InfluxDB, users can now expect consistent performance across time series workloads, whether they’re working with metrics, raw event data, traces, or logs.
There are still some schema considerations with the IOx engine, but the basic questions surrounding them are much more straightforward: is your schema too wide or too sparse?
A wide schema has lots of columns, which can affect performance. That was the basis for the whole cardinality issue in the first place. To ensure queries stay performant, the InfluxDB IOx storage engine has a limit of 200 columns per measurement. A quick fix for a wide schema is to split your data into multiple measurements. For example, if you’re collecting data from machines in a factory, instead of dumping all the data from the factory floor into a single measurement, create one measurement for each individual machine.
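The factory example can be sketched as follows, using the 200-column limit quoted above. The machine names, sensor names, and helper functions are hypothetical, and the snippet only counts field columns to show the effect of the split.

```python
COLUMN_LIMIT = 200  # per-measurement limit quoted in this article


def wide_columns(readings) -> set:
    """Single-measurement layout: one field column per machine/sensor pair."""
    return {f"{machine}_{sensor}" for machine, sensor, _ in readings}


def split_columns(readings) -> dict:
    """One measurement per machine: each keeps only its own sensor columns."""
    per_machine = {}
    for machine, sensor, _ in readings:
        per_machine.setdefault(machine, set()).add(sensor)
    return per_machine


# 30 hypothetical machines, each with 10 sensors.
readings = [
    (f"press{m:02d}", f"sensor{s}", 0.0) for m in range(30) for s in range(10)
]

single = wide_columns(readings)
split = split_columns(readings)
widest = max(len(columns) for columns in split.values())

print(len(single), len(single) > COLUMN_LIMIT)  # 300 True
print(len(split), widest)                       # 30 10
```

Dumping everything into one measurement produces 300 columns, which is over the limit. Splitting by machine yields 30 measurements with 10 columns each.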
A sparse schema is one where many rows contain null values. This forces the query engine to evaluate all those null columns at execution time, adding unnecessary overhead to storing and querying data.
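One rough way to spot a sparse schema is to measure how many cells in a measurement are null. The Python sketch below treats each row as a dictionary and any missing key as a null. The null_fraction function and the sensor names are illustrative assumptions, not an InfluxDB feature.

```python
def null_fraction(rows) -> float:
    """Share of cells that are null when rows are laid out over all known columns."""
    columns = set()
    for row in rows:
        columns.update(row)
    total_cells = len(columns) * len(rows)
    if total_cells == 0:
        return 0.0
    filled = sum(1 for row in rows for c in columns if row.get(c) is not None)
    return 1 - filled / total_cells


# Hypothetical sensors that each report a different field.
sparse_rows = [
    {"temp": 21.5},
    {"humidity": 40.0},
    {"pressure": 1013.0},
    {"temp": 22.0, "humidity": 41.0},
]

# The same kind of data where every row fills every column.
dense_rows = [
    {"temp": 21.5, "humidity": 40.0},
    {"temp": 22.0, "humidity": 41.0},
]

print(round(null_fraction(sparse_rows), 3))  # 0.583
print(null_fraction(dense_rows))             # 0.0
```

A high fraction suggests that rows with little in common are sharing a measurement, and that grouping related fields together may reduce the overhead.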
For a long time, cardinality was the proverbial ‘rock in the shoe’ for InfluxDB. Sure, it still ran, but not as comfortably as it could. With the InfluxDB IOx engine, performance is front and center, and with cardinality no longer the problem it once was, InfluxDB can ingest and analyze large workloads in real time.
So, whether you’re a seasoned InfluxDB pro, or just discovering it, try out the new InfluxDB IOx engine to see how it can accelerate your Time to Awesome.