Don’t Let Time Series Data Break Your Relational Database
By Jessica Wachtel / May 15, 2023 / Developer, Community
This article was originally published in The New Stack and is reposted here with permission.
It’s tempting to stuff time series data into the familiar Postgres or MySQL database, but that’s a bad idea for many reasons.
To the uninitiated or unfamiliar, time series data exhibits similar characteristics to relational data, but the two data types have some critical differences. Relational data’s main objective is to maintain an accurate representation of the current state of the world, with respect to its objects and the relationships between them. Time series data tells the story of what’s happening in the world right now.
For example, think about the real-time insights and immediate signal/anomaly detection that DevOps engineers need. You can use the constant stream of observations to detect trends, to find relevant information, to identify and remove noise and to uncover unexpected patterns that signal security threats. Time series data makes these insights possible. Sure, time series data can fit into the row/table format, but it’s better suited to a columnar database that uses the timestamp as its primary key.
Relational data vs. time series data
As the name implies, relational data is data that illustrates a relationship. The purpose of relational data is to maintain accurate records of objects and their relationships to each other. Relational data is transactional and updated frequently to maintain accuracy.
The purpose of time series data is to provide insight for analysis and summarization. A series is a stream of observations, so by nature the data points are related by source of origin, but the data points are immutable because the past cannot change. While a single point might not be useful, the series as a whole reveals how the source changes over time.
Relational databases are built for relational data
It might seem obvious, but relational databases are built for relational data. Time series data has very different characteristics and workloads, so a relational database is a poor fit for it.
Relational databases can’t handle the ingestion speeds of time series at scale. Because the problem is one of scale, it only surfaces once the workload has grown. As a result, many people start out using a relational database for time series and end up having to do more work once they reach a scaling inflection point.
For every origin source stored in a relational database, an estimated 10 times more storage space is needed for its associated time series data. Relational databases aren’t built for this type of growth profile, nor are the features of relational databases needed for this type of data.
For example, time series workloads favor low latency between writes and reads over safeguards such as database backups. When a relational database workload reaches the scalability tipping point, write speeds slow down as the database backs up as a safety precaution. The higher latencies impede automated systems’ ability to act immediately on any irregularities.
Another challenge with relational databases is their lack of flexibility because of explicit schema requirements. The database must undergo a labor-intensive migration whenever you need to update the schema. This is a risky undertaking because it is possible to lose or corrupt data no matter how careful developers are during the process.
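To make the schema constraint concrete, here is a minimal sketch using SQLite from the Python standard library as a stand-in for any relational database. The table name `cpu` and its columns are hypothetical, chosen only for illustration. A write that carries a new metric is rejected until the schema is migrated.

```python
import sqlite3

# In-memory SQLite database standing in for a relational store.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts INTEGER PRIMARY KEY, host TEXT, usage REAL)")
conn.execute(
    "INSERT INTO cpu (ts, host, usage) VALUES (?, ?, ?)",
    (1, "server01", 0.64),
)

# A new metric shows up in the stream. The write fails because the
# column was never declared in the schema.
try:
    conn.execute(
        "INSERT INTO cpu (ts, host, usage, temperature) VALUES (?, ?, ?, ?)",
        (2, "server01", 0.70, 55.0),
    )
    insert_failed = False
except sqlite3.OperationalError:
    insert_failed = True

# The schema has to be migrated before the new metric can be stored.
conn.execute("ALTER TABLE cpu ADD COLUMN temperature REAL")
conn.execute(
    "INSERT INTO cpu (ts, host, usage, temperature) VALUES (?, ?, ?, ?)",
    (2, "server01", 0.70, 55.0),
)

# Rows written before the migration carry NULL for the new column.
rows = conn.execute("SELECT ts, usage, temperature FROM cpu ORDER BY ts").fetchall()
```

Adding one nullable column to a tiny table is the easy case. On a large production table, the same change typically has to be planned, tested and coordinated with every writer.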
Time series databases are built for time series data
InfluxDB is a purpose-built time series database, delivered via cloud, on premises and open source. It is designed to meet the needs of time series data. In terms of scaling, in InfluxData’s internal benchmarking, InfluxDB ingests orders of magnitude more data per second using significantly less CPU and memory than other databases, even those that claim to be tuned for time series.
InfluxDB is “schema on write,” meaning developers can add new dimensions and fields by simply adding them to their writes. There are no change requirements to any production or development databases. This offers flexibility for workloads with changing data shapes.
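As a rough illustration of what adding fields "to their writes" can look like, the sketch below formats points in the general shape of InfluxDB line protocol (measurement, optional tags, fields, timestamp). The helper `to_line_protocol` is hypothetical and heavily simplified: it assumes float fields only and does no escaping, so treat it as a sketch rather than a client library.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as a simplified line-protocol string.

    Hypothetical helper for illustration: float fields only, no escaping
    of spaces or commas, and no integer or string field types.
    """
    head = ",".join([measurement] + [f"{k}={v}" for k, v in sorted(tags.items())])
    field_part = ",".join(f"{k}={float(v)}" for k, v in fields.items())
    return f"{head} {field_part} {timestamp_ns}"


# First write: one tag, one field.
first = to_line_protocol(
    "cpu", {"host": "server01"}, {"usage": 0.64}, 1684108800000000000
)

# Later write: a new tag and a new field simply appear in the point.
# No migration step precedes it.
second = to_line_protocol(
    "cpu",
    {"host": "server01", "region": "us-west"},
    {"usage": 0.70, "temperature": 55.0},
    1684108810000000000,
)
```

The second point differs from the first only in the data it carries. The new dimension and field arrive with the write itself.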
Apache Arrow for time series
Time series is all about understanding the current picture of the world and offering immediate insight and action. Relational databases can perform basic data manipulation, but they can’t execute advanced calculations and analytics on multiple observations.
Because time series data workloads are so large, they need a database that can work with large datasets easily. Apache Arrow is specifically designed to move large amounts of columnar data. Building a database on Arrow gives developers more options to operate on their data effectively, through advanced data analysis with tools such as pandas and through machine learning and artificial intelligence tooling.
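The sketch below shows the idea behind a columnar layout using only the Python standard library. It is not Arrow itself: the `array` module stands in for Arrow's typed, contiguous buffers, and the sample rows are made up for illustration.

```python
from array import array

# Row-oriented view: each observation is a separate record.
rows = [
    {"time": 1, "host": "a", "usage": 0.5},
    {"time": 2, "host": "a", "usage": 0.25},
    {"time": 3, "host": "a", "usage": 0.75},
]

# Column-oriented view: each field lives in its own typed, contiguous buffer.
columns = {
    "time": array("q", (r["time"] for r in rows)),    # signed 64-bit integers
    "usage": array("d", (r["usage"] for r in rows)),  # 64-bit floats
}

# An aggregate over one field scans a single buffer and never touches
# the other fields.
mean_usage = sum(columns["usage"]) / len(columns["usage"])

# The raw bytes of a column can be handed over as one block.
usage_bytes = columns["usage"].tobytes()
```

Analytical queries over time series tend to read a few fields across many timestamps, which is the access pattern a columnar layout favors.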
Some may be tempted to simply use Arrow as an external tool alongside their current solution. However, this approach isn’t workable: if the database doesn’t return data in Arrow format right from the source, the production application will struggle to ensure there’s enough memory to work with large datasets. The application code will also lack the compression Arrow provides. Transferring poorly compressed bytes across the wire increases latency between the database and the code, which hurts overall performance.
Shrinking the learning curve
Building InfluxDB on the Apache ecosystem created an opportunity to add SQL support into the time series database. InfluxDB uses DataFusion as its query engine, and DataFusion uses SQL as the query language, meaning anyone who knows SQL can now query time series. There’s no additional language requirement.
To further enhance ease of access, there are already three time series-specific functions in DataFusion. These are all open source, so anyone within the Apache Arrow community can benefit from or contribute to them.
1. date_bin() – Creates rows that are time windows of data with an aggregate.
2. selector_first(), selector_last() – Provide the first or last row of a table that meets specific criteria.
3. time_bucket_gapfill() – Returns windowed data, and if there are windows that lack data it will fill those gaps.
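To give a feel for what windowing and gap filling do, here is a plain Python sketch of the semantics described in the list above. It is not DataFusion's implementation. The function names `date_bin` and `gapfill` are borrowed only for readability, timestamps are plain integer seconds, and empty windows are filled with `None`.

```python
from collections import defaultdict


def date_bin(interval, ts, origin=0):
    """Return the start of the interval-wide window that contains ts."""
    return origin + ((ts - origin) // interval) * interval


def windowed_mean(points, interval):
    """Group (timestamp, value) pairs into windows and average each window."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[date_bin(interval, ts)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


def gapfill(windows, interval, start, stop):
    """Emit every window in [start, stop), using None where no data exists."""
    return {w: windows.get(w) for w in range(start, stop, interval)}


# Made-up observations as (seconds, value); nothing falls in the 120-179 window.
points = [(0, 1.0), (20, 3.0), (65, 4.0), (185, 8.0)]

means = windowed_mean(points, 60)
filled = gapfill(means, 60, 0, 240)
```

Here `means` has entries only for windows that contain data, while `filled` also includes the empty window starting at 120, which downstream charts and alerts can then handle explicitly.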
Time series data has different characteristics, storage requirements and workloads than relational data. Because the data types appear similar, it’s important to be aware of these differences early in the process. The later into production these issues are identified, the harder they are to solve.
Time series data works best with a time series database like InfluxDB to account for low latency at high ingestion rates, the flexibility of schema on write data collection and advanced data analysis. Native SQL support in InfluxDB makes time series data workloads more accessible to SQL users.
You can avoid or fix any of the pitfalls outlined above by simply adding a time series database to your tech stack.