InfluxDB Internals 101 - Part Two
Ryan Betts /
Nov 27, 2017
- Query path: reading data from InfluxDB
- Indexing points for query
- A note on TSI (on disk indexes)
- Executing queries
- A note on IFQL
- DELETE and DROP - removing data from InfluxDB
- Updating points
Part One of this series describes the InfluxDB write path: how the database persists and organizes data being written to the database. This part (Part Two) describes the other main interaction with the database: querying data once it has been persisted. Note that Part One also defines the InfluxDB jargon used in this post (
series) which will be helpful to new readers.
InfluxDB is queried using a SQL dialect called
influxql. There is quite a bit of documentation for the language as well as a guide to using
influxql for different querying tasks. This post focuses on how the query engine works and not on the semantics of the language itself.
Time series applications tend to query in two patterns. Queries either window and produce per-window aggregates (window data into one-minute intervals and calculate the average for each minute). Or, queries search for a specific point (often the
last() or most recent point in a series). Both query patterns filter the points in the database by criteria applied to a set of dimensions; for example, all the data where
region = us-east or where
measurement = 'cpu'. In InfluxDB, these dimensions are stored as
Finally, before we get into more detail, it is important to note that
projection operators but does not support traditional relational
joins. Optimizing query performance in InfluxDB requires finding the initial point for each series and then leveraging columnar storage to efficiently scan a sequence of points following that initial point. The use of flexible schema-on-write
tagsets vs. pre-defined dimension tables in a star-schema is one of the more interesting differences between InfluxDB and a traditional SQL columnar OLAP database.
Indexing Points for Query
Part One describes the different data structures populated by incoming writes to achieve durability and compact long-term storage. There is one additional data structure populated by writes to make queries efficient: the index. InfluxDB automatically maintains an index to make filtering by
The index maintains mappings of
measurement name to
field keys, of
measurement name to
series ids (an internal series identifier), of
measurement name to
tag keys to
tag value to
series id, and of
series id to
shards. The index (as of version 1.4) also maintains sketches of
measurements for fast cardinality estimates. You can read the index implementation on GitHub for more detail.
That’s a lot of different mappings to think about and understand. Personally, I find it easier, and conceptually accurate, to think of the index as a posting list (aka inverted index) that maps tag key/value pairs to a list of series keys. This slight abstraction captures the primary purpose of the index: to make it efficient at query time to identify all series that need to be scanned based on a
tagset filter in an
influxql WHERE predicate.
A Note on TSI (On-disk Index)
The current default index is stored in-memory. This allows fast lookup for query planning. However, it also means that high-cardinality data, data that include a large number of unique
tagsets, requires a lot of memory to index. This is why we suggest that users use
tagsets for lower-cardinality dimension data and use unindexed
field values for high-cardinality data.
We are developing a new index structure, Time Series Index (TSI), which is now shipping as an opt-in preview. TSI stores the index on SSD, allowing much higher cardinality datasets than the default in-memory index.
Parsing and Planning
Having described the index, it is possible to explain the internal workflow that runs to parse, plan, and execute an example
influxql query. The query engine:
- Determines the type of query (one with an expression or a raw data query)
- Determines and then separates the time range and the condition expression for filtering data
- Determines which shards it needs to access using the list of measurements and the time frame
- Expands any wildcards
- Validates that the query is semantically correct
- Directs the storage engine to create the iterators for each shard
- And merges the shard iterator outputs, performing any post-processing on the data
select user, system from cpu where time > now() - 1h and host = 'serverA'
The database receives the query and parses out the measurements that are accessed, fields returned, grouping time intervals, filter predicates, and other
influxql query components. You can read the AST structure for the SELECT statement in the influxdata/influxql GitHub repository.
After parsing, the query engine determines which series are needed to produce an answer. In this example, the query engine uses the index to find all
series that are part of the cpu
measurement. It then uses the index to find all
series that have the
tag key, tag value pair host, serverA. The intersection of these sets provide the
series that need to be scanned. The time range in the query, now() - 1h, limits the scan to
shard groups covering the last one hour.
The query engine instantiates an iterator for each series, for each shard. These iterators are nested, forming a tree. The iterator tree is executed bottom-up, reading, filtering, and merging data to produce a final result set.
The version 1.4
EXPLAIN ANALYZE statements provide statistics on iterators created and TSM blocks decoded as part of query execution. There are example outputs in the What’s New in InfluxDB 1.4 blog post.
A Note on IFQL
The combination of schema-on-write, automatic indexing of
tagsets, and SQL-like syntax produce a system that allows newcomers to be productive quickly, that feels familiar, and requires minimal setup to get started.
However, the pre-allocation of narrowly scoped iterators means high-cardinality queries, and queries that produce a very large number of groups are expensive to plan. The iterator structures can consume, worst case, GBs of RAM. Secondly, the iterator allocation during planning and other implementation details make multi-query resource management difficult. Finally, while SQL-like syntax is a good fit for simple queries, it becomes cumbersome for more sophisticated analytics. Time series queries are often sets of functions applied to groupings of filtered streams. Expressing these queries using select-project-join logic with advanced SQL partition and over clauses requires an experienced SQL programmer and is no longer beginner- friendly.
We recently announced a prototype query language, IFQL, to explore solutions to these problems: cheaper planning, better resource management, and easier expression of complex queries.
DELETE and DROP: Removing Data from InfluxDB
InfluxDB supports retention policies to enforce time to live policies against data. This is always the preferred way to regularly delete points from the database. However, applications sometimes write bad data to the database. That data needs to be removed to return to normal operation. In these cases,
DROP can be used to delete unwanted points.
DELETE and DROP statements are processed through the query layer, not the write layer. This allows DELETE and DROP to re-use the selection and expression features of
Deleting data from a columnar database is expensive. InfluxDB organizes data on disk into immutable runs of values for a single column of a series. A delete operation needs to undo a lot of that work for a subset of points.
In InfluxDB, deleting a row from the database produces a tombstone. A tombstone includes a
series key and the min and max time of the deleted range. This allows a very compact expression for the primary delete use case: delete all data for an invalid series between times t1 and t2.
When sufficient tombstones collect, TSM data is re-compacted into a new immutable file with the deleted data removed and tombstone records deleted. At query time, tombstones are checked to avoid processing data marked as deleted.
Over the last six months, substantial work has gone into making tombstone management, compaction based on accumulated deletes, and index updates after deletes, correct and efficient.
InfluxDB does not support an
UPDATE statement. However, re-inserting a fully qualified
series key at an existing timestamp will replace the old point’s
field value with the new
Hopefully this post has added to your mental model of InfluxDB. It discusses four key concepts:
tagsetsare indexed for query planning.
- Query planning uses the index to identify series to scan.
- Query planning generates and executes a tree of iterators.
- DELETE and DROP statements are part of
influxqland result in tombstones to annotate deleted data.