InfluxDB Internals 101 - Part Two

Navigate to:

  • Query path: reading data from InfluxDB
    • Indexing points for query
    • A note on TSI (on disk indexes)
    • Executing queries
    • A note on IFQL
    • DELETE and DROP - removing data from InfluxDB
    • Updating points

Introduction

Part One of this series describes the InfluxDB write path: how the database persists and organizes data being written to the database. This part (Part Two) describes the other main interaction with the database: querying data once it has been persisted. Note that Part One also defines the InfluxDB jargon used in this post (tagset, fieldset, measurement, series) which will be helpful to new readers.

InfluxDB is queried using a SQL dialect called influxql. There is quite a bit of documentation for the language as well as a guide to using influxql for different querying tasks. This post focuses on how the query engine works and not on the semantics of the language itself.

Time series applications tend to query in two patterns. Queries either window and produce per-window aggregates (window data into one-minute intervals and calculate the average for each minute). Or, queries search for a specific point (often the last() or most recent point in a series). Both query patterns filter the points in the database by criteria applied to a set of dimensions; for example, all the data where region = us-east or where measurement = 'cpu'. In InfluxDB, these dimensions are stored as tagsets.

Finally, before we get into more detail, it is important to note that influxql supports selection and projection operators but does not support traditional relational joins. Optimizing query performance in InfluxDB requires finding the initial point for each series and then leveraging columnar storage to efficiently scan a sequence of points following that initial point. The use of flexible schema-on-write tagsets vs. pre-defined dimension tables in a star-schema is one of the more interesting differences between InfluxDB and a traditional SQL columnar OLAP database.

Indexing Points for Query

Part One describes the different data structures populated by incoming writes to achieve durability and compact long-term storage. There is one additional data structure populated by writes to make queries efficient: the index. InfluxDB automatically maintains an index to make filtering by tagsets efficient.

The index maintains mappings of measurement name to field keys, of measurement name to series ids (an internal series identifier), of measurement name to tag keys to tag value to series id, and of series id to shards. The index (as of version 1.4) also maintains sketches of series and measurements for fast cardinality estimates. You can read the index implementation on GitHub for more detail.

That’s a lot of different mappings to think about and understand. Personally, I find it easier, and conceptually accurate, to think of the index as a posting list (aka inverted index) that maps tag key/value pairs to a list of series keys. This slight abstraction captures the primary purpose of the index: to make it efficient at query time to identify all series that need to be scanned based on a tagset filter in an influxql WHERE predicate.

A Note on TSI (On-disk Index)

The current default index is stored in-memory. This allows fast lookup for query planning. However, it also means that high-cardinality data, data that include a large number of unique tagsets, requires a lot of memory to index. This is why we suggest that users use tagsets for lower-cardinality dimension data and use unindexed field values for high-cardinality data.

We are developing a new index structure, Time Series Index (TSI), which is now shipping as an opt-in preview. TSI stores the index on SSD, allowing much higher cardinality datasets than the default in-memory index.

Parsing and Planning

Having described the index, it is possible to explain the internal workflow that runs to parse, plan, and execute an example influxql query. The query engine:

  • Determines the type of query (one with an expression or a raw data query)
  • Determines and then separates the time range and the condition expression for filtering data
  • Determines which shards it needs to access using the list of measurements and the time frame
  • Expands any wildcards
  • Validates that the query is semantically correct
  • Directs the storage engine to create the iterators for each shard
  • And merges the shard iterator outputs, performing any post-processing on the data

Sample query: select user, system from cpu where time > now() - 1h and host = 'serverA'

The database receives the query and parses out the measurements that are accessed, fields returned, grouping time intervals, filter predicates, and other influxql query components. You can read the AST structure for the SELECT statement in the influxdata/influxql GitHub repository.

After parsing, the query engine determines which series are needed to produce an answer. In this example, the query engine uses the index to find all series that are part of the cpu measurement. It then uses the index to find all series that have the tag key, tag value pair host, serverA. The intersection of these sets provide the series that need to be scanned. The time range in the query, now() - 1h, limits the scan to shard groups covering the last one hour.

The query engine instantiates an iterator for each series, for each shard. These iterators are nested, forming a tree. The iterator tree is executed bottom-up, reading, filtering, and merging data to produce a final result set.

The version 1.4 EXPLAIN and EXPLAIN ANALYZE statements provide statistics on iterators created and TSM blocks decoded as part of query execution. There are example outputs in the What’s New in InfluxDB 1.4 blog post.

A Note on IFQL

The combination of schema-on-write, automatic indexing of tagsets, and SQL-like syntax produce a system that allows newcomers to be productive quickly, that feels familiar, and requires minimal setup to get started.

However, the pre-allocation of narrowly scoped iterators means high-cardinality queries, and queries that produce a very large number of groups are expensive to plan. The iterator structures can consume, worst case, GBs of RAM. Secondly, the iterator allocation during planning and other implementation details make multi-query resource management difficult. Finally, while SQL-like syntax is a good fit for simple queries, it becomes cumbersome for more sophisticated analytics. Time series queries are often sets of functions applied to groupings of filtered streams. Expressing these queries using select-project-join logic with advanced SQL partition and over clauses requires an experienced SQL programmer and is no longer beginner- friendly.

We recently announced a prototype query language, IFQL, to explore solutions to these problems: cheaper planning, better resource management, and easier expression of complex queries.

DELETE and DROP: Removing Data from InfluxDB

InfluxDB supports retention policies to enforce time to live policies against data. This is always the preferred way to regularly delete points from the database. However, applications sometimes write bad data to the database. That data needs to be removed to return to normal operation. In these cases, DELETE and DROP can be used to delete unwanted points.

DELETE and DROP statements are processed through the query layer, not the write layer. This allows DELETE and DROP to re-use the selection and expression features of influxql.

Deleting data from a columnar database is expensive. InfluxDB organizes data on disk into immutable runs of values for a single column of a series. A delete operation needs to undo a lot of that work for a subset of points.

In InfluxDB, deleting a row from the database produces a tombstone. A tombstone includes a series key and the min and max time of the deleted range. This allows a very compact expression for the primary delete use case: delete all data for an invalid series between times t1 and t2.

When sufficient tombstones collect, TSM data is re-compacted into a new immutable file with the deleted data removed and tombstone records deleted. At query time, tombstones are checked to avoid processing data marked as deleted.

Over the last six months, substantial work has gone into making tombstone management, compaction based on accumulated deletes, and index updates after deletes, correct and efficient.

Updating Points

InfluxDB does not support an UPDATE statement. However, re-inserting a fully qualified series key at an existing timestamp will replace the old point’s field value with the new field value.

Conclusion

Hopefully this post has added to your mental model of InfluxDB. It discusses four key concepts:

  • series and tagsets are indexed for query planning.
  • Query planning uses the index to identify series to scan.
  • Query planning generates and executes a tree of iterators.
  • DELETE and DROP statements are part of influxql and result in tombstones to annotate deleted data.