Data Modeling: Part 2 — Method for Time Series Databases

Navigate to:

This article was originally published in The New Stack and is reposted here with permisison. You can find Part 1 of this series here.

Time-varying entities may contain multiple time-varying and static attributes, making mapping them a particular challenge.

Time is notorious in modeling tasks. Indeed, the temporal aspect exacerbates the complexity of the modeling task, making simple diagrams look pretty complex. The temporal dimension becomes particularly nasty when it takes part in identifying entities.

Employees

The figure on the right visualizes the typical database example: The employee is the traditional entity we observe at a given time, and when the attribute is updated, we strike out the previous version of the entity.

Nevertheless, time series databases (TSDBs) are analytical, and therefore offer a more convenient perspective on the problem. Indeed, a time series is a less complex object than a temporal entity, yet has its own nuances.

To understand what kind of entities a time series describes, let’s think about what information needs a query language like Flux aims to satisfy. Typically, when querying a time series database, we want to observe a phenomenon holistically over a time range. Although each series element contains individual timestamps, we focus on the big picture — the trend.

What kind of entities are we trying to represent?
Time-varying entities, entities whose attributes vary over time.

The answer comes directly from the literature on temporal databases: Time-varying entities are entities characterized by time-varying attributes — the variability of the value associated with an attribute is an integral part of the attribute definition. Formally speaking, a time-varying attribute is a function from the temporal domain to the domain associated with the attribute type.

Time-varying entities differ from traditional entities, which model an instantaneous view. In practice, a time-varying entity describes the variations of all its time-varying attributes over time. The figure above also shows the time-varying interpretation of the employee entity, where we can observe the evolution of his age and salary over time.

To better understand the concept, let’s consider the following scenario: The national billiard association (NBA) wants to track the pool matches for the yearly championship. Traditional database design would represent players and matches as entities of a tournament. However, the NBA wants to track the full match history. Intuitively, at least two time-varying attributes characterize a match: each player’s point total at any given time.

As a static entity typically translates into a Table (considering the modeling techniques mentioned above), when it comes to time-varying entities, we can consider the association Measurement ⇐⇒ Entity as valid. As time-varying entities may contain multiple time-varying and static attributes, we need to understand how to map such concepts into TSBDs and, in particular, InfluxDB. Moreover, the entity’s primary key may be composite; in that case, the TSDB counterpart should preserve the identifying nature of the attribute.

The table below shows our proposed naive mapping.

Conceptual Logical Physical
Time-Varying Entities Measurement Measurement
Time-Varying Attribute name Field key Field key
Time-Varying Attribute value Field value Field value
Attribute name (part of the primary key) Tag key Tag key with K convention
Attribute value (part of the primary key) Tag value Tag value with K convention
Attribute name (not part of the primary key) referenced field key with constant convention or external table
Attribute value (not part of the primary key) referenced field key with constant convention or external table

Notably, InfluxDB indexes tags by both key and value. Therefore, their cardinality may affect performance. We can treat attributes that do not take part in the primary keys but are relevant for several queries as tags to speed up query execution.

Additionally, time-varying entities described by several static attributes pose an exciting modeling challenge related to the notion of weak entities from relational database design.

A weak entity is an entity without a primary key that uniquely identifies it. Therefore, it must depend on the strong entity set for its unique identification.

Whenever it is possible to distinguish both a time-varying aspect of an entity and a static one, we can consider splitting them apart and representing the static aspect as a time-varying weak entity to be outsourced to an external database. In practice, InfluxDB allows time series enrichment using a variety of relational databases. This approach helps avoid redundancies by keeping the series succinct.

Finally, although the Measurement ⇐⇒ Entity association appears practical, such an association does not provide a complete perspective on InfluxDB abstractions, as we will see in the following.

What about relationships?
The discriminatory aspect is whether time-varying attributes characterize the relationship.

Like traditional database design, TSDB relationships are slightly more sophisticated in their modeling than entities. Intuitively, we want to observe the evolution of a relationship and its attributes over time. However, a relationship involves multiple parties (typically two, but possibly many). Therefore, we must distinguish the cases.

  • A relationship between static entities
  • A relationship between a time-varying entity and a static one
  • A relationship between two time-varying entities

Notably, this characterization is orthogonal to the relationship cardinality — one-to-one, one-to-many and many-to-many, as will become clearer. However, the discriminatory aspect is whether time-varying attributes characterize the relationship.

pool-table-game

Let’s dig into the details of each by extending our NBA example. Assume the billiard association wants to extend IRA analytics. In particular, it wants to monitor the relative position of each ball on the table and in collisions and strikes. The billiard balls have a unique texture that allows identification using special cameras pointing at each billiard table and calibrated to a table at match time. From a database modeling perspective, we consider two entities, balls and billiards tables and the many-to-many relationship “position” that tracks the assignment and the time-wise position of the balls on each table.

Balls and billiard tables are static entities identifiable within an entity set using unique IDs. As usual, entities have additional non-identifying attributes, for instance, balls have colors. The relationship between balls and billiard tables describes the position of a ball relative to the table center.

Given the game’s dynamic nature, temporal annotation characterizes this relationship: We are in case 1 — a time-varying relationship that connects two static entities. In particular, the relationship has two time-varying attributes, x and y, that are the relative coordinates of the ball over time.

In our TSDB design, we can represent the relationship like in the case of time-varying entities. The relationship name becomes the measurement, while the names and values of the time-varying attributes constitute fields keys and values, respectively.

many-to-many relationships

Given that multiple matches on different tables can use the same ball, and a billiards table hosts many matches (each with several balls), the relationship has a many-to-many cardinality.

In traditional DB design, we turn many-to-many relationships into so-called “composite entities” that build their primary key by borrowing elements from the entities participating in the relationship. Therefore, the composite key results in several tag keys and values.

Such an approach extends to case 2, when a relationship connects a time-varying entity and a static one. In that case, we have two options:

  • If the relationship has time-varying attributes, then construct a whole new series.
  • Instead, if the relationship has no time-varying attribute itself, we extend the tagset of the series modeling the time-varying side with additional tags representing the static side key and an additional tag representing the static entity name.

A similar method applies when reasoning about cardinality. In relational modeling, we attach a foreign key to either of the sides for one-to-one and many-to-one relationships. On the other hand, in TSBD design, we base our reasoning on the presence of time-varying attributes in the relationship.

In some cases, it is convenient to extend our conceptual modeling by transforming the time-varying relationship between the static entities into a time-varying entity with relationships with static entities. For example, we can represent the POSITION time-varying entity using a series that captures the state of such a relationship over time.

time-varying relationship

As before, we can use InfluxDB’s tags to represent these relationships within the series, as in the figure on the right. At a conceptual level, the series doesn’t represent the two static entities entirely. For example, it omits attributes like ball color. This issue is similar to what we discussed about static attributes of time-varying entities.

Let’s continue our NBA scenario by representing our last case, the relationship between two time-varying entities. In this case, we build on the extended schema we created. Now, we want to represent collisions between balls. To this extent, we consider the self-relationship between the POSITION entity.

While at the conceptual level, it is easy to understand the representation of collision as a relationship between objects in space and time, the representation of such a relationship is less intuitive. Indeed, the difference between COLLIDE and POSITION is the role of time, which in the former helps identify individual collisions. Continuing on the billiard examples, but in a much-simplified version that does not consider the balls’ volumes, we can define a collision as the occurrence of two balls in the same position at a given time instant “t,” — at t, ball1(x1,y1) and ball2(x2,y2) are such that x1==x2 or y1==y2. Such comparison implies a time-based join, comparing the position of the balls for each of the individual time instants. Such operation is not just algorithmically expensive but also unfeasible due to the infinite nature of the stream. Indeed, even considering the order nature of temporal data, the comparison scope remains unbounded.

As we discussed in the “Introduction to Streaming Data,” the common way to approach unboundedness is by using window operators. Therefore, by using a continuous query language like Flux, it is possible to represent time-varying relationships across time-varying entities.

As shown in the query below, the Flux query for the simplified billiard example is still quite sophisticated. It requires a self-join for each dimension to compare (x,y).

However, the most important aspect is the use of the window operator at line 7, which reduces the cardinality of the comparisons later. Modifying the value of the window (60 seconds is the average duration of a turn), we will identify a different number of collisions because the comparison scope will change.

An alternative implementation could use aggregateWindow to downsample the stream elements, but at the expense of the precision of the detection. In such a case, it is essential to choose the aggregation function accurately. For instance, “mean” will approximate the result differently than “last.”

In the query, we use pivoting to align the join condition by comparing the dimensions. This approach allows us to avoid pairwise comparisons for each combination. On the other hand, it requires filtering out those entries representing “self collisions,” including the collisions from the perspective of both the balls.

Finally, we unpivot the series resulting from the comparison of the dimension and using the recent experimental union, and we merge the two series back together.

In conclusion, modeling time-varying relationships between time-varying entities is an advanced task that depends on the window size parameter. A continuous query language like Flux makes this possible, but the task remains quite sophisticated. Finally, the table below summarizes our analysis for data modeling.

Summary table

Entity Entity Relationship
(‘s attributes)
Static Static Static Out of scope
Static Static Time-Varying See Time Varying Entity table 1, plus tags for the measurement name for the entities
Static Time-Varying Time-Varying See Time Varying Entity table 1, plus tags for the measurement name for the entities
Static Time-Varying Static Add tags to series for the right entity for keys attributes and the name of the left entity
Time Varying Time-Varying Time-Varying Requires windowing
import "influxdata/influxdb"

all = from(bucket: "training")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> filter(fn: (r) => r["_measurement"] == "position")
|> group(columns: ["ball", "_field"], mode: "by")
|> window(every: 60s)
|> pivot(rowKey: ["_start", "_time", "_stop"], 
         columnKey: ["_field"], valueColumn: "_value")

//ball cannot be key otherwise i have only self collisions
all1 = all |> group(columns: ["_measurement"], mode: "by")
all2 = all |> group(columns: ["_measurement"], mode: "by")

//ball cannot be key otherwise i have only self collisions
collisionx =  join(tables: {b1: all1, b2: all2}, on: ["_time", "x"])
|> filter(fn: (r) => r.ball_b1 != r.ball_b2)
|> map(fn: (r) => ({r with _measurement: "collision"}))
|> keep(columns: ["_measurement","_field","_value","_time","ball_b1", "ball_b2","x"])
|> group(columns: ["_measurement", "ball_b1", "ball_b2"])
|> experimental.unpivot()

collisiony =  join(tables: {b1: all1, b2: all2}, on: ["_time", "y"])
|> filter(fn: (r) => r.ball_b1 != r.ball_b2)
|> map(fn: (r) => ({r with _measurement: "collision"}))
|> keep(columns: ["_measurement", "_field", "_value", "_time", "ball_b1", "ball_b2", "y"])
|> group(columns: ["_measurement", "ball_b1", "ball_b2"])
|> experimental.unpivot()
//remerge the two series and write
union(tables: [collisionx, collisiony]) |> influxdb.wideTo()