Backfill Missing Time Series With SQL

Time series data streams are often noisy and irregular. But it doesn’t matter if the cause of the irregularity is a network error, jittery sensor, or power outage – advanced analytical tools, machine learning, and artificial intelligence models require their data inputs to include data sets with fixed time intervals. This makes the process of filling in all missing rows and values a necessary part of the data cleaning and basic analysis process.

SQL became InfluxDB’s main query language as we fully embraced the Apache ecosystem, but we had to build some functionality to fill in the gaps between the traditional capabilities of SQL and what time series data workloads require. One of the custom query functions we built to address missing data and inconsistent time windows is date_bin_gapfill.

Adding rows

date_bin_gapfill calculates the time interval based on specific parameters determined by the user, and if no rows exist within a certain time interval, the function inserts a new row with a time value set to the time interval start with a null data value.

Sample query:

SELECT
  date_bin_gapfill( '30 minutes', time) as _time,
  room,
  avg(temp) as temp
FROM home
WHERE
	time >= '2023-04-27T08:00:00Z'
	AND time <= '2023-04-27T10:00:00Z'
GROUP BY _time, room

Results:

SQL - Adding Rows

Adding rows and inserting data

Interpolate

Adding interpolate inside the date_bin_gapfill function replaces the null data values with linearly interpolated values between the last value and the next value. Be sure to place this before the aggregation function (avg in the query below) so that it includes all values when it runs.

Sample query:

SELECT
  date_bin_gapfill('30 minutes', time) as _time,
  room,
  interpolate(avg(temp))
FROM home
WHERE
	time >= '2023-04-27T08:00:00Z'
	AND time <= '2023-04-27T10:00:00Z'
  GROUP BY _time, room

Results:

Adding rows and inserting data

LOCF

locf stands for last observation carried forward. Adding locf before the aggregation function replaces the null values with the last value.

SELECT
  date_bin_gapfill( '30 minutes', time) as _time,
  room,
 locf(avg(temp))
FROM home
WHERE
	time >= '2023-04-27T08:00:00Z'
	AND time <= '2023-04-27T10:00:00Z'
GROUP BY _time, room

Results:

LOCF

Conclusion

This demo video shows the powerful function in action.

For more details on working with SQL and time series data, check out our docs.

To start building something cool with InfluxDB and SQL, sign up for your free cloud account today.