TL;DR InfluxDB Tech Tips - Missing Data with DISTINCT() & the INTO Clause, Unexpected Timestamps with MAX() & More
By Regan Kuchan / Apr 13, 2017 / InfluxDB, Tech Tips
What happens when you cross a function with an
INTO clause or a
GROUP BY time() clause? Occasionally, some curious results. In this post, we describe the unexpected when working with InfluxQL functions in the
Missing data with DISTINCT() and the INTO clause
Q: I’m trying to write the results of the
DISTINCT() function to a different measurement. The only thing is - the results of the query without the
INTO clause and the results of the query with the
INTO clause are different. Is this the expected behavior? I have a feeling I’m missing something.
> SELECT DISTINCT("personality_type") FROM "twins" name: twins time distinct ---- -------- 1970-01-01T00:00:00Z a 1970-01-01T00:00:00Z b <--- Three results! 1970-01-01T00:00:00Z uncategorizable
> SELECT DISTINCT("personality_type") INTO "twin_types" FROM "twins" name: result time written ---- ------- 1970-01-01T00:00:00Z 3 <--- Three written points! > SELECT * FROM "twin_types" name: twin_types time distinct ---- -------- 1970-01-01T00:00:00Z uncategorizable <--- Just one point!
A: The behavior that you’re seeing is the expected behavior; it comes down to how InfluxDB identifies a single point and how it handles duplicate points.
Notice that each result of the
DISTINCT() query without the
INTO clause has the same timestamp. InfluxDB assumes that points in the same series and with the same timestamp are duplicate points.
When you add the
INTO clause to the query, InfluxDB writes three duplicate points to the
twin_types measurement. When the system encounters duplicate points, it simply overwrites the previous point with the most recent point so you end up with only one result in the destination measurement. It’s a little unexpected but that’s the explanation for what you’re seeing.
Unexpected timestamps with MAX() and the GROUP BY time() clause
Q: I’m trying to find the maximum number of eggs for specific time intervals. The results in the
max column are accurate but the timestamps in the
time column aren’t what I expected to see. I’ve included a simplified version of my data below. Why do the timestamps seem off?
> SELECT * FROM "house" name: house time eggs ---- ---- 2017-04-12T20:00:00Z 1 <--- Max value in the first 20-minute interval 2017-04-12T20:10:00Z 0 2017-04-12T20:20:00Z 3 2017-04-12T20:30:00Z 5 <--- Max value in the second 20-minute interval
> SELECT MAX("eggs") FROM "house" WHERE time >= '2017-04-12T20:00:00Z' AND time <= '2017-04-12T20:30:00Z' GROUP BY time(20m) name: house time max ---- --- 2017-04-12T20:00:00Z 1 2017-04-12T20:20:00Z 5 <--- I'd expect this timestamp to be 2017-04-12T20:30:00Z
A: The timestamps that you’re seeing are a consequence of the
GROUP BY time() clause. That clause automatically overrides
MAX()’s original timestamps with timestamps that mark the start of the relevant time interval.
In your case, the second timestamp in the results (
2017-04-12T20:20:00Z) refers to the interval between
2017-04-12T20:20:00Z and just before
2017-04-12T20:30:00Z, not to when the maximum value occurred in the raw data. Almost all* queries that use an InfluxQL function with a
GROUP BY time() clause overwrite timestamps in this way.
- The only exception is the
SAMPLE()function. See the
SAMPLE()documentation for additional information.
Unexpected consistency with ELAPSED() and the GROUP BY time() clause
Q: I’m using
ELAPSED() to calculate the time gap between median values in my data. The results I get (see below) always show the same time gap. I’m having a hard time believing that my median values are always exactly ten minutes apart. Am I using the
ELAPSED() function correctly?
> SELECT ELAPSED(MEDIAN("clicks"),1m) FROM "puppies" WHERE time >= '2017-04-12T18:30:00Z' AND time <= '2017-04-12T19:10:00Z' GROUP BY time(10m) name: puppies time elapsed ---- ------- 2017-04-12T18:40:00Z 10 2017-04-12T18:50:00Z 10 2017-04-12T19:00:00Z 10
A: You’re using the
ELAPSED() function correctly from a syntax perspective but your query wont give you the results that you’re looking for. When you use
ELAPSED() with a nested function and a
GROUP BY time() clause, InfluxDB always returns elapsed values that are equal to the
GROUP BY time() interval.
When you run your query, InfluxDB first calculates the results for the nested function (
MEDIAN()) at the specified
GROUP BY time() intervals and then it applies the
ELAPSED() function to those results. In the first step, the
GROUP BY time() clause automatically overrides
MEDIAN()’s original timestamps with timestamps that mark the start of the each
GROUP BY time() interval. When the system applies the
ELAPSED() function to the results of that first step, it encounters timestamps that are always ten minutes apart and not the original timestamps of the
- Downloads for the TICK-stack are live on our "downloads" page.
- Deploy on the Cloud: Get started with a FREE trial of InfluxDB Cloud featuring fully-managed clusters, Kapacitor and Grafana.
- Deploy on Your Servers: Want to run InfluxDB clusters on your servers? Try a FREE 14-day trial. of InfluxDB Enterprise featuring an intuitive UI for deploying, monitoring, and rebalancing clusters, plus managing backups and restores.
- Tell Your Story: Over 300 companies have shared their story on how InfluxDB is helping them succeed. Submit your testimonial and get a limited edition hoodie as a thank you.