Florian Rampp, Member of the Technical Staff at tado GmbH, will share how they use InfluxData to gather and serve analytics data collected from their Smart AC Control unit to help turn any home air conditioner smart. This device uses a variety of information collected (geo-location, temperature, user settings, current device functional state) to serve information to automatically control the environment temperature as well as letting users know when the device may need maintenance.
Watch the webinar “tado uses InfluxData to make any air conditioner smart” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “tado uses InfluxData to make any air conditioner smart.” This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Florian Rampp: Member of the Technical Staff, tado GmbH
Florian Rampp 00:01.410 Okay. So I hope the screen sharing now works. Welcome to my webinar about how we use InfluxDB at tado. Tado is a company based in Munich in Southern Germany. We were founded in 2011, currently having about 120 employees, and what we actually do is home climate control. I’ll quickly talk about our product in a second, just our mission first, maybe. So we try to save energy while not sacrificing comfort by more smartly controlling heating and air conditioning systems. Quickly, about me, I started at tado three years ago. I am a member of the team which is responsible here for our server applications, for the cloud infrastructure and deployment, and various databases and back-end applications.
Florian Rampp 00:57.745 I quickly want to start by introducing our products, and introducing where time series data for tado comes into play, and what we understand with time series data. The talk will not be so much about the basics of InfluxDB, but more on how we use it, with a focus on our main application for time series data, which is a temperature graph that we actually offer customers to inspect how their heating was working in the past. We call this the user report. I will quickly then wrap up with how we write to Influx and a few lessons learned. So let’s get started.
Florian Rampp 01:39.244 What does tado actually offer? We have basically two sets of products. One is for controlling heating systems and one is for controlling air conditioning units. So the first heating-related product is our smart thermostat you can see here. This is basically replacing your room thermostat at home, and it’s battery-operated. Next to it, you can see a fairly new product we launched a few months ago, which is the tado radiator thermostat, and you replace your existing radiator thermostats with it. It’s also battery-powered with normal batteries. The room thermostat here has three AAA batteries, so very standard. Another device for controlling heating that sometimes is required for heating installations is the extension kit, here on the left. That’s required to be wired to the boiler in a few cases, mainly where the room thermostat talks to the boiler with a radio frequency connection. And the last part, I’m going to talk about that in a second as well, is the Internet Bridge, which offers the connectivity for the other devices. On the other hand, we have our smart AC control as the second product group. It’s not battery-operated. It has a power plug. And it uses WiFi as a protocol, so it doesn’t require an Internet Bridge. And it sends infrared commands to air conditioning units and thus is able to control the air conditioning. So with all the tado products comes the tado app on various platforms. And if you open the app for the first time and you install tado, you can see, for example, the current temperature of your living room and a few other things like the humidity. And very important, at the bottom, we have some visualization of some of the core logic of tado, which is the geolocation detection.
Florian Rampp 03:52.859 Basically, geolocation detection distinguishes between two modes. So if any of the residents is at home, then tado is considered to be in home mode that is indicated by this home symbol here. If none of the residents is at home, tado is considered to be in away mode. And then the symbol will change and indicate some away state, and the background turns into a green background. So blue stands for home mode, and green stands for away mode. So the idea is that tado then, based on this detection, switches off your AC or your heating when you leave and turns it on again when you arrive home or before you arrive at home. On top of that, what you can configure is a time-based schedule for different weekday types. So you can configure what the AC should do and which temperatures should be achieved through the schedule. Then one of the—or the third thing which is important for our customers is what we call the user report where customers can see one day of, for example, the temperature, which is the gray line here, or the humidity. It’s fairly idealized on this graph, so normally it’s a bit more spiky. They can see what the desired setting is right now. And at the bottom, very important, is whether tado was in home or was in away mode. So more about this in one second.
Florian Rampp 05:28.238 So quickly, why do we deal with time series data here? Or what is time series data for tado, and what’s the history? So let me begin by quickly explaining how sensor data for us is generated. So one important thing to start with is that the devices we have are battery-operated. That really influenced the overall system design quite a bit. For example, we try not to have too much radio frequency communication because that’s what’s basically draining the battery. So it’s less the microcontroller which is draining the battery but more the RF communication. So that’s why we try to avoid transmitting inside temperatures every few seconds or so. That’s also why we are adopting this low-energy protocol called 6LoWPAN. It’s basically a wireless protocol for low-powered devices, and that’s how the devices communicate with our Internet Bridge, which is connected to the home router. Devices report inside temperatures as soon as the inside temperature of a room changes by more than 0.1 degrees. So it can potentially take quite a few minutes until a change is detected and sent to the server. Basically, this measurement is then wrapped in a message and sent through a secure WebSocket to our cloud infrastructure, which is a bunch of EC2 instances behind some load balancing at Amazon Web Services. From there, we then write it to our time series data store. So what’s the time series data store? What’s the history here? When tado had about 500 customers, time series was fairly simple. We had one MySQL table which we just wrote to using Hibernate, basically domain objects written to this MySQL table, fairly straightforward.
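The reporting rule described above (a device sends a temperature only once it has moved by more than 0.1 degrees since the last report) can be sketched in a few lines. This is an illustration only, not tado’s firmware; the class name `ThresholdReporter` and the sample readings are made up.

```python
class ThresholdReporter:
    """Decides whether a reading is worth a radio transmission.

    Hypothetical sketch: a reading is reported only when it differs from
    the last *reported* value by more than `threshold`.
    """

    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.last_reported = None

    def observe(self, value):
        """Return True if `value` should be sent to the server."""
        if self.last_reported is None or abs(value - self.last_reported) > self.threshold:
            self.last_reported = value
            return True
        return False


reporter = ThresholdReporter(threshold=0.1)
readings = [21.0, 21.05, 21.08, 21.2, 21.25, 20.9]
# Only readings that moved far enough from the last reported value are sent.
sent = [r for r in readings if reporter.observe(r)]
```

A side effect of this rule is that the resulting series is irregular in time, which is why the read side later needs bucketing and fill logic.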
Florian Rampp 07:42.885 This, of course, didn’t scale so well, so when our customer base increased a bit, we replaced that by our custom solution. So we bootstrapped a different MySQL instance, configured it differently, optimized it for lower transaction levels because basically, the only thing we did was appending to tables. We used a feature called MySQL table partitioning and we asynchronously appended to these tables. In the end, we ended up having a MySQL instance where we had one table with partitions per time series type. For example, one table for all device inside temperatures. That also didn’t scale forever, so when we reached around 100,000 customers, we introduced InfluxDB. So that’s quite some time ago and we hope or believe that it will probably scale to our current requirements of millions of customers that we are facing right now. Okay. So let’s dive into the customer temperature graph or what we call the user report. So it looks roughly like this if you open it in your app. And you’re going to see one full day of data, and you can swipe between days. So if you would swipe to the left here, you would go to the day before today, to yesterday, the day before yesterday, and so on.
Florian Rampp 09:20.478 When the report is opened in the app for the first time, you always see the current day, so today. As I mentioned already, there’s this temperature graph, and the bar at the bottom indicates whether tado is in home mode. So blue means home. That means at least one resident is at home. Or tado is in away mode. That’s the green bar, and that shows that all residents are gone. There is some additional data here, like the humidity, as I mentioned already. But let’s ignore that for this presentation. Okay. So we have a current version of our user report. It is powered by this MySQL implementation, the second step, and it has become very slow. So when customers open their user report, it takes seconds to load right now. Also, it’s lacking a few features, like a humidity graph, and it also doesn’t show the weather data, so we decided to implement a new version of that, and we based it on InfluxDB.
So what I will present now is our learnings from implementing this new version of the user report. What we are measuring right now with the current report is a peak rate of about 10 requests for the report per second. And most of the requests are for the data of the current day, so of today. What we are aiming for with Influx now is that we want to have response times on the 99th percentile of 500 milliseconds, and we want to keep data for 12 months, so being able to show customers their history for a full year.
Florian Rampp 11:23.191 So what we basically did to start checking out Influx, how it performs and whether it fulfills these requirements, is that we multiplexed the current read and write load for the report onto Influx. So whenever a customer requests the current report, we just, in a different thread, basically branch off the logic for InfluxDB and discard the result. So it’s not on the critical path, but this way, we were able to realistically simulate the read and the write load.
Florian Rampp 12:01.756 Okay, so what are the different types of time series we’re looking at? Basically, we’re distinguishing between two types of time series here. The first type is something where inside temperature is a very good example. It’s time series that continuously change, like physical measurements which are continuously measured, and we want to plot them as a graph inside the user report. Therefore what the server should send to the mobile apps is a list of points with 15-minute distances. So what we in the end want to send through our APIs to the mobile apps is, “Hey, at midnight, it’s 64.5 degrees Fahrenheit, quarter past midnight 64.1 degrees Fahrenheit, half past midnight 63.9 degrees Fahrenheit and so on.” Fairly straightforward.
Florian Rampp 12:58.877 The second type of time series is something we call discrete state time series. And I think that’s where things get a little special for us and where the challenges started. So what we understand as discrete state time series is time series which are rather sparse in their data. It’s things that might not change often, for sure not every minute or every hour, but sometimes just a couple of times per day or week. One example is home versus away. It might just change once or twice per day, or sometimes if people are on vacation it doesn’t change for two weeks, it’s always away. So what we want to return to the mobile apps for these cases is a list of intervals with the current value. For example, we would like to return to the mobile apps: from midnight to 7:43, tado is in home mode. From 7:43 until 16:13, tado is in away mode, and from 16:13 to midnight again next day, tado is in home mode. So what we care about is that these time series have exact timestamps, so they’re not aligned to some quarter-hour grid or so, but we really want to show in the report the exact time that tado went from home to away mode. There are some more examples here that change even more rarely, but that’s, I think, basically the requirement.
Florian Rampp 14:26.245 When we started looking into that, we opened a feature request for Influx to support this type of series better, so to make Influx return these intervals for us already. That’s feature request number 7581, which we opened and which is still pending. We’re aware that this is kind of a change for InfluxDB, so we implemented a workaround: we have to issue different types of queries, which I will show in a second, to basically simulate this functionality.
Florian Rampp 14:59.738 So quickly, what does the schema look like for our user report? So we have a measurement called home and one single tag, the home ID. A home is what we call an account at tado. One home has multiple residents, so consider a home as being like a central account. Every home has a unique ID. Then for the sake of this conversation, we just look at two fields, the inside temperature and the tado mode. And we have a retention policy keeping data for 12 months. Okay, so let’s look at the first type of time series. What query do we need to send to InfluxDB for requesting the inside temperature, let’s say for example, for yesterday, the 15th of May 2017? First thing we want to do is we want to get all the data points in the query interval. So the query interval is the full day. So the query looks like: select the inside temperature field from the home measurement where the home ID is 42 and the time is greater than or equal to the start timestamp of our query, so midnight, 15th of May, and time is less than midnight on the 16th of May.
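To make the schema concrete, here is a hedged sketch of how one point for this measurement could be rendered in InfluxDB line protocol (measurement, tag set, field set, nanosecond timestamp). The identifiers `homeId`, `insideTemperature` and `tadoMode`, and the choice of a string field for the mode, are assumptions for illustration; the talk does not give the real names or types.

```python
from datetime import datetime, timezone


def to_line(home_id, ts, inside_temperature=None, tado_mode=None):
    """Render one point of the assumed `home` measurement as line protocol.

    Sketch only: tag and field values are not escaped, so this is safe only
    for simple values such as integers and plain upper-case words.
    """
    fields = []
    if inside_temperature is not None:
        fields.append("insideTemperature={}".format(float(inside_temperature)))
    if tado_mode is not None:
        # String field values are double-quoted in line protocol.
        fields.append('tadoMode="{}"'.format(tado_mode))
    if not fields:
        raise ValueError("a point needs at least one field")
    nanos = int(ts.timestamp()) * 10**9  # whole-second precision is enough here
    return "home,homeId={} {} {}".format(home_id, ",".join(fields), nanos)


line = to_line(
    42,
    datetime(2017, 5, 15, 2, 0, tzinfo=timezone.utc),
    inside_temperature=21.3,
    tado_mode="HOME",
)
```

Keeping the home ID as the only tag means every query in the report can be narrowed to one series by a single tag filter.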
Florian Rampp 16:28.219 First thing you probably realize, our customers live in different time zones, so they request not UTC-aligned days but days in their time zone. That’s, for example, why here we have the time 2:00 AM UTC. And finally, we group this by time, 900 seconds, which corresponds to a quarter of an hour. And we use fill(previous) to fill empty buckets with the value of the previous bucket. So that sounds fair enough, but we have one more requirement. There might be empty buckets in the very beginning. That’s why we need an additional type of query that I want to show, but first let me quickly illustrate the problem again. So assume we have one inside temperature written at 1:42 and another one written at 2:34. Now as we just saw, with the query starting at UTC time 2:00 AM, that would result in the first two buckets, let’s assume 15-minute buckets here, being empty. The third bucket then has the inside temperature measurement in it, so it would have the value that’s measured here, and the fourth bucket has the same value due to the fill(previous), that’s what fills the fourth bucket. But what we want now is to also provide values for the first two buckets to the mobile apps on the API. And what we want to do is provide the inside temperature reported at 1:42 here. That’s why we have this one additional query for the inside temperatures, or for any continuously changing time series, and that is using the LAST query of InfluxDB.
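The bucketed query and the leading-empty-bucket problem can be sketched as follows. Two things here are assumptions rather than facts from the talk: the field and tag names, and the use of `MEAN`. InfluxQL requires an aggregate or selector function when grouping by time, and the talk does not say which one tado uses. `bucketize` is a small local simulation of `GROUP BY time(900s) fill(previous)`, not a call to InfluxDB.

```python
from datetime import datetime, timedelta, timezone


def influx_time(ts):
    """Format an aware datetime as an RFC 3339 UTC literal for InfluxQL."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_bucket_query(home_id, start, end, field="insideTemperature"):
    """Build the 15-minute bucket query (values are interpolated unescaped)."""
    return (
        'SELECT MEAN("{field}") FROM "home" '
        "WHERE \"homeId\" = '{home_id}' "
        "AND time >= '{start}' AND time < '{end}' "
        "GROUP BY time(900s) fill(previous)"
    ).format(field=field, home_id=home_id,
             start=influx_time(start), end=influx_time(end))


def bucketize(points, start, end, step=timedelta(minutes=15)):
    """Simulate GROUP BY time(step) fill(previous) on (timestamp, value) pairs.

    Buckets before the first in-range point stay None, because there is no
    previous bucket to copy from.
    """
    n = int((end - start) / step)
    sums, counts = [0.0] * n, [0] * n
    for ts, value in points:
        if start <= ts < end:
            i = int((ts - start) / step)
            sums[i] += value
            counts[i] += 1
    out, previous = [], None
    for i in range(n):
        if counts[i]:
            previous = sums[i] / counts[i]
        out.append(previous)
    return out


start = datetime(2017, 5, 15, 2, 0, tzinfo=timezone.utc)  # local midnight in the talk's example
query = build_bucket_query(42, start, start + timedelta(days=1))

# The talk's example: one reading at 1:42 (before the window), one at 2:34.
points = [(start - timedelta(minutes=18), 64.5), (start + timedelta(minutes=34), 64.1)]
first_hour = bucketize(points, start, start + timedelta(hours=1))
```

In this simulation `first_hour` starts with two empty buckets, which is exactly the gap the additional query has to close.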
Florian Rampp 18:23.466 So we do select last inside temperature from the home measurement where home ID is 42, and time is greater than or equal to some lower bound and time is smaller than the start of the query period. So this timestamp here corresponds to this timestamp here. So we look before the start of the query, which is 2:00, or midnight in the time zone of the home on the 15th of May, and what we have as the lower bound is the time of the account creation. So we know there cannot be any inside temperatures before the account or the home was created. That’s why we can use that as a lower bound, and we consider this mainly to be an optimization for the LAST query, so that it doesn’t have to go back in time indefinitely. Let’s assume the home was created on the 3rd of January. So that’s the first type of time series. For inside temperatures, we have these two types of queries. We have the query for getting basically the raw data aligned to 15-minute buckets using the group by clause here, and we have the LAST query that we use, in the case that we have empty buckets in the beginning, to fill these buckets with the last value.
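A minimal sketch of the second query and of the merge step it enables: the `LAST` value found before the window fills whatever buckets are still empty at the start of the day. Field and tag names are assumptions, and `backfill_leading` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def influx_time(ts):
    """Format an aware datetime as an RFC 3339 UTC literal for InfluxQL."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_last_query(home_id, account_created, start, field="insideTemperature"):
    """Last value strictly before `start`, bounded below by account creation."""
    return (
        'SELECT LAST("{field}") FROM "home" '
        "WHERE \"homeId\" = '{home_id}' "
        "AND time >= '{lower}' AND time < '{start}'"
    ).format(field=field, home_id=home_id,
             lower=influx_time(account_created), start=influx_time(start))


def backfill_leading(buckets, last_before_start):
    """Replace leading None buckets with the value found by the LAST query.

    Buckets after the first real value are left alone; fill(previous) has
    already taken care of those.
    """
    out = list(buckets)
    for i, value in enumerate(out):
        if value is not None:
            break
        out[i] = last_before_start
    return out


last_query = build_last_query(
    42,
    account_created=datetime(2017, 1, 3, tzinfo=timezone.utc),
    start=datetime(2017, 5, 15, 2, 0, tzinfo=timezone.utc),
)
filled = backfill_leading([None, None, 64.1, 64.1], 64.5)
```

If the home has no reading at all before the window, the `LAST` query returns nothing and the leading buckets simply stay empty.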
Florian Rampp 19:54.247 Next type of queries, the discrete state queries. So let’s look at the tado mode example. We are querying the tado mode for the 15th of May again, so what type of queries do we need for that? The first is the query for getting the raw values within the query interval. It’s quite simple again, so select the tado mode from the home measurement where home ID is 42 and time is greater than the beginning of the query interval, the midnight of the 15th, and time is less than midnight of the 16th, so querying the full day. Second type of query, and it’s a similar problem to the one we are having for the continuous ones, is that we want to have a value for this time series at the beginning of the query interval. That’s why we need another LAST query here. So we do select last tado mode from home, where home ID is 42, and we use the same time constraint as for the continuous queries. We have this lower bound for when the account was created as an optimization, and time should be less than the start of the query interval, so midnight of the 15th. So that’s how we find the value right at the start of the query interval so that we can return the correct intervals to the mobile apps. So what’s required for that? Some in-memory post-processing. So what we do is we build these intervals as mentioned before from the data above. So from this last value and from these raw values, that’s how we can build: from midnight to 7:43, it is home mode, and that would be the value of the LAST query.
Florian Rampp 21:44.325 And then at 7:43, we find a data point inside the interval. From 7:43 until the next point from this query, it would be away. What can also happen is that we have some duplicates, so we have consecutive points in a time series with the same value, and we want to get rid of these because we don’t want to return to the mobile apps two consecutive intervals with the same value. So all adjacent intervals should have different values.
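The in-memory post-processing described above might look roughly like this: start from the value returned by the `LAST` query, walk the raw points in time order, skip consecutive duplicates, and close the final interval at the end of the day. The function name and the tuple layout are illustrative choices, not tado’s code.

```python
from datetime import datetime


def build_intervals(initial_value, changes, start, end):
    """Turn a starting value plus raw points into (from, to, value) intervals.

    `initial_value` is the LAST value before `start` (or None if unknown).
    `changes` must be sorted by timestamp and lie within [start, end).
    """
    intervals = []
    current_value, current_from = initial_value, start
    for ts, value in changes:
        if value == current_value:
            continue  # consecutive duplicate: extend the running interval
        if current_value is not None and ts > current_from:
            intervals.append((current_from, ts, current_value))
        current_value, current_from = value, ts
    if current_value is not None:
        intervals.append((current_from, end, current_value))
    return intervals


day = datetime(2017, 5, 15)  # naive local time, for brevity
intervals = build_intervals(
    "HOME",
    [
        (day.replace(hour=7, minute=43), "AWAY"),
        (day.replace(hour=9, minute=0), "AWAY"),   # duplicate, dropped
        (day.replace(hour=16, minute=13), "HOME"),
    ],
    start=day,
    end=datetime(2017, 5, 16),
)
```

The timestamps are kept exactly as written, with no alignment to a 15-minute grid, which is the requirement stated for discrete state series.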
Florian Rampp 22:20.074 Some optimizations we can do now: we can combine multiple queries, for example the ones for the last values, into one query. So instead of having one select last inside temperature and a second select last power mode, we can combine all selects that go through the same measurement, the same set of tags, and that have the same time constraint. We can bundle them up into a single query. So we can say, “Select last inside temperature, last tado mode,” and so on. The problem with this here is that it will return “epoch time zero”. So whenever you have two LAST selectors in here and the timestamps of these two last points don’t match, InfluxDB returns an epoch time of zero, so the 1st of January, 1970. That doesn’t matter for us because we’re not really interested in the timestamps of these points. We’re just interested in their values to fill the first bucket or the first interval.
Florian Rampp 23:25.972 The second optimization we’re doing is that we’re issuing all the queries, so the queries for the last values and for the raw data inside the time series, in one HTTP request to InfluxDB. There’s a minor thing we stumbled over that we had to solve. When you do this, you get a response containing multiple result objects in JSON, and you can only find out which result object belongs to which query by its position in the returned JSON list. But that’s a minor inconvenience, we think, and it’s easy to solve.
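One way to handle the positional matching is to keep an ordered list of query names next to the semicolon-joined statement string and zip it with the `results` array of the response. The sketch below does no network I/O; the sample response only imitates the shape of an InfluxDB 1.x JSON reply, and its values, like the field and tag names, are invented.

```python
def join_statements(named_queries):
    """Return (names, q): names in order, and the statements joined with ';'."""
    names = list(named_queries)
    return names, "; ".join(named_queries[name] for name in names)


def split_results(names, response):
    """Pair each result object with the name of the statement at that position."""
    results = response.get("results", [])
    if len(results) != len(names):
        raise ValueError("expected {} results, got {}".format(len(names), len(results)))
    return dict(zip(names, results))


where = "WHERE \"homeId\" = '42'"
names, q = join_statements({
    "last_before_start": 'SELECT LAST("tadoMode") AS "tadoMode" FROM "home" ' + where
                         + " AND time >= '2017-01-03T00:00:00Z' AND time < '2017-05-15T02:00:00Z'",
    "raw": 'SELECT "tadoMode" FROM "home" ' + where
           + " AND time >= '2017-05-15T02:00:00Z' AND time < '2017-05-16T02:00:00Z'",
})

# Made-up response in the general shape of an InfluxDB 1.x /query reply.
sample_response = {
    "results": [
        {"series": [{"name": "home", "columns": ["time", "tadoMode"],
                     "values": [["2017-05-14T18:02:11Z", "HOME"]]}]},
        {"series": [{"name": "home", "columns": ["time", "tadoMode"],
                     "values": [["2017-05-15T09:43:00Z", "AWAY"]]}]},
    ]
}
by_name = split_results(names, sample_response)
```

The string `q` would be sent as the `q` parameter of a single request; checking the result count up front turns a silent mismatch into an explicit error.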
Florian Rampp 24:07.299 Okay. So, as you have realized, the LAST queries are very important for our use cases. And it turns out, when we started, they were very slow. On InfluxDB 1.0 they were so slow that the overall user report response time we measured, by again multiplexing, was 5 to 10 seconds. So not acceptable, and on the 99th percentile we wanted to have 500 milliseconds. So we tried to figure out different solutions here. So we tried to use LIMIT 1 instead of LAST. It didn’t really work. So how did we actually find out that it was the LAST query? We worked together with InfluxDB staff and they reported to us that most resources are used within the LAST queries.
Florian Rampp 25:00.352 So we tried to replace the LAST queries with LIMIT 1, for example: order by time descending, LIMIT 1. Didn’t work either. Very slow. So, we tried to look into continuous queries, which continuously, every quarter of an hour, write data into the time series so we don’t have to do the LAST queries at all. It turned out to be quite hard due to some restrictions with time zones as well and some InfluxDB syntax restrictions. Anyway. So, we had a plan ready to basically solve the problems ourselves. But we thought about upgrading to InfluxDB 1.1 first, just as a convenience, and it turned out to be the miracle of the 15th of November, as we call it here. Because when doing the upgrade to 1.1 on the 15th of November, it turned out the LAST queries were sped up by a factor of 200. So what happened was very surprising to us. So we investigated a bit and we stumbled over the pull request 7494 included in the 1.1 release, which basically featured this change here in the engine.go file of InfluxDB.
Florian Rampp 26:13.943 So what we see here is that this pull request somehow changed the sort order and thus, the LAST query does not need to iterate over all points to find the last one. It just uses ascending false, limit 1, which for some reason wasn’t done before, and it sped up the queries by a factor of 200 for us. And I think that was kind of the breakthrough for us in implementing the user report with Influx. So that’s about the query part of Influx. Let’s quickly look into how we are writing to InfluxDB. So, what we are using right now: we have a hosted instance on Influx Cloud, size megawatt one, using two data nodes as a replicated cluster.
On our side, we have 5 to 10 application instances which write to this cluster in parallel with a total write rate of roughly 5,000 points per second. So all application instances together have a write rate of 5,000 points per second. We first tried to write every single point individually, which, of course, turned out to be a very bad idea, using a lot of HTTP connections. So, we realized we need to use batching here. So, we introduced some batch writing, which means we either wait for 500 points to arrive or for 200 milliseconds, whichever happens first, and then write the full batch of accumulated points to InfluxDB.
Florian Rampp 27:54.105 The write times that we observed for these batches are that the mean time was 10 milliseconds for a write, and the 95th percentile is 50 milliseconds, so very fast, actually. Only, what’s bothering us quite a bit is that there are quite frequently writes which take up to 10 seconds. So, these spikes of up to 10 seconds, you can see in the graphs here. This one’s taking 2.5 seconds, this one’s taking 1.5 seconds. It’s mainly individual write requests, but they take quite long and we assume it’s related to InfluxDB doing compactions while we write.
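The batching rule just described (flush at 500 points or after 200 milliseconds, whichever comes first) can be sketched independently of any library. Later in the talk the real implementation is said to use Hystrix’s collapsing feature; the class below is a simplified stand-in with hypothetical names. It is single-threaded and relies on `tick()` being called periodically, for example from a timer, and it takes an injectable clock so the behaviour can be demonstrated without sleeping.

```python
import time


class Batcher:
    """Accumulates points and hands full or stale batches to `flush`."""

    def __init__(self, flush, max_points=500, max_wait_s=0.2, clock=time.monotonic):
        self._flush = flush
        self._max_points = max_points
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buffer = []
        self._first_at = None  # arrival time of the oldest buffered point

    def add(self, point):
        if not self._buffer:
            self._first_at = self._clock()
        self._buffer.append(point)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a partial batch is flushed once it is stale."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._buffer:
            return
        full = len(self._buffer) >= self._max_points
        stale = self._clock() - self._first_at >= self._max_wait_s
        if full or stale:
            batch, self._buffer = self._buffer, []
            self._flush(batch)


# Demo with a fake clock and a tiny batch size.
now = [0.0]
batches = []
batcher = Batcher(flush=batches.append, max_points=3, max_wait_s=0.2, clock=lambda: now[0])
for point in ("p1", "p2", "p3"):
    batcher.add(point)   # the third point fills the batch
batcher.add("p4")        # starts a new, partial batch
now[0] = 0.25
batcher.tick()           # 250 ms later the partial batch is flushed
```

The time limit bounds how stale a point can get under low load, while the size limit bounds request size under high load.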
Florian Rampp 28:38.854 So one thing I also wanted to quickly dive into, which is not super related to InfluxDB: we tried to isolate the performance of InfluxDB from our application, to make our system resilient and fault-tolerant, and therefore we use a library from Netflix called Hystrix. It’s a Java library, and it allows us to isolate system failures and introduces tolerance against latency. What it actually does is prevent failures of downstream systems from affecting upstream systems. So it prevents the cascading of failures and latency of downstream systems from in any way affecting upstream systems. How it does that is, amongst other things, by a pattern called circuit breaker. So a circuit breaker, as in the real world, has two states: it is either tripped, or open, or it’s closed. As long as the circuit breaker is closed, all the writes go to InfluxDB. But as soon as a certain rate of writes fails, the circuit breaker opens, and the writes don’t go to InfluxDB at all but fail early. So that’s called load shedding. So we prevent overloading InfluxDB. If it’s slow already, there’s no point in shooting more requests at it. So that’s what the circuit breaker prevents. After a certain timeout, individual write requests would go through to test if InfluxDB, or any downstream system, is up again. And if that succeeds, the circuit breaker would close and all write requests would go to InfluxDB again.
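For readers unfamiliar with the pattern, here is a deliberately simplified circuit breaker in Python. It is not Hystrix and does not mirror its API: Hystrix is a Java library and trips on an error percentage over a rolling window, whereas this sketch trips on a count of consecutive failures. All names are hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream system."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=5.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, fn, *args):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                raise CircuitOpenError("failing fast, downstream considered unhealthy")
            # Timeout elapsed: let this one call through as a trial request.
        try:
            result = fn(*args)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result


# Demo with a fake clock: two failures open the breaker, the third call is shed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=5.0, clock=lambda: now[0])
calls = []


def flaky_write(batch):
    calls.append(batch)
    raise RuntimeError("InfluxDB unavailable")


def healthy_write(batch):
    calls.append(batch)
    return len(batch)


outcomes = []
for batch in (["p1"], ["p2"], ["p3"]):
    try:
        breaker.call(flaky_write, batch)
    except CircuitOpenError:
        outcomes.append("shed")
    except RuntimeError:
        outcomes.append("failed")

now[0] = 6.0  # past the reset timeout: the next call is the trial request
outcomes.append(breaker.call(healthy_write, ["p4"]))
```

Note that the shed batch `["p3"]` never reaches the write function at all, which is the load-shedding effect described in the talk.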
Florian Rampp 30:25.823 What Netflix Hystrix also offers is a built-in collapsing feature that we use to implement the batching for the writes that I already mentioned. And the last thing that we intensely use is something called a fallback. So a fallback for Hystrix is an alternative way of execution when the primary command fails. So if the batch writing fails, for example because the circuit breaker is open, then the fallback allows us to do something else. And what something else means for us is this: when everything goes well, we write to InfluxDB. If this fails, the fallback writes the full batch of data we wanted to write to Influx into an S3 bucket, into one file in the S3 bucket. Then, we have an additional periodic job, which is implemented as an AWS Lambda function, which queries for any files in this S3 bucket and tries to write them to InfluxDB. If it doesn’t succeed, it tries again a few seconds later or a minute later.
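The fallback-and-replay flow can be outlined as below. A plain dictionary stands in for the S3 bucket and a simple function for the periodic Lambda job, so no AWS calls are made. The key naming and the choice to delete a parked batch only after a successful replay are assumptions; the talk does not describe those details.

```python
def write_with_fallback(batch, primary_write, fallback_store, key):
    """Try the primary write; on any failure, park the whole batch under `key`."""
    try:
        primary_write(batch)
        return "written"
    except Exception:
        fallback_store[key] = list(batch)
        return "parked"


def replay_parked(primary_write, fallback_store):
    """Periodic job: retry parked batches, removing only those that succeed."""
    replayed = 0
    for key in sorted(fallback_store):  # sorted() copies the keys, so deleting is safe
        try:
            primary_write(fallback_store[key])
        except Exception:
            continue  # leave it parked for the next run
        del fallback_store[key]
        replayed += 1
    return replayed


# Demo: the "database" is down for the first write and the first replay.
influx_up = [False]
written = []


def primary_write(batch):
    if not influx_up[0]:
        raise ConnectionError("InfluxDB unavailable")
    written.extend(batch)


store = {}  # stand-in for the S3 bucket
first = write_with_fallback(["a", "b"], primary_write, store, "batch-0001")
replayed_while_down = replay_parked(primary_write, store)
influx_up[0] = True
replayed_after_recovery = replay_parked(primary_write, store)
```

Replayed points arrive late and out of order, which is acceptable here because each point carries its own timestamp.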
Florian Rampp 31:36.090 So this way, we have reliable writing to InfluxDB. Okay. I’m already coming to some conclusions. So with InfluxDB, we had a pretty bumpy start. The old versions didn’t support the things we need. The LAST query was only introduced in version 1.0, but it was very slow. In the beginning, it took us quite some time to understand the data model and its implications as well. Should there be multiple measurements? Should we somehow encode the time series type in measurement names or in tags? So which set of tags should we use, which set of fields and so on? We were wondering in the beginning also about the clustering aspects, and which retention policies should be used. What’s the quorum for writes? So when should a write to a cluster be considered successful? So there were quite a few things that bothered us. We tried to tune the batch sizes. But in the end, it was quite some experimentation, and what proved to be very useful was having this multiplexing to InfluxDB in parallel to still serving the old report.
Florian Rampp 32:48.982 So what’s the state right now? The new user report is implemented as mentioned. It will be released within the next few weeks. And we are very excited about that, so we are looking forward to the release and to seeing how InfluxDB then really performs under real load, basically. When the customers realize the new report is cool and very fast, they will use it more, so we expect higher load here. So that’s why we’re looking forward to that. In the future, we also want to implement a few more features with InfluxDB and the TICK Stack in general. One thing we might look into is some streaming analysis using Kapacitor. So we have this requirement that we want to trigger certain actions on the server side when inside temperatures are above a certain threshold. A second thing that we want to do is to improve the back-end graphing that we are using for debugging customer problems, and for that we might use Chronograf. So that’s our experience with Influx. Thanks for listening so far. And I think we are open for questions now. Chris?
Chris Churilo 34:05.357 Wow. Thank you so much, Florian. That was really great, really detailed. And I really appreciate, and I’m sure everyone, all of our attendees also appreciate that you’re sharing with us your detailed experiences with using InfluxDB. Okay. If we don’t have any more questions, then I’m going to let Florian go because I know it’s a little bit late in his time zone. I do want to thank him once again. It was a really great presentation. I feel like it covered the challenges that you guys had really well to help us understand how you’re able to use Influx data with your particular use case. Just one more reminder, I’ll have this recording up. So everyone can take a listen to it. If you do have questions later on, please just shoot me a line and I’ll make sure I forward these questions to Florian. So don’t be shy about that.
Florian Rampp 34:58.299 Thank you, Chris. Thanks for offering the platform to share our experience. And I enjoyed it. Thank you.
Chris Churilo 35:04.039 Oh. Thank you so much. Have a great evening, everybody. And we’ll see you again.