In this video, David McLean reviews how BBOXX turned to InfluxData to help them build the monitoring service for their solar panel solutions. In particular, he shares how InfluxCloud allows storage, retrieval and continuous real-time analysis of the data that BBOXX collects. By combining powerful machine learning algorithms with the InfluxCloud infrastructure, BBOXX can identify trends, usage patterns and even detect problems before they exist.
Watch the webinar “How BBOXX optimizes performance for 85,000 solar panels in rural areas” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How BBOXX optimizes performance for 85,000 solar panels in rural areas.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• David McLean: Senior Developer, BBOXX
David McLean 00:02.323 Oh, great. Okay. All right. Well, hi, everyone. My name’s Dave McLean. I’m a lead developer on the SMART Solar system at BBOXX, and so I’ll be giving this talk about how BBOXX are using Influx. And just to give you a brief overview of what we’re going to go through, I’ll start by talking about BBOXX, what BBOXX does for the company, and what I do there, and why we use Influx. And then I’ll talk about exactly how we’re using Influx specifically, why it’s helpful, and what challenges we face when we use it. So let’s get started. BBOXX is a company that supplies solar home systems in rural East Africa. So a solar home system is a solar panel connected to a battery. Ours come with a set of electronics inside, some control electronics to allow us to remotely control the whole unit, and also some sensible power control electronics and a series of useful connectors. So that’s things like USB output or standard DC output for appliances like lights and radios and low power TVs. So we sell these solar home systems and we’re targeting customers who have no access to the national grid, so we’re looking at customers with really low energy requirement, a few watts at a time, really. Maximum, maybe 50 watts out of the day. And we’re looking at really the lowest common denominator here. So we’re trying to hit the lowest price band and in order to make things affordable for our customers, we sell these units on payment plans. And so a crucial part of selling a solar system on a payment plan is the ability to remotely monitor and control that unit. So we need to know if there are any faulty units, if there are any units that are behaving badly, any batteries that are failing, or any other technical errors that we might have so that we can replace that unit. We replace our units free of charge if there are any issues. And we also need to be able to control the units remotely to enable them and disable them as our customers pay. 
So SMART Solar is the section at BBOXX that handles that remote connection to the units, and that’s the system I work in. And so we’ve been working on a system that collects all the data from these units and distributes it throughout the company in a useful way.
David McLean 02:40.862 So the units are connecting over the GSM networks with an embedded SIM card inside the control unit. The control unit is the gray box down here with the B on it. And so they connect by the GSM networks in each country. So SMART Solar gets access to two sets of data. So we have some technical data about the product, like the identification number of the product, what battery type it’s got, what the current software version of that product is. Also, some time-series data. That’s voltage, current, and temperature that we record from the box, and it uploads every four hours. Some logs from the box’s firmware. We also store some derived data that we calculate out of the raw data, so instantaneous power usage or summaries like daily energy usage. So we calculate those and we put them back into Influx. Obviously, it’s the time-series stuff that we’re interested in here and that’s what I’m going to talk about for the rest of the presentation. So here’s an example of some time-series data that we might collect from a box. We can look at the charging period of the box and the usage pattern of the box. You can see that the most basic thing that we might ever want to do is just to collect this data from a unit, store it, and then retrieve it so we can just look at what the usage pattern for the last years or days was. That’s the kind of basic requirement and that’s pretty simple, and there are lots of ways of doing that. And a time-series database is something that makes sense for scaling up nicely. When we have thousands or hundreds of thousands of units, still being able to access this data reliably and rapidly is a really critical requirement.
David McLean 04:27.995 Once we’ve got the raw data to how it needs to be, we start being able to do some more interesting things with that data and to drive business value with the data that we collect. So, for example, if we’re looking at faulty units or units that are having problems, we can run analysis on the raw data to try and predict the state of health of the battery. So anyone who’s got a mobile phone or a laptop will probably know that batteries degrade over time and the performance drops off. And in this graph here, we can see that kind of performance drop-off happening. So if we calculate the state of charge of our battery, it’s wobbling around to start with. At the start it looks quite healthy and then it really starts to drop down. At this point here, we probably raise an alert within our system. We raise an alert to our technicians to say, “We think this unit’s having a problem. We think that the battery’s failing. We should ring that customer and maybe check that our analysis is correct or see if they’ve had any problems, and potentially ask them to bring their product in.” And then we can replace it before it actually fails and stops providing them with the energy that they paid for. This is the primary focus of SMART Solar. It’s on monitoring the performance of our units, identifying failures and, potentially, predicting failures and hopefully preventing them from affecting our customers.
David McLean 05:53.342 And a secondary purpose for the data on an individual level is looking at the individual system usage. So this, I think, is current usage for a particular unit in a day. And by taking a look at the usage characteristics, we can start to gain some insight into exactly what our customers are doing. So in this graph, we can see at the start in the day, around here, sunlight happens and the unit starts to charge. And then charging happens, our customer gets up. These two large spikes here are them turning the TV on. That’s the TV, sort of the large power draw. The TV goes off, so presumably, they leave the house, perhaps there were kids going to school or going to work, and charging continues throughout the day with some light usage. Then as it gets dark, the product stops charging, and we can see these large spikes again of usage. And these look more like some lights, as well as possibly the TV, or a charger. And then later on in the day, we see that around midnight or so, everything gets turned off, and then the system enters a sort of dormant state. So this kind of information is really useful for our business because it’s extremely valuable to know how our customers are using data, are using the energy that we provide. And there are often some quite surprising results from trying to work out how our customers are actually using energy.
David McLean 07:27.954 A quite interesting example was when we started out, we used to find that some customers would leave lights on all night, and we couldn’t really work out why. They would ring up and say, “My unit’s not working,” and “I’m not getting enough energy,” and we’d look back at the logs, and see that they left their lights on all night. We’d ask them why and they’d say, “Oh, it’s for security.” There are no streetlights out in rural Africa and so having a light on outside really improves the security of your home. That allows us to understand that’s how customers would like to use our products, and maybe produce some packages that have a greater capacity, or really low-level lighting so that they can place security lights outside their house. Learning about how our customers use the data is really important for us too. So those are the two examples of how we’re using our time-series data on an individual customer level. And then on the wider scale, we can start to look at some more aggregate values. So this is a neat graph, this is showing a huge spike in usage on the sixth of September. That was the day that Kenya played Zambia in the Africa Cup of Nations. This spike is all of our customers tuning in and watching that on their TVs. So we can see large-scale trends in usage across rural Africa, we can see what kind of things people are interested in doing, what they would like to do and what potentially we could provide for them or they might want us to supply to them, if we could do it at scale.
David McLean 09:07.106 And other interesting things that we’ve done recently. There was a solar eclipse in Africa and we can track that with our panels. So this graph shows you the charge profile of the panel on two different dates, the 13th of August and the 1st of September. So the green line is the day without the eclipse. You can see the standard charge profile here. The sunrise happens, the unit starts charging and then discharges at night. And the red profile, this is eclipse day, and about here we can see it gets dark, the unit stops charging the way it should do. No charging happens and then the eclipse stops, it passes and the unit continues to charge again. So out of color, we can monitor things almost like weather patterns with our units. Although realistically it’s quite easy to tell if there’s a solar eclipse happening. You just look outside, if it’s dark, that’s a solar eclipse. So something that could potentially be a bit more helpful is the use of the wide coverage of products within the countries. So this graph is just a map showing where all of our users are in Rwanda. And you can see that we’ve got pretty universal coverage of BBOXX units throughout Rwanda. And as I said earlier, the units connect via these SIM cards that are embedded in them, over the GSM networks. And when they connect, they send this information about what network they are connected to. And now that we have wide coverage, that means that we can start to look at which networks are affected in Rwanda, and which networks are having problems.
David McLean 10:45.152 So in this chart here, the colors correspond to which network each unit is connected to. And this gray block at the top is [inaudible]. And we can see that on March 1st [inaudible] had a widespread network outage. And so this starts to, actually, be really useful information that we could provide to other people. We can reliably tell which networks are correctly functioning at any given time, and potentially provide feedback to those networks or to parts and services about what networks are affected in which areas and which locations within Rwanda. So that’s the kind of overview of the different kinds of data that we collect, how we use it on an individual level or aggregate level, and what kind of things we might want to do with it. And so now I guess I’ll move on to the main purpose of this talk, which is about Influx: why we store data in Influx, why it’s helpful, what we like about it and the challenges we face when using it. And so in the next bit I’ll start by talking about the data pipeline, how we get data into Influx. And then I’ll discuss our schema, how we structure data internally within Influx and what that allows us to do, and a brief discussion of our retention policies and database configuration. And that’s really the main configurable part of our system. We’ll have a quick look at system performance, and then I’ll talk about the things that you can’t do in Influx, because there’s a lot of discussion about what you can do, but people often miss out the things that it’s not good at, the things that you have to avoid doing when you use it. So those are the things that I’m going to talk about and then hopefully questions at the end will be where I can really answer some of your more specific questions.
David McLean 12:41.713 So let’s dive into that. So our data pipeline really looks a bit like this. So we have units on the ground, and they’re connecting over the GSM networks into our back-end application, and that’s all hosted in Amazon Web Services. So this is a slightly simplified diagram. They actually connect through a proxy service which then resends the data over HTTP, but it’s not particularly relevant to this, so this will do for now. And our back-end services then pipe the data directly into Influx there. And we have a really small amount of filtering here. We prevent any data being written into the future in Influx. So if we have some garbled data come in, we just prevent anything going really far into the future, and we prevent anything going really far into the past. But other than that, we don’t really do any processing on the raw data coming in, we just pipe it straight into our Influx setup. We’re using InfluxCloud. So InfluxCloud is the fully hosted, clustered version of the open source InfluxDB software. And that means that Influx are hosting the entire database. They’re hosting it also in Amazon Web Services. So our back-end application and Influx are in the same data center with Amazon, which means there’s good low latency between the database and our back-ends, and that makes data ingest and data querying nice and quick.
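The timestamp filtering described above can be sketched in a few lines of Python. This is only an illustration, not BBOXX's code: the tolerance values, the function names, and the point layout (a dict with a timezone-aware `time` key) are all assumptions, since the talk does not give the actual limits.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tolerances; the talk does not state the real limits.
MAX_FUTURE = timedelta(hours=1)
MAX_PAST = timedelta(days=730)


def is_plausible(ts, now, max_future=MAX_FUTURE, max_past=MAX_PAST):
    """Return True if ts lies within [now - max_past, now + max_future]."""
    return now - max_past <= ts <= now + max_future


def filter_points(points, now=None):
    """Drop points timestamped too far in the future or the past.

    Each point is assumed to be a dict with a timezone-aware 'time' key.
    No other processing is applied to the raw values.
    """
    now = now or datetime.now(timezone.utc)
    return [p for p in points if is_plausible(p["time"], now)]
```

Everything that passes the check is forwarded unchanged, which matches the "pipe it straight in" approach described in the talk.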
David McLean 14:18.842 So once we’ve got the raw data in Influx, we start to run analysis on it. And we run all of our analysis just using Python. And in particular, we run a lot of our analysis automatically. So each time the unit connects to our back-end services, that back-end application writes data into Influx and kicks off an asynchronous Python task, which goes and decides whether or not that unit that just connected has been analyzed recently. And if it hasn’t, then the Python task will go to Influx, and retrieve a large chunk of data, typically about seven days’ worth of data, and run a few analyses on it and look at the results. And depending on exactly what analysis we run, it will take one of two actions. So some of the analyses are interested in generating derived data. So I talked before about generating power usage or the daily energy usage per unit, so for those kinds of derived fields, we want to write those back into Influx. So the Python scripts will just connect back to our InfluxCloud and write information in there. And for other things, like raising alerts on faulty units, we’re interested in informing the rest of the company. So at that point, the Python script will go away. It’ll talk to some of our other services and start raising alerts within the system and start generating actions to call customers or to recall units that are faulty.
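As a rough sketch of the task described above, the following Python shows the "analyzed recently?" check, the seven-day query window, and one derived field (energy from voltage and current). The function names, the one-day re-analysis interval, and the trapezoidal integration are assumptions made for illustration; the talk does not describe how the real analyses are implemented.

```python
from datetime import datetime, timedelta, timezone

ANALYSIS_INTERVAL = timedelta(days=1)  # assumed re-analysis interval
ANALYSIS_WINDOW = timedelta(days=7)    # "about seven days' worth of data"


def needs_analysis(last_analyzed, now, interval=ANALYSIS_INTERVAL):
    """True if the unit has never been analyzed or not within `interval`."""
    return last_analyzed is None or now - last_analyzed >= interval


def analysis_window(now, window=ANALYSIS_WINDOW):
    """Return the (start, end) time range to query for one analysis run."""
    return now - window, now


def energy_wh(samples):
    """Integrate power over time with the trapezoidal rule.

    `samples` is a time-sorted list of (unix_seconds, volts, amps) tuples.
    Power is volts * amps; the result is in watt-hours.
    """
    watt_seconds = 0.0
    for (t0, v0, i0), (t1, v1, i1) in zip(samples, samples[1:]):
        watt_seconds += 0.5 * (v0 * i0 + v1 * i1) * (t1 - t0)
    return watt_seconds / 3600.0
```

A derived value such as `energy_wh(...)` would then be written back to Influx, while an alerting analysis would call out to other services instead.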
David McLean 15:56.429 So it’s a relatively simple pipeline at the moment. So we just transmit data once every four hours, the raw data goes into our back-end, and is written to Influx. And once every day or so, each unit has the last week of data analyzed, checked for any inconsistencies or anything that we might want to raise alerts on. And also, we write the derived fields back into Influx. So hopefully, that gives you a good idea of how we’re moving the data around. And so at this point, it’s probably good to talk about how we structure our data inside Influx. So Influx holds its data in, well, it starts by using measurements, fields, and tags. So a measurement refers to a category of values that you might be interested in looking at. So, for us, the data that we send, the raw data here that comes in, we refer to that as telemetry data. So voltage, current, and temperature, that’s all telemetry data. So it makes sense for our measurement, our category bucket, to be called telemetry. And inside that, we have data points, which have fields and tags. So the fields refer to the name of the value that you’re measuring. So if you’re measuring voltage, your field is voltage. If you’re measuring current, your field is current. And you can have quite a large number of fields in each measurement. It really depends on exactly what data you’re gathering. Another way to think about this is your fields should correspond to each sensor that you’ve got. So we have a voltage sensor, a current sensor, temperature sensor, so those are our fields. And the tags are what we’re using to identify each data point. So clearly, we’re likely to want to query out the data for each particular product, maybe to show... excuse me, that was very loud.
David McLean 18:02.363 Sorry, where was I? Yeah. So we’re likely to be retrieving a few days of data about any given product at a time, and so we tag each of our data points with the product ID that it came from so that we can query them out nice and quickly. And we might also want to tag data points with some more general information. So as an example here, I’ve got battery type, so you might want to look at data across all of a particular kind of battery to see perhaps how that battery is performing. Or, for example, if we have an analysis that’s running that’s, say, only valid for lithium polymer batteries and not lead-acid batteries, we can just query the raw data, the voltage, current, temperature, out straight by battery type and ignore the particular product that they came from. So there are some important things to know and plan about the schema design.
David McLean 19:03.142 So the first is that all of the identifying data about each data point should be held in the tags. So the things that identify where it came from and what generated that data should be held in tags, and tags should only have one piece of information in them at a time. So you shouldn’t be storing both the product ID and the battery type in a single tag, you want to split those out and that gives you really nice flexibility in your queries. You can always query with multiple tags, so I can ask for a particular product or I could ask for a particular battery type so long as the product ID was greater than 10, for example. So I can filter flexibly on any of my tags and keeping them separate allows us to do that. As I said actually, in this tag I’ve got here every point obviously has a timestamp. It’s a time-series database so every single point comes in with a timestamp. I didn’t put them on, I guess it makes them a little bit easier to look at here. So clearly you can also look by time for the last day of data, a week of data, a month of data. And one of the great joys I find about using Influx is that the query language often allows me to directly get the information I’m looking for with a single query. So I’ve got a few examples of why this schema works well for us and the kind of queries that we’re likely to want to run. I’ve written these three example queries and you can see how this schema allows us to make queries like this.
David McLean 20:42.372 So to start with, a simple select query. We just want to display the raw data coming from our product. We can select current, voltage from the measurement and specify the product that we want, and that’s great and good. And then maybe we’re interested in single aggregate numbers, perhaps we’re interested in what the average voltage was across all of our products. And perhaps you want to see how that’s changing over time, and voltage doesn’t make so much sense in this context, but if we were looking at energy usage, we might want to look at the average energy usage per day for one of our products, and that’s also nice and simple with this schema design. So here we’ve got voltage, but if we were looking at a different measurement with power in it, then we could select mean power or mean energy from that measurement and just “group by” the day. And then that would give us the daily energy usage across all units and we could see how that’s changing over time.
David McLean 21:42.598 And then finally, so the filtering and thresholding allows us to run, sort of, alerting type queries directly at the [inaudible] from the database. So for instance, if I was to look at the current patterns of units that had bad voltages, say perhaps the voltage on a unit should never go above 14.1 volts. If it does, we’re interested in looking at the current output of those units because certain current outputs might indicate an actual unit failure whereas others would indicate that despite the high voltage nothing has gone wrong yet. So that’s the kind of query that we could do nicely from here so we could just—in this query we’re looking at all the current values from our measurement telemetry. Perhaps, for example, this analysis is only valid for lead acid type batteries so we can filter on that as well.
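The three kinds of query just described (raw select, daily aggregate, thresholded filter) might look roughly like the InfluxQL strings built below. The slides are not part of the transcript, so the exact statements are a reconstruction: the tag names, the time ranges, and the helper functions are assumptions, and the strings are only built here, not sent to a database.

```python
def quote(value):
    """Quote a tag value as an InfluxQL single-quoted string literal."""
    return "'" + str(value).replace("\\", "\\\\").replace("'", "\\'") + "'"


def raw_data_query(product_id, days=7):
    """Raw current and voltage for one product over the last `days` days."""
    return (
        'SELECT "current", "voltage" FROM "telemetry" '
        'WHERE "product_id" = ' + quote(product_id) +
        " AND time > now() - " + str(int(days)) + "d"
    )


def daily_mean_query(field="voltage", days=30):
    """Mean of one field across all units, grouped into one-day buckets."""
    return (
        'SELECT MEAN("' + field + '") FROM "telemetry" '
        "WHERE time > now() - " + str(int(days)) + "d GROUP BY time(1d)"
    )


def high_voltage_current_query(threshold=14.1, battery_type="lead_acid", days=7):
    """Current readings at points where voltage exceeds the threshold."""
    return (
        'SELECT "current" FROM "telemetry" '
        'WHERE "voltage" > ' + repr(float(threshold)) +
        ' AND "battery_type" = ' + quote(battery_type) +
        " AND time > now() - " + str(int(days)) + "d"
    )
```

In InfluxQL, identifiers are double-quoted and string values are single-quoted, which is why the tag comparisons go through `quote`.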
David McLean 22:46.556 Another thing which is nice, which I haven’t actually gotten to yet, is about the mathematics you can do inside measurements. So power is current times voltage, so if I wanted to generate power from this raw data I can select that straight out from the database: select current times voltage from telemetry, and it would return data values. So that’s another nice thing that’s particularly good for our use case where we want to store the raw values of the current and voltage but often we want to look at some function of those values, so in this case current times voltage. So having set up the database in a sensible fashion and with a sensible schema allows us to query the kind of information that we want. It’s sensible at that point to set a good retention policy. I’m not sure whether everyone has a good understanding of exactly what retention policies are, so I’ll cover that very quickly now. Apologies if you already know this. So, a retention policy defines how long data is kept within Influx. So traditionally, originally, Influx was often used as a monitoring service, so maybe server monitoring. And in that case, it’s often true that you don’t really want to hold your data for a very long time. Perhaps you only want to keep a month’s worth of CPU usage, or RAM usage, or whatever it is, and a retention policy lets you automatically drop and discard data that you’re no longer interested in. So you set the retention policy, and then you tell the data to write into that retention policy, and then you specify how long you want that data to last for. And when data reaches the expiry time, that data is dropped. Retention policies are also a good way to separate out your data streams. So you could have a separate retention policy for logs and raw data. And then, you could keep logs for a certain amount of time and your raw data for a different amount of time.
When you’re configuring retention policies, there are three main configurable variables, which this diagram shows.
David McLean 25:05.391 So the first one is duration. How long do I keep data for? And that’s shown in this diagram here. So this duration is dependent on the data that you collect. So for our raw data, for example, we’re actually interested in keeping that forever. There’s a good chance that we’ll want to look at long-term usage trends, and perhaps do analysis on old data, or if we’re looking at machine-learning algorithms, it’s useful to have, actually, two full years of data for each unit to look at for training the algorithm. So the duration is really going to be determined by what you want to do with your data. And replication is fairly self-explanatory, where you can have a replication factor on your retention policy and that determines how many copies of your data are kept within Influx. So, for us, for example, we don’t want to lose any of our raw data, so we have a replication factor of two. We could bump that up to three if we were really concerned about it. But for our logs, for example, we’re interested in the logs that come from the units. But if some of those got lost, it’s not the biggest disaster, so we just leave that at one. We’re not too worried about replication of the logs.
David McLean 26:23.717 So the last configuration parameter that we might look at is shard duration. So shard duration determines... and this is different to duration, and that should be made clear. So duration is, how long do I keep my data points in general? And then shard duration is, how long does each individual shard inside the database last for? So if your duration is 1 year and your shard duration is 1 month, then you will have 12 shards in your database, 1 per month, and this determines the block size of information inside your database. And so, you can see that will also determine how much data you lose at a time. So if you have small shards, then you would lose a very small amount of data as it went through the threshold. So for example, if in our example we had a 1-year duration and a shard duration of 1 month, then we’d have these 12 shards, and each time a shard got over the year’s limit, it would drop a month of data at a time. The whole shard has to be outside the duration limit before it drops, so really in that scenario, in reality, we’ll be keeping about 13 months of data in total as the shard drifted over the duration edge.
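The "about 13 months" figure can be checked with a small Python sketch. It models time as integer days, treats a month as 30 days and a year as 360 days for simplicity, and assumes a shard is dropped only once its end time is at or before `now - duration`. The function name and the day-based model are illustrative assumptions.

```python
def live_shard_starts(now, duration, shard_duration):
    """Return start times of shards that have not yet been dropped.

    A shard covering [start, start + shard_duration) is dropped only when
    its end is at or before the cutoff (now - duration), so the shard that
    straddles the cutoff is still kept whole.
    All arguments are integers in the same unit (days here).
    """
    cutoff = now - duration
    first = (cutoff // shard_duration) * shard_duration  # shard containing cutoff
    last = (now // shard_duration) * shard_duration      # shard containing now
    return list(range(first, last + 1, shard_duration))


# 1-year duration (360 days) with 1-month shards (30 days), at day 1000.
starts = live_shard_starts(now=1000, duration=360, shard_duration=30)
oldest_data_age = 1000 - starts[0]
```

Here 13 shards are live and the oldest retained data is 370 days old, slightly more than the 360-day duration, which is the effect described in the talk.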
David McLean 27:45.616 So if you want to reduce the shard duration, then you get slightly finer control over exactly which data is dropped. But that comes with caveats, so having a very large number of shards tends to degrade system performance a bit. And you can really increase the RAM performance, and the speed, and the compression of your system by using larger shards. So there’s a small trade-off there. And actually, shard size is one of the things that we ran into some problems with while we were setting up Influx. So I’ll talk about that a little bit more at the end. But those are the three things that you need to consider if you’re looking at your retention policies, so the duration of the policy, the shard duration and the replication factor. So, how does that translate for us? I mentioned this briefly, but these are the retention policies that we hold right now. So for our raw data, that’s our telemetry data, we keep it forever, we replicate the data twice and our shard duration is two weeks. So realistically we could probably increase that shard duration, and it would probably be a useful thing for us to do at some time, although having already written data into it in these two-week chunks it’s now some effort to change that. But you can see this corresponds to us wanting to keep our data forever and we get relatively good performance out of this shard setup so that’s how we’re running at the moment. Similarly, we have a retention policy called Analysis RP. Analysis RP is where all of our derived fields go. And that also has infinite duration, so we never want to lose that data and we do want to keep it around, so we have a replication factor on that and a slightly longer shard duration on that retention policy. There’s slightly less data in there so we have a longer shard group.
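For reference, the two policies described so far could be expressed as InfluxQL `CREATE RETENTION POLICY` statements along the lines below. This is a sketch only: the policy and database names are hypothetical, and the replication factor and the four-week shard duration for the analysis policy are assumed values, since the talk only says it is replicated and that its shard duration is "slightly longer" than two weeks.

```python
def create_retention_policy(name, database, duration, replication, shard_duration):
    """Build an InfluxQL CREATE RETENTION POLICY statement as a string."""
    return (
        'CREATE RETENTION POLICY "' + name + '" ON "' + database + '" '
        "DURATION " + duration +
        " REPLICATION " + str(int(replication)) +
        " SHARD DURATION " + shard_duration
    )


# Hypothetical names; INF means the data is kept forever.
POLICIES = [
    # Raw telemetry: kept forever, two copies, two-week shards (from the talk).
    create_retention_policy("telemetry_rp", "smartsolar", "INF", 2, "2w"),
    # Derived fields: kept forever; replication 2 and 4w shards are assumed.
    create_retention_policy("analysis_rp", "smartsolar", "INF", 2, "4w"),
]
```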
David McLean 29:49.041 You might wonder why we have two separate policies for those things given that they look very similar. One thing that we found... so, I spoke about separating the data streams. If you want to drop data in the database, a very performant way of doing that is to simply drop the whole shard that that data is contained in. And if, for example, we had written some new analysis that was generating some points but there was a bug in our code and we wrote in some bad data points, we might want to just go in and remove those data points. If those were all mixed up inside our raw data then we wouldn’t be able to drop the whole shard because we’d also drop all the raw data. And I’ll speak a little bit later about just deleting individual data points. That can cause some issues within the system and so it’s really helpful to keep those things separate so that if we do write some derived data in and then decide that actually, we didn’t want that, we could just drop it from that retention policy without worrying about it getting caught up with the rest of our raw data.
David McLean 31:02.691 So then the last retention policy is the retention policy for our logs. Like I said, we’re not really interested in logs after more than about a year. Sometimes we do want to go back and look at what happened maybe eight months ago if we find out after the fact that there was some unusual behavior perhaps within the network systems and we want to see what happened, but more than a year, we’re not too interested in it. So we only keep them for a year and we’re not too interested in replicating that data. So we’ll talk a little bit now about the usage, the performance of our system having set it up with those retention policies and that schema and that structure. So as I said, we use InfluxCloud, which is the clustered hosted service that Influx provides. And so, for this service, we have a four-node solution. So we’ve got two large data nodes and two small meta nodes. And the meta nodes just keep data synced between the nodes and ensure that replication is held correctly and ensure that queries are sent to the right place. And the data nodes are where all of the processing takes place. So our data nodes are 8 gigs of RAM and 2 CPUs, so nothing particularly large. And the system holds about 500 gigs of disk in total. So, in terms of databases, probably, I guess, small to medium-sized so far. But as the company grows, and as we generate more units, so currently we have about 30,000 units active in the field and we have another 30,000 waiting to be sold. And we’re really hoping to scale out that number of units very rapidly in the next year or so. And so having a scalable, reliable system for handling the data that’s coming in is important.
David McLean 33:00.305 And one of the reassuring things about using the Cloud solution is that as our requirements grow, we can just add data nodes and meta nodes into our system, and there isn’t a hard limit on that. It’s not like a single large database system where eventually you’d run out of compute power or you’d run out of bandwidth. We can just keep adding nodes as our requirements grow. And so, with this system, so 8 gigs of RAM, 2 CPUs, and 2 data nodes, right now we have very beautiful usage. So our current system load sits around 30% generally. And if we start running some more intensive analysis, so that’s querying lots of data out and writing it in, sometimes it rises a bit higher, but we very rarely hit the CPU limits. We’re sat at about 40% RAM usage within the system. So most of our RAM usage is due to our writes. We have about 3,000 writes a second and growing as we bring more units online. But this RAM usage is actually a bit higher than it should be for our system. So we have an interesting challenge with Influx. So the way that Influx is organized, they expect data to be written into the most recent shard in the retention policy by default. So it’s optimized for that use case. And for us, at the moment, that’s how almost all of our units work. But GSM coverage isn’t perfect in Africa and so when a unit drops out of coverage for a long time, it simply records data and holds it until it comes back in.
David McLean 34:52.309 And now that we have 30,000 or so units operating in the field, what we find is that the number of units that are checking in after 6 or even 12 months is actually quite high. Not high as a percentage but high in terms of we get definitely a few every minute. And that means that we’re actually writing across quite a wide range of shards, which is one of the things that Influx isn’t designed particularly well to do. So we’re really stressing out our system by writing over such a wide range of shards. And we’re sitting at about 30% usage and 40% RAM. So we have quite a lot of headroom to grow into in our system, especially if we can improve on the time range that we write data into. In terms of reading data, average request time’s around 100 milliseconds per request. And that’s a [inaudible] for all of the systems that we use so that’s encouraging.
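To illustrate why late check-ins stress the system, this Python sketch counts how many distinct two-week shards a batch of writes would land in. It assumes shard boundaries are aligned to the Unix epoch, which is a simplification for illustration and not a statement about how InfluxDB actually aligns shard groups; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shard_index(ts, shard_duration):
    """Index of the shard containing ts, assuming epoch-aligned shards."""
    return (ts - EPOCH) // shard_duration


def shards_touched(timestamps, shard_duration=timedelta(weeks=2)):
    """Number of distinct shards a batch of point timestamps falls into."""
    return len({shard_index(ts, shard_duration) for ts in timestamps})


now = datetime(2017, 3, 1, tzinfo=timezone.utc)
# Units reporting on schedule: everything lands in the most recent shard.
fresh = [now - timedelta(hours=h) for h in range(4)]
# Units checking in late, with data from 1 to 12 months ago.
late = [now - timedelta(days=30 * m) for m in range(1, 13)]
```

With these inputs, the on-schedule writes touch a single shard, while the twelve late points each fall into a different shard, so a handful of stragglers per minute is enough to keep many old shards active.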
David McLean 35:56.905 So this is the last section that I’ll talk about and it’s the things that we can’t do with Influx, or the things which might be a bit surprising if you didn’t have a good knowledge about the way that Influx is organized and what it’s optimized to do, and what it’s optimized to not do. So the first thing, which I just discussed, is writing across a large number of shards. So I guess the first thing about this is that this is a good argument for not having a very large number of shards. But if you want to keep data for a very, very long time, you can just have a large shard duration. Have few shards in your database and then you won’t write across a large number of them. If you do have a very small shard size then it’s important to be pretty clear that you’re not going to, either deliberately or accidentally, write across that range of shards, because you will start to see RAM use increase. And you will have worse performance than you could do if you were regularly writing into the most recent shard.
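The effect of shard duration on how many shards a batch of writes touches can be illustrated with a small sketch. This is not how BBOXX or InfluxDB computes it; it assumes, as a simplification, that shard windows are fixed-width and aligned to the Unix epoch, and the dates and the `shards_touched` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def shards_touched(timestamps, shard_duration):
    """Return the set of shard-window indices that a batch of points falls into.

    Assumption: windows are fixed-width and aligned to the Unix epoch.
    """
    width = shard_duration.total_seconds()
    return {int(ts.timestamp() // width) for ts in timestamps}


# Hypothetical batch: two fresh points plus one from a unit that was
# out of coverage for roughly six months.
now = datetime(2017, 12, 1, 12, 0, tzinfo=timezone.utc)
batch = [now, now - timedelta(minutes=10), now - timedelta(days=180)]

weekly = shards_touched(batch, timedelta(days=7))
yearly = shards_touched(batch, timedelta(days=365))
print(len(weekly), len(yearly))
```

With short shard windows the late-arriving point lands in a different shard from the fresh ones, whereas with a long shard duration the same batch is more likely to stay within one shard.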
David McLean 37:06.162 So the second thing that we’ve found that was a bit of a gotcha that we weren’t expecting is about deleting. So Influx is specifically designed to not be particularly optimal for deleting specific records. So the purpose of retention policies and shards is to allow you to drop large time ranges of data at a time, which is the standard use case for lots of time-series data applications. But it’s really not very good at deleting one particular time-series in amongst a large number of others. So, for example, let’s say we have all of our units connecting and writing in the current time-series. Sorry, current as in current voltage. So if we have thousands of these units writing into that measurement and for some reason, we wanted to go back and not just drop one month’s worth of current values, but just drop the particular current values for a unit, the way that Influx handles deletes like that is to write tombstone records. And that means that deletes are actually similar to writes in terms of the points, and at that point you’re writing across a large number of shards because it’s a huge time range. And then you’re back to the same situation, where you can have some troubles [inaudible]. So it’s important to avoid large deletes across ranges of time like that. And there shouldn’t really be many reasons for people to want to do that, I don’t think. We certainly found that we could avoid all of our deletes in that format just by understanding what retention policies were for, how to configure them for our use and using them accordingly.
David McLean 38:56.763 And then, the last thing we found is about particularly large queries. So when you’ve got thousands of units all connecting and writing raw data into a particular measurement, you end up with each measurement having a really large amount of data stuck inside it. And it’s relatively easy to just try and query that data out, sometimes by accident or if you didn’t consider the implications of your query. So for instance, if you were looking for the voltage for a particular unit and started writing, select voltage from telemetry, but didn’t include your filter, then Influx is going to start trying to return all of the voltage data for all units across all time, which is going to really stress out the system. And a property that we found in Influx is that Influx tries really hard to execute whatever query you asked of it. So if you’re querying out really large amounts of data, [inaudible], it will attempt to do what you asked right up until the point where the system fails. And so, a large query like that can cause a system panic, and it’s worth setting up your system to ensure that it’s not possible for your users to accidentally or deliberately query out huge amounts of data. If you’re querying a small amount of data at a time, if you’re looking at perhaps a single unit over a few days or even longer, that’s fine. We can happily look at a year’s worth of data for a product within a single query with no problems. But large aggregate queries can cause issues and can be difficult to tell from the user side, certainly if you’re using InfluxCloud and it’s hosted rather than running the open source version yourself. So those are three things that we learned by failure and that we learned by finding that we have problems and then finding out why. Hopefully, that’s helpful for everyone else. And maybe that will prevent you also doing these things. And that’s the end of what I’ve prepared.
So thanks very much for listening and I hope you enjoyed it and feel free to ask me any questions.
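One way to make the unfiltered "select voltage from telemetry" mistake harder to make is to check queries before they are sent. The sketch below is only an illustration of that idea, not BBOXX's implementation: the `check_query` helper and the `unit_id` tag name are hypothetical, and the check is a simple text match rather than a real InfluxQL parser.

```python
import re


def check_query(query, required_tag="unit_id"):
    """Raise ValueError unless the query filters on a tag and bounds time.

    Assumption: a plain regex check is good enough for a first line of
    defence; it does not parse InfluxQL properly.
    """
    match = re.search(r"\bwhere\b(.*)", query, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("query has no WHERE clause")
    where = match.group(1)
    if not re.search(rf'"?{re.escape(required_tag)}"?\s*=', where):
        raise ValueError(f"query does not filter on {required_tag}")
    if not re.search(r"\btime\s*[<>]", where, flags=re.IGNORECASE):
        raise ValueError("query has no time bound")
    return query


# A query scoped to one (hypothetical) unit and a bounded time range passes.
safe = check_query(
    "SELECT voltage FROM telemetry WHERE \"unit_id\" = 'abc' AND time > now() - 7d"
)
```

Calling `check_query("SELECT voltage FROM telemetry")` raises `ValueError`, so the query never reaches the database.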
Chris Churilo 41:22.511 That was really great. Thank you so much for that very informative talk. What we will do is I’m going to read out the questions from the Q&A section so that everyone can hear them in better— hear live at this webinar, or if you want to listen to this again later on you’ll have this Q&A section as well. So the first question we have from Dhalid is, “Did you consider using Kapacitor instead of Python scripts? And if yes, what were the reasons for not using Kapacitor?”
David McLean 41:53.300 So when we originally looked at Kapacitor, it looked like it was going to be quite a struggle to organize the really complex analysis that we wanted to do. So Kapacitor looked really great for nice simple things like running thresholds and even slightly more complex stuff, but our battery state of health in Africa, for example, it has some relatively complex mathematics in it and we didn’t see that it was going to be possible within Kapacitor. In addition, we needed the analysis to call out to our internal APIs to look at the information for a product because, for example, the threshold limits are dependent on what kind of hardware we have in the product, what version of the product it was, maybe what country the product was in and who we sold it to. So, all of those things made it easier to just run in Python rather than get involved in Kapacitor. Also, we had a lot of expertise in Python and no expertise in Kapacitor. So all of those kind of pointed us towards using Python scripts. They have, though, recently added user-defined functions, which make it a lot easier to do the kind of analysis that we were interested in. At that point, we’d already set up all of our Python infrastructure, our Task Manager, etc., etc., to handle things. So it’s something we’re interested in in the future but it’s not something that we’ve looked at right now. In terms of other people’s use, I guess it really depends exactly what you’re doing. If you’re looking at alerts and monitoring based just on the raw data and you don’t have an existing setup, then I would probably look at Kapacitor first since Kapacitor is integrated well with Influx. And I don’t really know how easy or difficult it is to make Kapacitor do more complex stuff and machine learning-type things, implementing various machine learning models that you might want to use on the data. And I don’t know how difficult it is to get it to play nicely with other APIs.
For instance, if you’d want Kapacitor to go and find thresholds from another service and then apply those, I don’t know how tricky that would be. So that’s how we approach the problem.
Chris Churilo 44:26.034 All right. The next question is, what is the data frequency for the BBOXX controllers? Is it per day?
David McLean 44:33.128 Okay. Sure. So data frequency, the systems connect once every four hours, theoretically, but they record data at a much higher rate than that. And the rate is actually variable. So at its lowest rate, when the box thinks that nothing particularly interesting is happening, it records, I think, once every 10 minutes. But as soon as there’s activity, so rapidly changing voltage or current, which probably corresponds to either charging or discharging, people connecting appliances, then it starts to record much more frequently. I think once every—I think in 10 minutes of output. So I hope that was your question. I feel like maybe you’re also asking how much data we’re getting. So in a raw format, we’re getting—in a really compressed format, we’re getting about a megabyte per unit per month. And we have, like I said, 30,000 units or so out in the field. But that gets heavily compressed by the Influx compression algorithms. So when you put things into Influx, after it’s been there [inaudible] running in the background and compressing data down, we see quite high compression ratios. I don’t actually have a figure on the exact compression ratio that we get. But we generally are well below any of the data thresholds, and disk space isn’t [inaudible] our system.
Chris Churilo 46:06.638 Awesome. All right. Next question from Dhalid, how do you handle the time zone information in your timestamps? Do you convert all the timestamps to UTC? And how do you query for, let’s say, a whole day in local time?
David McLean 46:19.609 Yeah, so all the timestamps go to UTC. We figured it was just going to be a lot easier to handle absolutely everything in UTC all the time and to have that as a complete, sort of, a fixed rule within our systems. That does have the issue about local time, and then it’s just a question of whatever system we’re using to query that data out. We just make sure that system knows about the time zone it’s in, if that makes sense. So for example, if I have a Python script that wants to get one day of data, one day meaning midnight to midnight in local time, we would just make the Python script find out when that local time day was in UTC and then query it in UTC. So—
Chris Churilo 47:16.526 Great.
David McLean 47:17.070 —if we’re looking for Rwanda midnight to midnight and Rwanda is a plus two, then our script will query plus two to plus two on various days.
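The local-day-to-UTC conversion described in this answer can be sketched as follows. This is a minimal illustration rather than BBOXX's script: the `local_day_in_utc` helper, the example date and the resulting time clause are hypothetical, and a fixed UTC+2 offset is assumed for Rwanda instead of a full time zone database lookup.

```python
from datetime import date, datetime, time, timedelta, timezone


def local_day_in_utc(day, utc_offset_hours):
    """Return (start, end) of a local midnight-to-midnight day as UTC datetimes.

    Assumption: the location uses a fixed offset from UTC with no
    daylight saving changes.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    start_local = datetime.combine(day, time.min, tzinfo=local_tz)
    end_local = start_local + timedelta(days=1)
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)


# One local day in Rwanda (UTC+2), expressed as a UTC time range.
start, end = local_day_in_utc(date(2017, 3, 15), 2)
fmt = "%Y-%m-%dT%H:%M:%SZ"
time_clause = f"time >= '{start.strftime(fmt)}' AND time < '{end.strftime(fmt)}'"
print(time_clause)
# time >= '2017-03-14T22:00:00Z' AND time < '2017-03-15T22:00:00Z'
```

Keeping the stored data in UTC and doing the offset arithmetic only at query time means the stored timestamps never need to change if a unit moves or a reporting time zone changes.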
Chris Churilo 47:28.148 Okay. That makes sense. And Jeff asked, “What kind of data do you store in log_rp?”
David McLean 47:35.906 And so that’s string and text data from the units. So the unit firmware has a log and it sends the log string when it connects and we just store those strings. That’s searchable, so you can search for that specific part of events if there are events that relate to errors or especially firmware. And so we can look at those logs and display in the free unit. And we can also have a look at, for instance, how many units might have a particular firmware error. So that’s quite nice. It’s just occurred to me, one thing probably that might really be helpful for everybody to know about, this is definitely a gotcha that could come up. When you write into a shard, you’re only allowed to put one data type in each field. So if I have a field of voltage, I can only put one data type in there. So if I start writing floats in there, I can’t then start putting strings in there or integers. And so what that also means is that if for some reason, you’re sending a kind of combination of floats and integers and you happened to start a brand-new shard and the first thing you put in is an integer, after that, all the floats will be rejected. So that is an important point. Make sure that whenever you’re sending data into Influx, you are really certain about the data types that you put in there, otherwise you can have issues.
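One way to avoid the mixed float and integer problem described here is to coerce every numeric field to a float before it is written. The sketch below formats a point in InfluxDB line protocol by hand to show the idea; the `to_line` helper and the `unit_id` tag are hypothetical, and escaping of special characters is left out for brevity.

```python
def to_line(measurement, unit_id, fields, timestamp_ns):
    """Format one point in InfluxDB line protocol with all fields as floats.

    Assumption: every field is numeric; anything else is rejected rather
    than written with a different type.
    """
    parts = []
    for name, value in sorted(fields.items()):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"field {name!r} must be numeric, got {type(value).__name__}")
        # float() turns 12 into 12.0, so the field is never written as an integer.
        parts.append(f"{name}={float(value)}")
    return f"{measurement},unit_id={unit_id} {','.join(parts)} {timestamp_ns}"


line = to_line("telemetry", "u1", {"voltage": 12, "current": 0.5}, 1)
print(line)
# telemetry,unit_id=u1 current=0.5,voltage=12.0 1
```

Doing the coercion in one place means a reading that happens to be a whole number cannot become the first integer written into a new shard.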
Chris Churilo 48:55.174 Jeff also asked, “Why did you use Influx for text-based log data as opposed to Elasticsearch?”
David McLean 49:02.902 Influx is what we had set up, and the units were already connecting to our system, which pipes data on to Influx, so we already had it set up. We didn’t have any particular disk problems or any reasons not to just shove the logs in there, so we really didn’t want to set up a completely separate service to run it. And I guess it’s worked really well so far for us. So there probably will be a point, or there might well be a point, where we start to think that this isn’t working so well for us, and maybe we would try the use of Elasticsearch or want to see other options, but we haven’t hit that yet. So there’s really been no reason to change.
Chris Churilo 49:41.138 And then Mikel asked, “Do you serve all the data to the end users of the BBOXX product?”
David McLean 49:47.628 So the users of our products are rural Africans and they probably don’t have any interest in seeing particular use of BBOXX, and even if they did, they’re unlikely to have the kind of connectivity requirements to see it, so we typically are not selling to people who have good Internet access. They might have it on their phone, but it would be quite tricky for them to look at it. We could do it—we currently don’t, only because we’re not aware of any demand for it. So the people who we do show the data to are our internal systems for the monitoring and [inaudible], our business development for looking at how systems are used, our technicians for looking at potential problems in the unit that’s coming for repair. So units come in for repair when they break, and if we don’t know why, we can go look through the previous data, so those are the main use cases. There’s a last one which is, in addition to selling to customers, we also sell to third parties, so there are other companies that use our systems. For instance, we have a company in Pakistan who are using our solar pumps to power water pumps—sorry, solar panels to power water pumps. So they can access our data using our API and then they can look at the usage of their pumps and see what’s going on there. So customers on the ground, unless they’re a third-party company, don’t really see that data.
Chris Churilo 51:24.087 Okay, another question from Mikel, “Did you consider using Kapacitor for continuous queries?”
David McLean 51:31.254 So, yeah, I think that question is, “Did we consider using continuous queries in our system?” Yeah, we know about them, we haven’t found a particularly obvious use case for them yet. As in, there’s nothing that we want to do where a continuous query seems like the obvious choice. There’s a reasonably strong possibility that there might be in the future. So for example, at the moment, I think I said before that we look at power quite a lot, current times voltage. At the moment, we’re just querying out current times voltage, but we might, in the future, decide that we want to keep power directly and not run that multiplication every time. So you might run a continuous query of current times voltage straight into another field, and so far, we haven’t really had the requirement to do that.
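The "current times voltage" calculation mentioned in this answer is simple enough to show directly. The sketch below does the multiplication on the client side, as the answer describes; the `power_series` helper and the sample values are hypothetical, and it assumes current and voltage arrive already paired by timestamp.

```python
def power_series(samples):
    """Turn (timestamp, current_amps, voltage_volts) samples into (timestamp, watts).

    Assumption: each sample already pairs a current and a voltage reading
    taken at the same time.
    """
    return [(ts, current * voltage) for ts, current, voltage in samples]


# Hypothetical readings: timestamp in seconds, current in amps, voltage in volts.
readings = [(0, 2.0, 12.0), (600, 1.5, 12.0)]
power = power_series(readings)
print(power)
# [(0, 24.0), (600, 18.0)]
```

Storing power as its own field, for example through a continuous query, would trade a little extra storage for not repeating this multiplication on every read.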
Chris Churilo 52:29.303 All right, and Javier asked, “Can you visualize data within Influx like the charts shown at the beginning of your slides? If not, which tool do you recommend to connect to InfluxDB?”
David McLean 52:40.231 So if you use Influx’s graphing tool, called—I forgot the name there.
Chris Churilo 52:55.370 Chronograf?
Chris Churilo 54:09.202 Alright. Another question from Dhalid, “Are you getting false or incorrect data measurements from the panels? And if so, how do you deal with them? And are you trying to detect them?”
David McLean 54:19.828 Yeah, this is a good one. So something that’s interesting, so the panels have—they sort of know what the time is. But sometimes if they haven’t been connected to the system for a long time they forget what the time is and then they default to either sometime in 2014 or they default to [inaudible] zero, so 1970. And when a unit does that, it then writes data and it sends the data with those timestamps. And then it tries to write into these hundreds of shards, which, as discussed before, presents problems for Influx because it starts using a bit of RAM also. So we have a couple of filters checking the timestamps. Every so often we’ve got stuff writing timestamps in the future. It’s not clear how that happens yet. So we’re detecting those really obvious ones. In terms of bad voltage and current readings, we don’t try and detect those at the time. We think that filtering out the raw data like that can be a bit dangerous. You could end up chucking out good data or missing data that is real but it’s just unusual because you saw so much bad. And something that we do, as data comes in, is scan it and look at whether we think it matches a kind of, what we would call a normal discharge. So over the course of the day, does it sort of look like good data or “bad” data? And bad data often happens, for instance, if one of the sensors sticks, and then what you get is, instead of some sensibly varying current and voltage, all the values just sit at 50.4 exactly the whole time or something. So we keep a record of all of the days of data that we think are good data. And then when we want to run training, we can run it only on data that we think is good. But we don’t delete or prevent incorrect data going in unless it’s a timestamp which is sort of definitively wrong.
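The two checks described in this answer, rejecting definitively wrong timestamps and flagging days where a sensor appears stuck, might look something like the sketch below. It is an illustration only: the cutoff date, the one-hour allowance for clock drift, and the helper names are assumptions, not BBOXX's actual filters.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical cutoff: later than the 1970 and 2014 defaults mentioned in
# the talk, but early enough to accept units checking in after many months.
EARLIEST = datetime(2015, 1, 1, tzinfo=timezone.utc)
FUTURE_SLACK = timedelta(hours=1)  # assumed allowance for small clock drift


def timestamp_plausible(ts, now):
    """Accept only timestamps after the cutoff and not meaningfully in the future."""
    return EARLIEST <= ts <= now + FUTURE_SLACK


def looks_stuck(values, tolerance=1e-9):
    """Flag a run of readings that never varies, as when a sensor is stuck."""
    return len(values) > 1 and max(values) - min(values) <= tolerance


now = datetime(2017, 6, 1, tzinfo=timezone.utc)
print(timestamp_plausible(datetime(1970, 1, 1, tzinfo=timezone.utc), now))  # False
print(looks_stuck([50.4, 50.4, 50.4, 50.4]))  # True
```

In line with the approach in the answer, only the timestamp check would block a write; the stuck-sensor flag would just mark the day as unsuitable for training.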
Chris Churilo 56:34.806 Very nice. All right, I think we have gone through all the questions that we have in the Q&A section. Apologies for some of the technical difficulties that we had here. But what I will do is we’ll make sure that we clean up this recording, put all the questions in there so you can also read them as you listen to David answer them. And then I would encourage everybody if you have other questions just go ahead and email me and I’ll make sure that I get it to David and so we can get those answered to you guys as quickly as possible. David, I have to say, I really appreciated this talk. I think that myself and everybody that was on this call today really appreciated the depth of knowledge and also the details that you shared with us. And I think it’s really going to help out a lot of other people. So with that, I’m going to say, thank you again. Thank you to everyone that attended our session and thank you so much, David, for being a really fantastic speaker.
David McLean 57:35.118 Thanks very much, Chris. And thanks to everyone for listening and good and interesting questions asked.