How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor Anomaly Detection by Using InfluxDB
Session date: Dec 01, 2020 08:00am (Pacific Time)
Ezako is a startup specializing in time series analysis. Ezako helps its clients detect anomalies and label their time series data. It helps accelerate the labeling process and analyze vast amounts of data from a variety of sensors in real time. The company provides anomaly insights that make data scientists' work easier. Ezako is the creator of Upalgo, a time series data management tool that uses AI to automatically detect anomalies in streaming data.
During this webinar, Ezako will dive into how high-frequency sensors can generate huge amounts of data which can become desynchronized. This can result in data quality issues, as the data can contain errors and glitches. Ezako uses machine learning, labeling, and feedback loops to identify these errors. Discover how the company helps improve its clients' data quality and reduce the number of validation mistakes.
Watch the Webinar
Watch the webinar “How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor Anomaly Detection by Using InfluxDB” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.
[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]
Here is an unedited transcript of the webinar "How to Improve Data Labels and Feedback Loops Through High-Frequency Sensor Anomaly Detection by Using InfluxDB". This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Julien Muller: CTO, Ezako
Caitlin Croft: 00:00:04.410 Hello, everyone. Welcome again to today’s webinar. I am very excited to have Julien from Ezako here to talk about how they have used InfluxDB in the back end of their platform, and they are using it to improve data labels as well as feedback loops. Please feel free to post any questions you may have in the Q&A box. We will be monitoring that throughout the webinar, and all questions will be answered at the end. So without further ado, I’m going to hand things off to Julien.
Julien Muller: 00:00:41.881 Thank you, Caitlin. So let me share my screen. So I hope everyone can see my screen.
Caitlin Croft: 00:00:53.432 It looks great.
Julien Muller: 00:00:54.953 Perfect. As Caitlin said, we are going to talk today about specific activities in machine learning on time series. So data labeling and feedback loops, so how to integrate the information we get out of labels into our machine learning process. So I’m going to first introduce myself. I’m Julien Muller. I’m an AI expert. I’ve been doing big data and AI for the last 20 years. You can see more details on my LinkedIn profile. I am currently the CTO of Ezako. Ezako is a time series machine learning analytics company, and I created a product called Upalgo which is a platform for anomaly detection and labeling your data. So basically, Ezako is a company in France. We have our main office in Paris and a research center in Sophia-Antipolis. So Sophia-Antipolis is like the French Silicon Valley where all startups are located. You can see in the picture our office, so a lot of sun. A lot of light. It’s a little bit like California, actually.
Julien Muller: 00:02:26.323 So what we do at Ezako is really machine learning on time series. We decided to concentrate and focus only on this specific subject because it's actually big, and it's going to be much bigger. We mostly work with aerospace, automotive, and telco companies. So, basically, there are many sensors out there, a lot of complex objects. So you could see a satellite as an IoT device just producing data, and you can actually use this data - the telemetry elements of the data - to monitor this device, or also very small devices like a telco component in the infrastructure. And if you look at the future, I mean, time series is going to be much, much bigger. There are tons of IoT devices and sensors right now, but it's nothing compared to the future. We're talking about trillions of sensors available in 2025. So this is a problem, actually, because when you have a lot of data, you have to actually be able to analyze your data. And that's why I built the solution with the Ezako team that is designed to provide tools to handle these large data sets. So basically, what we want to do is identify interesting events in these very large data sets. So we are talking about anomaly detection. It's a very large field of machine learning. But if you only focus on time series, it becomes very specific and not that well covered, actually.
Julien Muller: 00:04:22.886 So what is specific with machine learning on time series? Well, the first thing is that time series are huge. I mean, you can find a very small time series. It's possible. But if you look at the raw data, field data, it's satellites that are coming in. It's IoT devices that are able to provide data at very high frequencies, maybe 1 kHz, maybe 50 kHz, so 50,000 points per second. And what is very specific with time series is that the temporality of the data matters a lot, which makes most of the tools not usable, in fact, because when you have one data point, it's really connected to the previous data point and to the next one. Also, a big issue we have with time series is that it's not really possible to know the ground truth. So when you do machine learning, you're using this ground truth at least to learn, to know if your results are good or not. But with time series, it's not really possible, and I'm going to give more details about that afterwards. So yeah, since I have this screen, you can see a little bit more about what Upalgo is about. It's really about visualizing time series, building machine learning models, looking at the anomalies, adding labels on the data, and relearning and improving the results based on that.
Julien Muller: 00:06:10.679 So a little bit of history really why we decided to use InfluxDB as a time-series database for storing our data. So it’s been quite a long time now that we use InfluxDB, almost four years. But in the past, I’ve been using many storage systems for time series. I started a long time ago with relational databases. I went through NoSQL databases when they started to be available. I’ve been using Hadoop, OpenTSDB, new NoSQL databases that are adapted for time series, but I always had a lot of issues with that. So I mean, the first thing is, as I said, it’s big data. It’s big data because, well, sensors send a lot of data, heterogeneity is really high. I mean, each data set is different. Yeah, IoTs provide usually high-frequency data. Historical systems were quite slow for this kind of data, and you want sort of real-time data. I mean, it depends on the use case. I can provide more details on my use case a little bit later.
Julien Muller: 00:07:38.668 And with time series, you need features that probably no other use case needs. So things that are only designed into time series systems, like windowing, [inaudible] features on your windows. Plus, you need a community that is able to help you when you have technical issues. But not only a large community, a community that has the same focus and issues to solve as yours. So basically, the main reason we decided to use InfluxDB is that it's really a time-series database. It's designed for that and probably only that. The point being, really, I'm doing machine learning and I'm a data scientist. That's what I do and what I want to do, and I don't want to do time series storage. So if I use a system that is not exactly designed for that, I'm going to spend a lot of time building schemas, managing time frames, for instance, managing nanoseconds in some data sets and seconds in others. By design, the system will not be made to perform really well in my specific context.
Julien Muller: 00:09:08.975 While here the system is designed for time series, the focus on performance will be really close to what I exactly need. Plus, one thing that is really important for me is that I don’t need to build a schema. I spent years building schemas, and I don’t want to do that anymore. I get new customers every day. They all come with data. It’s really hard to anticipate what they come with, and as I told, I don’t want to spend hours and hours building schemas for the new data every day. I want to focus on the actual machine learning.
Julien Muller: 00:09:55.218 So to dig a little bit deeper into how we use InfluxDB, I made a small architectural schema. So you have to understand it's really simplified because I wanted to put the emphasis on when we actually insert data into Influx and when we use it. So the first challenge is really continuous data inserts. We get data from IoT devices, machines, satellites. So it's coming in at a very regular pace during the day, during the night. It's always, always writing data into the system. And at the same time, we do processing. So what is specific with machine learning is that you basically do several kinds of reads and writes. Maybe the most obvious one is when you learn a model. So if you want to build a model, you have to learn on data. So you have to read the data from the database. This is usually a very big read, and it has to be quite fast. And also, you have other processes running. Usually, you have to clean up the data, look at it, build metadata. So this is reads and writes. And also, at the end of the process, you want to read very recent data, for instance for anomaly detection or for prediction. You take the last chunk of data in order to predict the next value.
Julien Muller: 00:11:42.223 So this is really where InfluxDB helps us. Basically, we choose to use a REST API to make our queries so that we have a common layer and one problem to solve to read the data even if we have several technology stacks on top of that including the user interface, distributed machine learning processes running. Also, yeah, one thing I didn’t mention and it’s quite important is that when you have user interfaces, you have to show time series graphs to the users. And it’s a very small query compared to the others, but you have a user waiting for you. So in this case, it has to be fast. You cannot make the user wait like three, five seconds in order to see a chart. And every time he zooms, it has to be quite instantaneous. So a regular time series would be like 20 million points. So you have to find a way to actually show this graph quite fast.
Julien Muller: 00:13:06.160 So now I'm going to be a little bit more technical about machine learning, and I'm really going to try to explain what our issues are and how we solved them, or how we are trying to solve them. So as I said, inserting data at a continuous speed really impacts the system. You always have to take that into account because you have to do other things on top of that, but you're not going to stop inserting data. Also, your series are big because you need big series. For instance, for anomaly detection, the most common or basic algorithms would be a One-Class SVM or Isolation Forest. So as a rule of thumb, I need at least one million points to learn. Why that? If you have one million points and you decide to calculate features on 60-point windows, which is quite standard for seconds or for hours, you end up with around 15,000 windows to learn from, which is not much, in fact. So you really need a lot of data, to make a lot of queries, to get a lot of data points, in order to make a good learning. Plus, in this example, I was mentioning 15,000 windows; this would be more like a minimum, because if you do anomaly detection on multiple series - so Isolation Forest on maybe two or three series - you have to double or triple that. So we're talking about two or three million points.
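As an illustration of the window arithmetic Julien describes, here is a minimal sketch (assuming NumPy; the mean/std/min/max feature set is just an example, not Upalgo's actual feature list) of turning one million raw points into roughly 16,000 feature windows:

```python
import numpy as np

def window_features(values, window=60):
    """Cut a series into fixed-size windows and compute simple per-window features."""
    n = len(values) // window                     # drop the trailing partial window
    w = np.asarray(values[:n * window]).reshape(n, window)
    # one feature row per window: mean, std, min, max
    return np.column_stack([w.mean(axis=1), w.std(axis=1),
                            w.min(axis=1), w.max(axis=1)])

series = np.random.default_rng(0).normal(size=1_000_000)
feats = window_features(series)
print(feats.shape)  # 1,000,000 points in 60-point windows -> 16,666 feature rows
```

With two or three parallel series, as in the Isolation Forest case he mentions, the feature matrix grows proportionally.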
Julien Muller: 00:14:56.300 And if you use other algorithms like LSTM based on the raw data, here you also have to learn the characteristics of the data that you don't get out of the features. So now I'm talking about 5 or 10 million points just for the learning phase. So really huge data sets. And also, a big issue we have - and when I say we, it's not only at Ezako, but anyone who wants to do good anomaly detection on time series - is that you cannot know the truth. To be more specific, if I have an algorithm that finds anomalies, you can tell me if it's correct or not. If it's a correct anomaly, I can point at the anomaly straight on the screen and you can say it's a true positive. It's quite easy. You see a peak, you say, "Yes, it's an anomaly." You don't see a peak, you're usually able as an expert to say, "It's a false positive. It's not an anomaly." Also, if you look at just random noise or a chart where nothing is happening, you're able to say it's a true negative. You didn't find an anomaly there because there's nothing to be seen. But the case of the false negative is much more complicated, because when you have 20 million points, there's no real way to prove whether an anomaly is hidden in there. So there is a big issue with knowing the whole truth, and this is the reason why it's really hard to use classification or supervised algorithms. So you're stuck with unsupervised algorithms, or you have to find a way to increase your knowledge of the data and have more labels.
Julien Muller: 00:16:53.699 So this is why you really need to have an anomaly detection workflow that is very complex, evaluate it so that you can end up with better knowledge on your data because anomaly detection is really about adding more information on your time series. So, as I told you, the bare minimum is to be able to insert data, then you usually calculate features. So it’s not totally mandatory because some algorithms I said can bypass this step, but it is at a price; you need more data. So quite commonly, we do calculate features. So these features are built on windows of time. And basically, it requires reading the time series, the entire time series, I mean, do calculations on windows, and then write the time series. Maybe on another slide later I can describe why we decided to write this time series because we use InfluxDB not only as a storage system for time series, but also sort of a caching system for intermediate time series that we use afterwards.
Julien Muller: 00:18:28.677 So once you have calculated features, you really need to understand your data. At this point, understanding your data is really having a user look at the data. And it's a technical challenge, in fact, because how do you show 20 million points to a user? You could zoom in on part of the data, but then you would miss the big picture. So our customers are really requesting to see the entire time series, to have an overview of the series. And this happens at many steps. So how do we do that? We started with manual techniques. It didn't work that well. Now we're using the SAMPLE function of InfluxQL, and actually, it works really great because it's really fast. So we can show a chart of - or let's say you want to show between 2,000 and 5,000 points on a chart to a user so that he can actually read it and it's not too heavy in terms of navigation. So based on your 20 million points, you can sample the data with InfluxQL and get this data quite fast.
Julien Muller: 00:19:53.453 It’s not perfect since when you do anomaly detection, you don’t fund random points. You want interesting points to come out on the picture. But on this, we don’t have the perfect answer right now. There are other options we’ve been investigating like built-in downsampling manually by also using InfluxDB downsampling feature. But it’s not really answering to the issue because we don’t want to lose the interesting point. So we cannot just say, “I want the max value in this time frame or the mean value or - it has to be some complicated algorithm. And we’ve been going through this for a lot of time, and up to now, the best we could find was the sample feature because it’s really fast, and at least that’s important for the user.
Julien Muller: 00:20:56.354 So, as I said, we store our raw data in InfluxDB because we need a reference. Some algorithms can go directly and use the raw data, but in practice, you always have to clean up your data, do some adjustments, modifications. Maybe you want to normalize your data. And since these are large data sets, you don't really want to keep them in memory. You have to do something with this data. Also, if you calculate features, there's a very high probability that you're going to reuse these feature windows, especially if you're trying to iterate on different algorithms' hyperparameters. So we decided to use InfluxDB as sort of a cache storage. It's not stored in memory, but in fact, there's no good solution for storing large amounts of data in memory, so finally, it's quite good, and we can reuse the series as much as we want.
Julien Muller: 00:22:13.100 Also, here, using retention policies, you can really manage your cache. So finally, it's not such a surprising idea, because you can manage your retention on this. And also, as I told you, we were not really able to use downsampling because of our use case. We're doing anomaly detection, so we don't want to downsample the data, because we really want to have the maximum granularity - maybe the anomaly is only one point hidden inside the data set. Also, if I want to relearn, if I downsampled my data, well, I'm not learning on the same thing, so I cannot really make any use of it. So yeah, to give the really big picture of the anomaly detection workflow: starting from the raw data, you try to extract metadata, like, is there any seasonality in my data? Is it stationary or not? You want to calculate features on your data, making sure you choose the correct time window, for instance. At this step, your user wants to visualize the data, so it has to be fast, so storing this intermediate data is a good idea. Then you can learn and detect anomalies.
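As an illustration of managing cached feature series with retention policies, a statement like the following (the database and policy names here are hypothetical) would make measurements written under that policy expire automatically after a week:

```sql
-- cached feature windows are recomputable, so let them expire after 7 days
CREATE RETENTION POLICY "feature_cache" ON "upalgo" DURATION 7d REPLICATION 1
```

Because the cached series can always be recalculated from the raw data, letting the database expire them is cheaper than managing cache eviction by hand.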
Julien Muller: 00:23:45.942 Oh, a small tip. Something we've been using for years that has been really, really helpful - it's really simple but very powerful - is using naming conventions on measurement names. In fact, all the cache system I just described is done with naming conventions on the measurements. So we can just query, and if we don't have the data, we know we have to recalculate it. So we don't have anything to manage except knowing how the measurement names are built, like the actual measurement, the window size, and the feature you're calculating, for instance. So yeah, just to finish on that, all this we do to detect anomalies. And once we've detected anomalies, we really want to improve these results, because this is a very statistical approach to anomalies. Like, what is an anomaly? Here, it's an outlier. We statistically decided this point is different, or this set of points, this event, is different from the rest of the time series. But maybe our user has a different opinion. I mean, as an expert, you don't always see an anomaly where the system sees one. So you definitely need to do something on top of that. And that's why I have an arrow going nowhere, because I do need a feedback loop. I need to use this result in order to reinject it into my system to detect better anomalies, or to detect what the expert is expecting as an anomaly.
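A minimal sketch of the naming-convention idea: derive the cache measurement name deterministically from the raw measurement, the window size, and the feature. The separator and scheme here are illustrative, not Ezako's actual convention:

```python
def feature_measurement(base, window, feature):
    """Derive the cache measurement name from the raw measurement name,
    the window size, and the feature, e.g. 'temperature__w60__mean'.
    Querying that measurement and getting nothing back means the
    feature series simply has to be recalculated."""
    return f"{base}__w{window}__{feature}"

name = feature_measurement("temperature", 60, "mean")
print(name)
```

The appeal is that no separate cache index is needed: the name alone tells you what the series is and how to rebuild it.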
Julien Muller: 00:25:46.731 So this is why we really need labels. Labels are really extra information I want experts to add on my data, so that I know more about the data and can do better things. The most simple use of labels is, if I have labels on my learning dataset, I can remove the anomalies from the learning set. In some algorithms like One-Class SVM, we love not having anomalies in the learning sets. Also, having labels is really something I can use to calculate the quality of my detection and then maybe compare several algorithms or hyperparameter configurations, so I can use it to actually assess the quality of my algorithm. And if I can assess the quality of my algorithm, I can make a better algorithm. But labeling is a really hard task. I mean, I've been doing that, and if you try spending an entire day just labeling data - no one wants to do that, actually. It's really boring, and the more labels you put, the more chances you have that your label is wrong, because, at some point, you're just tired of this very repetitive task. Basically, you're looking at your time series and you're saying, "Oh, yeah. This is an anomaly. This is not an anomaly. Oh, yes, this is normal. For this one, I don't know. I need to compare two series to have a better understanding of the problem."
Julien Muller: 00:27:34.028 So the point being, doing some labeling is okay, but you don't want to do a lot of labeling. But your algorithms want a lot of labels. So this is a very big challenge we're trying to solve - this is a real case scenario, and we are doing that every day - which is how you put 20,000 labels on 20 million data points in a few minutes. And the answer we came up with is really in two phases. The first one is to really help the user label the data. We didn't find any user interface that was ergonomically designed for that, that gave the expert the power to see the data really fast, understand it, and label it in a very, very short amount of time. I'm talking about seconds, because I need a lot of labels. But it's not enough. And the point is, how do you make more labels? Well, you can use artificial intelligence, and that's what we've been doing.
Julien Muller: 00:28:52.149 So yeah, basically, as I said, if you want to go for supervised machine learning classification, you really need labels. Any information you can add on top of the data is very usable, very important. So if you can, you should really increase the value of your data by adding information on top of it. And manual labeling is really a pain. You don't want to do it. So just to dig a little bit into the ergonomics, the first strategy is to have a better interface to speed up labeling. So you want to reduce the number of clicks the users are going to do. I have a really small thing that helped us a lot. One specific customer was labeling a lot of data, and he came up with this idea of having a button which is Confirm and Next, because he realized that in these data sets, we did some fast anomaly detection, and the results were quite good, in fact. Which means he was seeing a lot of events which were actually true positives maybe 95% of the time, and sometimes the events were not correct. So having a Confirm and Next button gave him the ability to put a positive label within like three or five seconds, and to stop and think about the cases that were neutral or negative.
Julien Muller: 00:30:42.834 Also, as I said, when you do labeling, at some point, you're starting to make mistakes. It has been studied a lot: when lunchtime is coming, or when you have been labeling data for two, three hours, you start to make mistakes. And actually, looking at the patterns, when we see events that are very similar in the same time series - if we see that one event has been considered a true positive by an expert and another very similar event is considered a false positive - well, we can identify that using algorithms. So this helps a lot because it's about trust. The user knows that he can make mistakes, and we're going to help him spot these mistakes afterward. This increases the quality of the labels, which is really key for the next step.
Julien Muller: 00:31:45.228 Yeah. Also, in this example, I'm talking a lot about true positives and false positives. But when you're labeling, sometimes you're looking at improving your algorithm, so you want to get a score. So true positives and false positives are going to help you. Sometimes you want to label data to do something different, like classification. So your labels can also be classes. Like, on an electrocardiogram, you could label cardiac arrhythmia so that you can extract this data afterwards and make a classification model to identify whether a patient has an arrhythmia or not.
Julien Muller: 00:32:43.073 So once you have these user labels inserted, and once you're quite sure about the quality of these labels, it's still not enough. It's just not enough because, in machine learning, we need as much data as possible. So we came up with an approach where we try to propagate the labels we have available. An expert cannot go through an entire time series and find all the anomalies there are, so that we could learn on the best-labeled data sets. So we built algorithms that go through the time series and find similar patterns in order to propagate labels; basically, we increase the labels available. So every time an expert puts a label, we try to find 20 similar events in the data set so that we can put more labels. And when the expert sees we can just multiply these labels, he's even more motivated to add more labels, because he sees that one click for him is like 20 clicks, 20 labels afterwards.
Julien Muller: 00:34:11.715 And by the way, this is a very resource-consuming job, so these kinds of things have to be distributed. The way we distribute these kinds of calculations is to make chunks of data on the time series. So basically, we have this rule of thumb to try to make segments in our time series - chunks that are around one million data points - so that we can distribute the jobs easily. And we try to make sure there is sort of an overlap on the chunks, so that we have some points before the actual chunk and after it, and we can cover time windows that don't fit perfectly into the chunk. And these are small enough jobs that can be distributed, because it's really easy to parallelize these kinds of things just based on this window.
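The chunking-with-overlap idea can be sketched as follows (the one-million-point chunk size follows the rule of thumb above; the 60-point overlap is an illustrative window size, not a documented Ezako constant):

```python
def chunk_ranges(n_points, chunk=1_000_000, overlap=60):
    """Split [0, n_points) into ~1M-point chunks, each extended by
    `overlap` points on both sides so that a window spanning a chunk
    boundary is still fully contained in at least one chunk."""
    ranges = []
    for start in range(0, n_points, chunk):
        lo = max(0, start - overlap)
        hi = min(n_points, start + chunk + overlap)
        ranges.append((lo, hi))
    return ranges

print(chunk_ranges(2_500_000))
```

Each (lo, hi) pair then becomes one independent job: a worker reads that time range from InfluxDB and processes it in isolation.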
Julien Muller: 00:35:22.796 Also, this is very use-case dependent, but sometimes propagation could be just finding exactly similar labels, or you could need hundreds and hundreds of labels. And the way we address that is, every time the AI is doing something by itself, it tries to associate a confidence value with it, so that we can put a threshold and say, "Well, in this case, I want to be very strict; I just want very similar events," whereas in another case, I try to be more flexible. So you can see that the color is green, but it gets darker and darker the closer it is to the actual events from the user. Our second challenge is, based on all these labels, we want to reinject this knowledge into our system. So basically, we want to do better learning and detection based on these labels, so get this information back through the workflow I just showed you.
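A toy sketch of the confidence-threshold idea: each candidate event the propagation algorithm finds carries a similarity score against the expert's labeled event, and the threshold decides how strict or flexible the propagation is (the event names and scores here are made up):

```python
def propagate(candidates, threshold):
    """Keep only the candidate events whose similarity to the expert's
    labeled event clears the confidence threshold."""
    return [event for event, confidence in candidates if confidence >= threshold]

# (event id, similarity to the expert's label)
candidates = [("ev1", 0.97), ("ev2", 0.81), ("ev3", 0.55)]
strict = propagate(candidates, 0.9)    # very similar events only
flexible = propagate(candidates, 0.5)  # many more labels, less certain
print(strict, flexible)
```

Lowering the threshold trades label quality for label quantity, which is exactly the use-case-dependent choice Julien describes.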
Julien Muller: 00:36:53.369 So this is really user feedback. The idea is to be able to relearn continuously. Sometimes, it could be very simple but very helpful. Like, if I use a [inaudible], I have a threshold. If, on new upcoming data that is a little bit different, the experts find false positives - maybe three of them in the same day - we can actually take that into account and change our hyperparameters on the [inaudible] so that we move the threshold, and we are going to be able to stop creating false positives. So we can learn the normality better, even if in this case it's not really about learning, it's about understanding the normality of the data. You can also use it as a relearning process if you first learned on older data that was available but you would like to learn on more data, or more recent data. Every time you catch an anomaly with One-Class SVM, you can remove these windows from the learning process so that you get a better model, a model that represents the normality better, so it can detect anomalies better.
Julien Muller: 00:38:16.991 So definitely, the UI is really important in this context because it's really the interface - and by interface, I mean the way to communicate with the experts - and the higher the quality is, the better the result will be. So you don't want to add information on top of your time series that is wrong or partially wrong. The higher the quality is, the better it is. So you really want to show your expert exactly what's going on. On the screen, for instance, probably the anomaly is correct but not well segmented, not well centered. And showing that gives the expert the possibility to fix it, so that we can send feedback to the process and it can learn better.
Julien Muller: 00:39:16.870 By the way, yeah, we use these labels to build scores. And actually, calculating scores on time series is not really an easy task. It's been really well covered for other kinds of data. But with time series, an anomaly is not a data point. It's more like an event. So it's a period of time. An algorithm can find a certain period of time as an anomaly, like in this case, an [inaudible] roll-up. So in this case, I'm not 100% satisfied with the result, but still, it's quite good because it found the anomaly. And if you want good optimization, you have to come up with a scoring system that takes this into account, as I just said. Also here, as you have a score, what you want to do is maximize your score. So there's a lot of calculation to be done, and you want it to be distributed. So building chunks on the time series you're getting from Influx is really a super simple way to distribute any algorithm.
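One common way to score event-shaped anomalies rather than point anomalies is intersection-over-union of the detected and labeled time intervals; this is a generic sketch of that idea, not necessarily the scoring Upalgo actually uses:

```python
def overlap_score(detected, truth):
    """Score a detected event against a labeled event as the
    intersection-over-union of the two time intervals [start, end)."""
    inter = max(0, min(detected[1], truth[1]) - max(detected[0], truth[0]))
    union = max(detected[1], truth[1]) - min(detected[0], truth[0])
    return inter / union if union else 0.0

# The detection found the anomaly but is not perfectly segmented or
# centered, so it gets partial credit rather than zero.
score = overlap_score((100, 160), (110, 150))
print(score)
```

A score like this rewards finding the event at all while still penalizing poor segmentation, which matches the "found it, but not well centered" situation described above.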
Julien Muller: 00:40:53.062 Okay. To briefly sum all this up: over time, I realized that time series labeling and taking the user feedback into account is a very, very complex and difficult task, and I don't want to spend most of my time managing how I store my time series, how I get it, how fast it is, how I'm going to solve these kinds of issues. This is really why I want to use a really good time-series database that has been designed by people who have the same issues as I do. And I want the users of the database to have, like me, the same problems as the ones I have, because I want the community to help me solve the technical issues I encounter. Also, I have to say a user-friendly UI is really key. We've been trying to use the user interfaces on the market, and what's available with InfluxDB is really great if you want to explore your data sets. But in our case, we really want the users to label data really fast, so by design, it needs a specific interface. So I believe everyone should try to use what's available, and if the need is too specific, build something that answers the actual customer needs.
Julien Muller: 00:42:30.387 Yeah, use as many algorithms as you can to help the user. This is really what AI is about: bringing intelligence to actual experts to empower them. And finally, yeah, I have to say what I like in InfluxDB is that it's pretty smooth. I do have technical issues, but globally, way fewer than before, way fewer than with other systems I was working with. I just want to plug it in and forget about it. Another example of that is on dev platforms: the InfluxData container is very simple to deploy and to use on any dev environment. I mean, in three minutes, you get it. You can back up and restore the database, so a data scientist can get all the data he needs within three or five minutes. And it's really nice, because we don't want to think about this. We want to do machine learning on the data.
Julien Muller: 00:43:39.479 So I just wanted to introduce this subject. I don't have much time anymore. We've been really thinking about migrating to InfluxDB 2, which has been released recently. We looked at IOx briefly, and especially at Flux, which has been out there for quite a while, and we didn't move yet. It is a big step anyway, because we have a system in production with customers. But we started to look at it, and I'm willing to plan a migration within six months, which means that within something like three months, we should be ready to start migrating what we have in our code base.
Julien Muller: 00:44:32.273 Well, that’s my presentation. Maybe you have some questions for me.
Caitlin Croft: 00:44:36.277 Thank you, Julien. Definitely have some questions. What problems did you have with Hadoop? I know you were talking about all the different solutions that you had tried out before using time-series databases. So what were some of the challenges with Hadoop?
Julien Muller: 00:44:56.179 Okay. So Hadoop is a very large ecosystem. So originally, it’s really a distributed file system. So it’s not designed for time series. I mean, really, if you try to use it, the file system is going to be very good for storage. I’m only talking about the time series analytics in this case. Once you store the data in CSV files, well, you pretty much have to do everything yourself. So I would have to design a lot of things in order for it to work. Then you could use HBase, which is a database coming from the Hadoop ecosystem. But there, you have to redesign your entire system for time series. It’s not designed for that. And in this direction, lots of people tried to build products on top of it. It’s been really explored by many people, and it’s just so much easier to use a time-series database.
Caitlin Croft: 00:46:05.002 Can you provide more details on why the no-schema feature was important to you, and can you share a little bit about the experience?
Julien Muller: 00:46:14.831 Well, it’s very simple. Like, this morning, a new customer came in with 15 different kinds of data. He just plugged in with 15 objects, each of which has maybe 10 sensors, I don’t know. I’m not going to work on that. I’m not going to look at the data and say, “Oh, I want to do this schema.” I don’t want to do anything. And in the past, I’ve been trying to look at the data coming in and build schemas on the fly, and it’s a lot of engineering. You have no idea how complex it is to build that. And with InfluxDB, it just comes out of the box. You just send the data and it’s going to be inserted. And you have your measurements and your fields, and that’s it.
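As a rough illustration of what schema-on-write looks like from the client side, here is a minimal Python sketch (the measurement name "engine", tag "unit", and field names are made up for the example). InfluxDB’s line protocol means a new sensor is just a new field key on first insert, with no migration step; the formatter below is simplified and skips the type suffixes and escaping the real protocol defines:

```python
# Simplified line-protocol formatter (illustrative; the real protocol adds
# integer "i" suffixes, string quoting, and character escaping).
def to_line_protocol(measurement, tags, fields, ts_ns):
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

# A point for a hypothetical engine sensor; no schema was declared anywhere.
point = to_line_protocol(
    "engine", {"unit": "42"}, {"temp": 88.1, "rpm": 3000}, 1606809600000000000
)
```

If the customer’s next object ships an extra sensor, you would just add another key to the fields dict and write it; the measurement and field appear on insert.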
Caitlin Croft: 00:47:07.999 So you have your data multiple times in InfluxDB: once for raw data, at least once for intermediary series, as you mentioned, and also the final series with labels. So at least three times your data at the end of the process. Doesn’t that put a strain on storage, infrastructure, and costs? I’m just kind of clarifying. Or is the final data not stored in InfluxDB and just generated on the fly?
Julien Muller: 00:47:37.235 No, we do retain the data for certain projects. I mean, I would keep the raw data as much as I can because if I have to redo calculations, having the raw data gives me the possibility to redo everything. But, I mean, it would take an explosion of servers to keep all my data forever. I wish I could do it, but no, we use retention periods. The idea of the retention periods is that they don’t have to be the same for everything. So it’s not really storing three times the data. It’s over a period of time. So maybe, if you can, you keep some raw data for, I don’t know, three, six months, and you also store your feature tables, and those could be kept for like one or two weeks. That’s why I was talking about cache, because that’s really always the thing. It’s intermediate storage. We’re just keeping it as long as we can because usually, when you’re iterating on choosing an algorithm, doing some modifications, it happens at the beginning of the process. So usually, you need your feature tables during the first weeks of a project, and then eventually you don’t need them anymore. If you do, you recalculate. And you have to pay the price at some point. So either you have storage and you store, or you recalculate and you pay for CPU. So really, you decide where you put the cost.
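This per-series trade-off maps directly onto InfluxDB 1.x retention policies. A minimal sketch, assuming a single-node setup and made-up policy and database names; with the influxdb-python client, the generated statements could be passed to `client.query(...)`, or created with `create_retention_policy` instead:

```python
# Sketch: different retention for raw data (six months) vs. feature tables
# (two weeks, used as a cache). Names "raw_6mo", "features_2w", and the
# "sensors" database are hypothetical.
def retention_query(name, duration, database, default=False):
    """Build a CREATE RETENTION POLICY statement (REPLICATION 1, single node)."""
    q = (f'CREATE RETENTION POLICY "{name}" ON "{database}" '
         f"DURATION {duration} REPLICATION 1")
    if default:
        q += " DEFAULT"
    return q

raw_rp = retention_query("raw_6mo", "26w", "sensors", default=True)
features_rp = retention_query("features_2w", "2w", "sensors")
```

When the feature-table policy expires a shard, the data is simply recomputed on demand, which is the storage-versus-CPU choice Julien describes.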
Caitlin Croft: 00:49:20.281 Absolutely. And I like that you kind of talk about, you do retain the data for different durations depending on what you need it for. You don’t need to keep all of your data for the three to six months, but some of it you only need for a couple of weeks.
Julien Muller: 00:49:34.378 Yeah.
Caitlin Croft: 00:49:35.868 Earlier in the presentation you mentioned a couple of different machine learning algorithms that you use. Do you mind kind of highlighting those again? I guess I just wanted a little bit more information on those.
Julien Muller: 00:49:49.253 Yeah. I mean, classically, in anomaly detection, you would use maybe One-Class SVM and Isolation Forest. As a data scientist, you look at the data and you build yourself an opinion about which algorithm you should use first. And so let’s say, oh, it’s satellite data. Satellites don’t fall down from the sky, so probably the data I get from the satellite doesn’t contain anomalies. So I can use One-Class SVM. This is really instinct-based. And the other approach is to calculate scores on your results so that you can actually compare the results of your algorithms. So basically, we would use a dozen algorithms. As I said, One-Class SVM, Isolation Forest, Gaussian Envelope, [inaudible], which is more like deep learning, and LSTM, which is very interesting because we are using LSTM without features, so that we are not having an opinion about the features we are choosing. In this case, the algorithm is going to choose the features by itself. So it’s interesting in some cases. We have some algorithms that are very fast but, well, the quality decreases. I mean, what is detected is not as good.
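For the two classical detectors named here, a minimal sketch with scikit-learn (assumed available; real pipelines would add feature extraction and the score-based comparison Julien mentions). The synthetic "healthy" data and the glitch point are made up for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic "normal" sensor windows clustered around the origin.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(500, 2))

# Fit both detectors on anomaly-free data (the One-Class SVM assumption
# Julien describes for satellite telemetry).
forest = IsolationForest(random_state=0).fit(normal)
svm = OneClassSVM(nu=0.05).fit(normal)

candidates = np.array([[0.1, -0.2],   # typical point
                       [8.0, 8.0]])   # obvious glitch
flags_forest = forest.predict(candidates)  # 1 = inlier, -1 = anomaly
flags_svm = svm.predict(candidates)
```

Running several such detectors over the same windows and comparing their scores against labels is what feeds the feedback loop described earlier.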
Caitlin Croft: 00:51:24.514 What database size and row count are you storing and working with? How good is InfluxDB on performing with those loads?
Julien Muller: 00:51:38.318 So we run separate databases per customer, so it can vary a lot. So basically, our rule of thumb is that we want to get one million data points in less than one second. It’s specific to our use case because when you talk about real-time, it really depends on the use case. I mean, I’m not sending the data back into the system within milliseconds using MQTT. We are using that to run and detect anomalies. So we believe that one second is an acceptable time frame in this specific use case. So in order to get that, we have to make chunks of one million data points, and that’s the kind of performance we are trying to make sure we achieve on all our customer instances, where usually we try to retain a couple of terabytes of data.
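The chunking idea can be sketched as splitting a large time range into windows sized to the one-million-point budget; each window then becomes one query. The sample rate and chunk size below are assumptions for illustration:

```python
# Sketch: split [start_s, end_s) into windows of ~chunk_points samples each,
# so every fetch stays near the 1M-point / 1-second budget mentioned above.
def chunk_ranges(start_s, end_s, sample_rate_hz, chunk_points=1_000_000):
    """Yield (t0, t1) windows each covering about chunk_points samples."""
    window = chunk_points / sample_rate_hz  # seconds per chunk
    t = start_s
    while t < end_s:
        yield (t, min(t + window, end_s))
        t += window

# A hypothetical 10 kHz sensor over one hour: 36M points -> 36 chunks.
chunks = list(chunk_ranges(0, 3600, 10_000))
```

Each `(t0, t1)` pair would become the time predicate of one query against the customer’s database.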
Caitlin Croft: 00:52:46.721 All right. What do you think about InfluxDB’s labeling feature versus your company’s labeling feature?
Julien Muller: 00:52:55.995 Well, yeah. So as I said, a labeling feature in a complex UI with other features is very different from trying to get as many labels as possible in a very short period of time. So I think it’s interesting. As I said, the more information you can add to your data, the better. So by all means, use the InfluxDB UI for adding labels. But if you really want to put tens of thousands of labels on your data set because you want to do classification, you need something different. So I think it really depends on the use case, on where you want to go.
Caitlin Croft: 00:53:47.357 All right. Did you consider using FFT to find anomalies in time series data?
Julien Muller: 00:53:57.436 Yeah. So it has a cost in terms of CPU. So we’ve been using it in very custom calculations for vibrations of very large engines. Usually, we’re trying to find other ways to model normality in order to have fast results, faster calculations if possible. But I think it’s a good idea anyway. I mean, I’ve been doing it. I’m not going to say anything against that, but I’m trying to find faster ways in general, if possible.
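A hedged sketch of the FFT idea for vibration data: a glitch shows up as energy at frequencies absent from the normal spectrum. The sampling rate, carrier frequency, and glitch band below are made up for the example:

```python
import numpy as np

# Synthetic vibration signal: a clean 50 Hz carrier sampled at 1 kHz for 1 s,
# with a short burst of 300 Hz "glitch" energy injected mid-signal.
fs = 1000
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * 50 * t)
signal[400:420] += np.sin(2 * np.pi * 300 * t[400:420])

# Real-input FFT: magnitude spectrum and the frequency of each bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)

dominant = freqs[np.argmax(spectrum)]            # the 50 Hz carrier
glitch_energy = spectrum[np.abs(freqs - 300) < 5].max()  # energy near 300 Hz
```

Thresholding `glitch_energy` against a baseline spectrum is one simple detection rule; the CPU cost Julien mentions comes from running such transforms over every high-frequency window.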
Caitlin Croft: 00:54:37.221 Yeah, that totally makes sense. Thank you, Julien. We’ll just stay on the line for just another minute here. If anyone has any further questions, please feel free to post them in the Q&A box. Thank you, Julien. That was a really great presentation. I think people really enjoyed it. Lots of good questions. Once again, this session has been recorded, and the recording, as well as the slides, will be made available later today, so if you want to go back and rewatch it and review the slides, it’ll totally be available for you. So really appreciate it. I always love getting to hear our different customer stories. I mean, this is the second time that I have listened to your presentation, and I learned even more today than when we reviewed it before. So it was a fantastic session, and thank you so much for joining.
Julien Muller: 00:55:33.005 Thank you, everyone.
Caitlin Croft: 00:55:38.101 And once again, another friendly reminder. I know I mentioned it at the beginning. We have lots of upcoming webinars as well as virtual events coming up over the course of the next few weeks, so be sure to check it out. I look forward to seeing everyone on Thursday’s Telegraf webinar as well as The Tech Talks with Paul Dix and our virtual meetup and all the other events. So it’s always fun seeing familiar faces. Thank you, everyone, and I hope you have a good day.
Julien is the technical lead at Ezako. Prior to joining Ezako, Julien worked for 12 years at IBM as a Big Data Architect and Analyst on heterogeneous data in California and in France.