How Veritas Technologies Uses InfluxDB to Enable Time Series Forecasting at Scale
Webinar Date: 2019-05-14 08:00:00 (Pacific Time)
The growing popularity of IoT, sensor networks, and other telemetry applications leads to the collection of vast amounts of time series data, enabling forecasting for a multitude of use cases, from application performance optimization to workload anomaly detection. The challenge is to automate a historically manual process, handcrafted for the analysis of a single data series of just tens of data points, so that it scales to processing thousands of time series and millions of data points. In this talk, we will show how to leverage InfluxDB to implement solutions that tackle the issues of time series forecasting at scale, including continuous accuracy evaluation and algorithm hyperparameter optimization. As a real-world use case, we will discuss the storage forecasting implementation in Veritas Predictive Insights, which is capable of training, evaluating, and forecasting over 70,000 time series daily.
Watch the Webinar
Watch the webinar “How Veritas Technologies Uses InfluxDB to Enable Time Series Forecasting at Scale” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How Veritas Technologies Uses InfluxDB to Enable Time Series Forecasting at Scale”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
• Thom Crowe: Community Manager, InfluxData
• Marcello Tomasini: Senior Data Scientist, Veritas Technologies
Thom Crowe: 00:00:00.067 So thank you for joining us. Today, we’re joined by Marcello Tomasini, who is a senior data scientist at Veritas, and he’ll be discussing how Veritas Technologies uses InfluxDB to enable time series forecasting at scale. And with that, I will pass it over to you, Marcello.
Marcello Tomasini: 00:00:22.183 All right. Thank you. And hello, everyone. So yes, I’m a senior data scientist at Veritas Technologies. Let me tell you a little bit about Veritas Technologies. It might not be a very famous or well-known brand, simply because it used to be part of Symantec from 2004 to 2014, so the brand got somewhat lost. But they are the number one data management company. They have a 360-degree approach to data management, where they take care of your data across cloud, on-premise, and hybrid solutions. And they are really strong and famous for two products: InfoScale, a distributed file system for software-defined storage, and NetBackup, which is the number one backup and recovery software for enterprises. So if you need, for example, to back up a data center, you will probably use NetBackup. And in fact, Veritas has been named a leader in Gartner’s Magic Quadrant for data center backup and recovery for the past 15 years, which is pretty incredible. Veritas solutions are used by pretty much every large corporation: 86% of the Fortune 500 and 97% of the Fortune 100. And if you want to take a look at the other products that Veritas offers, there is a nice page that covers all of them.
Marcello Tomasini: 00:02:12.210 So without further ado, let’s go into the agenda for today. I will introduce some basic definitions for forecasting so everybody has a common understanding, and then I will discuss the use case for which we are forecasting at Veritas Technologies. Then we will get into the meat of the talk: model selection, validation, and how we do the online tuning using the support provided by InfluxDB. So let’s start with the definition. What is forecasting? Forecasting means making predictions of the future. It does not mean predicting the future. We make predictions with a clear understanding that those predictions are not exactly perfect, so they have an error. And what makes a forecast useful is when that error is small for the type of use case we are trying to address.
Marcello Tomasini: 00:03:23.815 Also, in order to be able to forecast, we need to have an understanding of the process that we want to forecast. And along with that, we need data. With the advent of IoT and modern time series databases, it’s a lot easier to work with time series, and we have a lot more data, which makes forecasting possible for a lot of use cases. One key assumption of forecasting is that whatever we are trying to forecast is based on its past history. Since we cannot see the future, this means we assume that whatever happened in the past is a good indicator of what will happen in the future. If that is not true, we cannot forecast.
Marcello Tomasini: 00:04:23.023 There are also several types of models for forecasting. There are machine learning-based models that use multiple variables called predictors. But the most popular have been the time series models. These are very popular because they don’t require knowing or extracting all these variables to train the model. They just look at the past values of the quantity that we want to forecast, and they forecast the future. And many times, a simple time series model performs as well as a more complex machine learning-based model.
Marcello Tomasini: 00:05:06.299 Another important concept that you might have come across if you read about forecasting is seasonality, which can be additive or multiplicative. For example, in the pictures here we see, on the left, an additive type of seasonality, while on the right, a multiplicative seasonality. The difference is just the spread of the seasonal swings: in an additive model, it stays constant; in a multiplicative one, it increases over time. Depending on the type of data you want to forecast, you want to keep that in mind. And a common pattern to deal with a time series is to take the original data, find the trend in the data and remove it, then find the seasonality, which is a recurrent pattern, and remove it as well. Whatever is the remainder is assumed to have mean zero and no auto-correlation. If the mean is not zero or there is auto-correlation, it means that we didn’t do a very good job of extracting the trend and the seasonality.
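The decompose-and-check-the-remainder pattern described above can be sketched in a few lines. This is a minimal classical additive decomposition on a synthetic series (the trend estimator, period, and series are all illustrative, not anything from the talk):

```python
# Classical additive decomposition sketch: trend via centered moving average,
# seasonality via per-phase means of the detrended series, remainder as what
# is left. If the remainder mean is far from zero, the extraction was poor.

def moving_average(values, window):
    """Centered moving average; edges use whatever points are available."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def decompose_additive(values, period):
    trend = moving_average(values, period)
    detrended = [v - t for v, t in zip(values, trend)]
    # Seasonal component: average the detrended values at each phase.
    seasonal_means = [
        sum(detrended[i::period]) / len(detrended[i::period])
        for i in range(period)
    ]
    seasonal = [seasonal_means[i % period] for i in range(len(values))]
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, remainder

# Synthetic daily series: linear trend plus a weekly (period=7) pattern.
series = [10 + 0.5 * day + [3, 1, 0, -1, -2, -1, 0][day % 7] for day in range(56)]
trend, seasonal, remainder = decompose_additive(series, period=7)
mean_remainder = sum(remainder) / len(remainder)
```

In practice a library routine (e.g. a seasonal-decomposition function from a statistics package) would replace this hand-rolled version, but the shape of the computation is the same.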
Marcello Tomasini: 00:06:20.647 Okay. With that said, we can move to the specific use case at Veritas. We use forecasting for several use cases at Veritas, but the most important one is storage forecasting, and that is because of Veritas Predictive Insights. What we’re trying to do is keep track of the storage consumption of these NetBackup appliances that are used for backup. If an appliance runs out of storage, then the backups fail. And if a backup fails, it means that at that point, if any type of event happens on the infrastructure of a company, we are at risk of data loss.
Marcello Tomasini: 00:07:12.609 So Veritas has more than 10,000 of these NetBackup appliances deployed in the wild and reporting daily data to Veritas. They report many types of data — what is interesting for us is the storage data, which reports one point every 15 minutes for each storage partition type in the appliance. And we have more than two years of data, so overall, we have more than five billion storage data points. This storage forecast can be used, for example, for resource planning. It can be used to detect workload anomalies — for example, a sudden spike in the storage used for backups. It can be used to identify data unavailability or SLA violations if we are running out of storage. And also for sales: our support can clearly see if a customer is running out of capacity and then try to sell additional capacity.
Marcello Tomasini: 00:08:33.951 So this is what the data looks like. It’s just a timestamp, and then for each partition type, we have the amount of storage used in bytes. Forecast data is also consumed by other algorithms. Specifically, we have a set of algorithms that try to assess the [inaudible] status of the appliance. We also have a UI. The UI is only available internally; it will probably be available in the future for our customers and partners too. But this is what it looks like: it shows the storage consumption for each partition and the forecast. All this forecasting and the other machine-learning algorithms run on our custom machine-learning platform. It’s actually a relatively standard type of platform that other corporations have also implemented. We have a Kubernetes cluster running our pods, we have Apache Airflow to manage the machine-learning pipelines, and then on the data layer, we have the usual document-based storage, like MongoDB and ArangoDB, we have S3 buckets, and for all the time series data, we have InfluxDB.
Marcello Tomasini: 00:10:07.815 So why did we choose InfluxDB for the time series data? First of all, as I said, we have more than just the storage data. We also have CPU usage, disk IOPS, network IOPS, and so on. And all this data comes in as time series. So the time series database is being used by multiple algorithms, and for that reason, it has to be able to serve a heavy workload. At the same time, it is used to serve the data back to the UI. The queries that I showed before for the UI that displays the forecast go directly to InfluxDB. And in that case, InfluxDB is very easy to work with because it provides a REST API in itself.
Marcello Tomasini: 00:11:20.703 It also has a lot of other nice features that make it a lot easier to work with time series data. For example, group by time: it seems like a relatively simple operation until you try to implement time series [inaudible] yourself. For example, with a document-based store like MongoDB, you will end up creating time buckets in your documents for performance reasons. That means you have to choose the bucket size up front, which makes the whole system a lot less flexible. Also, by default, [inaudible], which is supported by the community.
Marcello Tomasini: 00:12:17.141 And then another super useful feature we found is the DataFrameClient that is part of the Python library for interfacing with Influx. It is very, very useful because you just query Influx, and Influx will return a Pandas DataFrame. So most of the data science work becomes trivial because the data is already laid out in your dataframe. As for our hardware setup, we have four nodes in the Kubernetes cluster for computational purposes; that is where the forecasts get run. What is more interesting is the Influx machines. As I said, we run an HA setup. Of the two links, there is this link to InfluxDB Relay; the second one is the one that is supported by the community.
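The DataFrameClient workflow mentioned above looks roughly like this. The `DataFrameClient` class is part of the `influxdb` Python package; the measurement, tag, and database names here ("storage", "serial", "telemetry") are illustrative stand-ins, not the actual Veritas schema, and the network call is kept inside a function since it needs a running server:

```python
# Sketch of fetching one appliance's storage series back as a Pandas
# DataFrame via the influxdb package's DataFrameClient.

def build_query(measurement, serial, start="now() - 30d"):
    # Parameterized InfluxQL query; one row per 15-minute storage sample.
    return (
        f'SELECT * FROM "{measurement}" '
        f'WHERE "serial" = \'{serial}\' AND time > {start}'
    )

def fetch_storage(serial, host="localhost", port=8086, db="telemetry"):
    """Returns {measurement: DataFrame}. Requires a running InfluxDB
    instance, so it is not executed in this sketch."""
    from influxdb import DataFrameClient  # pip install influxdb
    client = DataFrameClient(host=host, port=port, database=db)
    return client.query(build_query("storage", serial))

query = build_query("storage", "APP-0001")
```

Because the result arrives already indexed by time as a DataFrame, resampling, detrending, and feeding it to a model are one-liners from here.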
Marcello Tomasini: 00:13:22.612 So as you may see, the machines that we use to run Influx are not really big from a computational point of view, but they have a lot of RAM. And we chose SSD storage with high IOPS in order to be able to support a high throughput. This is very important: Influx requires fast storage and also a lot of RAM to achieve high performance, so you really want to maximize both the RAM and the IOPS that are available. Also, it is worth noting that while 3,000 IOPS might seem like a lot, in practice, depending on the cloud provider, those IOPS might be just a burst rate rather than sustained IO.
Marcello Tomasini: 00:14:20.697 So one setting that you might want to tune in your Influx setup is the WAL sync, in order to try and [inaudible] the IOPS. In order to run the HA setup, we have this Nginx load balancer. The way we run it is not exactly the same way shown in the picture: it does not run on its own machine. It’s actually run on one of the two Influx nodes, so the picture is sort of collapsed. And we found out that Nginx can be, in itself, a bottleneck. So you want to make sure that there are enough workers and connections allowed on Nginx so that it does not become the bottleneck. Especially if you have a heavy workload, you might find that Nginx tends to hamper performance because it starts caching, so it eats up the RAM of one of the two nodes. One thing that we are planning to do is to move from the Nginx load balancer to an Elastic Load Balancer in the production deployment in AWS. A nice feature of the HA setup is also on the query side: read workloads go to both instances, so your read throughput pretty much gets doubled, while the write throughput is limited, because in order to maintain consistency between the two instances, every write has to go to both of them at the same time.
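The WAL sync setting referred to above is, in InfluxDB 1.x, the `wal-fsync-delay` option in the `[data]` section of `influxdb.conf`. The value below is purely illustrative; the speaker does not say what Veritas uses:

```toml
# influxdb.conf (InfluxDB 1.x) — batching WAL fsyncs reduces write IOPS on
# slower disks at the cost of a small durability window. The default "0s"
# fsyncs on every write; a short delay groups many writes into one sync.
[data]
  wal-fsync-delay = "100ms"
```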
Marcello Tomasini: 00:16:20.779 Okay. So now we have the hardware setup and our machine-learning platform, and we need to run the forecast in an automated way. And when we try to run in an automated manner, there are a lot of issues. That is because, as I said, we have more than 10,000 appliances, and for each appliance, we are forecasting for each type of storage partition. So we end up with more than 70,000 time series — it’s impossible to repeat by hand the sort of process that I showed in the beginning, where you identify the trend, [inaudible], and so on, right? You cannot expect somebody to take [inaudible]. The system has to be able to detect any anomaly, run by itself, and take actions based on whatever it finds.
Marcello Tomasini: 00:17:25.624 So let’s take the three challenges: model selection, validation, and online tuning. Model selection means, pretty much, selecting the best model for the type of data we have. Truth be told, this part can still be done somewhat manually, mostly because we assume that the data is coming from similar sources. So if you find a model that works for one type of time series, it is reasonable to expect that it’s going to perform across all the similar time series. But it still has a lot of issues. For example, there might be missing values. We might decide to run a process called data imputation, where we fill in those missing values. But sometimes these appliances might not report for an extended period of time, and the algorithm has to be able to handle this missing data. There might be outliers. They might be there because of some bug, or they might be genuine outliers. We might decide to remove them so that they don’t skew the forecast, but in order to remove them, we need to be able to automatically detect them. And that in itself is a tough problem.
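The two cleaning steps mentioned — imputation of missing values and automatic outlier detection — can each be illustrated with a deliberately simple approach. This is a minimal sketch, not the production logic at Veritas: linear interpolation for gaps (it assumes at least one known value on some side), and a median/MAD rule for outliers:

```python
# Minimal data-cleaning sketch: fill gaps by linear interpolation, flag
# outliers by their distance from the median in units of the median
# absolute deviation (MAD).

def interpolate_gaps(values):
    """Fill None entries by linear interpolation between known neighbors.
    Edge gaps copy the nearest known value; assumes >= 1 known value."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]
    for i, v in enumerate(result):
        if v is None:
            left = max((k for k in known if k < i), default=None)
            right = min((k for k in known if k > i), default=None)
            if left is None or right is None:
                result[i] = result[left if right is None else right]
            else:
                frac = (i - left) / (right - left)
                result[i] = result[left] + frac * (result[right] - result[left])
    return result

def flag_outliers(values, threshold=3.0):
    """Flag points more than `threshold` MADs away from the median."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    mad = sorted(abs(v - median) for v in values)[len(values) // 2] or 1.0
    return [abs(v - median) / mad > threshold for v in values]

filled = interpolate_gaps([1.0, None, 3.0, None, None, 6.0])
flags = flag_outliers([1, 1, 1, 1, 100])
```

Real pipelines have to decide per-signal whether an extreme point is a bug or a genuine workload spike, which is exactly the hard part the talk points out.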
Marcello Tomasini: 00:19:05.424 And then there are things like trend and seasonality, the decomposition that I was describing before. In this [inaudible] model, they have to be extracted in an automated manner. Also, seasonality is not just one type of seasonality: there could be multiple seasonalities with different periods. You could have a daily seasonality or a weekly one, and how do you detect that in the data? And then when we extract the trend, the time series is not just monotonically increasing or decreasing; there might be changes in the trend. And if you try to fit a simple linear model without taking into account those change points, then our model would be completely wrong.
Marcello Tomasini: 00:20:06.894 And at the end, once we choose our model, we still have to tune it, right? We have to find the set of parameters of the model that gives us the best performance. And we could say, “Okay, let’s run a cross-validation.” We’ll talk about it later. But again, we have 70,000-plus time series; that is a really expensive process. So we settled on a model that takes care of most of these problems for us: Prophet. It’s been developed by Facebook. It has three components: a growth component, G; a seasonal component, S; and a holiday component, H, which is used, for example, to account for extreme events like Christmas or Super Bowl events and the like. It has a very flexible model that handles the change points automatically. It fits the seasonality with a Fourier series, and therefore it’s a very flexible model in itself. It’s very robust: it does not care if you have missing data points, it just fits whatever data you have, which makes it a lot easier to run at scale.
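Fitting Prophet to one partition’s series looks roughly like the sketch below. Prophet’s API really does expect a DataFrame with `ds`/`y` columns and emits `yhat` with uncertainty bounds, but the hyperparameter values and all data names here are illustrative assumptions, and the model-fitting function is not executed since it needs the `prophet` package (formerly `fbprophet`):

```python
# Sketch of a Prophet fit/forecast for one appliance storage partition.

def to_training_rows(samples):
    """Map (timestamp, used_bytes) pairs to Prophet's ds/y column names."""
    return [{"ds": ts, "y": used_bytes} for ts, used_bytes in samples]

def forecast_partition(samples, horizon_days=90):
    """Requires the `prophet` package, so it is not executed in this sketch."""
    import pandas as pd
    from prophet import Prophet  # `from fbprophet import Prophet` pre-1.0

    df = pd.DataFrame(to_training_rows(samples))
    model = Prophet(
        changepoint_prior_scale=0.05,  # flexibility of trend change points
        weekly_seasonality=True,       # backup workloads often cycle weekly
    )
    model.fit(df)  # tolerates gaps in ds; no imputation required
    future = model.make_future_dataframe(periods=horizon_days, freq="D")
    # yhat plus yhat_lower / yhat_upper uncertainty bounds per future point.
    return model.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

rows = to_training_rows([("2019-01-01", 1000), ("2019-01-02", 1100)])
```

The automatic change-point handling and tolerance of missing points are the properties that make a per-series manual workflow unnecessary at 70,000-series scale.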
Marcello Tomasini: 00:21:29.544 Okay. So then we said we have to validate that our model actually performs. In order to do that, we need to have some accuracy number, which is our error. There are different types of error measures for time series data. It is also important to keep monitoring the model as it runs in production; otherwise, we risk having an outdated model. And the cross-validation process for time series data is a little bit different from the usual cross-validation process in machine learning, because time series data is ordered by time, so you can’t take a random sample of the data. You have to take a contiguous partition of the data in order to train the model.
Marcello Tomasini: 00:22:21.521 So there is this procedure called Expanding Window Validation where, given the available historical time series, maybe one year of data, we start with a small amount at the beginning. We forecast, and we compare against the historical data so we can compute the error. Then we increase the data we use and forecast again, and we repeat this process until we have used all the available historical data. Another way to run this cross-validation process is Sliding Window Backtesting. Instead of expanding our training data, we just use a fixed-size window, and we forecast and compare against the available historical data. This second procedure is more useful when, for example, we know that the process that is generating the data is slowly changing its characteristics, so we want to use only a recent window of data for the forecast. Otherwise, the forecast might take into account old data that is no longer relevant and then generate an invalid forecast.
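Both schemes share the same loop and differ only in how the training slice is chosen, which a short sketch makes concrete. A trivial “repeat the last value” forecaster stands in for the real model here, and all numbers are illustrative:

```python
# Expanding-window vs. sliding-window backtesting in one function:
# window=None trains on the full prefix; window=k trains on the last k points.

def naive_forecast(history, horizon):
    # Placeholder model: repeat the last observed value.
    return [history[-1]] * horizon

def backtest(series, initial, horizon, window=None, step=1):
    """Return per-fold mean absolute errors against held-out actuals."""
    errors = []
    cutoff = initial
    while cutoff + horizon <= len(series):
        if window is None:
            train = series[:cutoff]               # expanding window
        else:
            train = series[cutoff - window:cutoff]  # sliding window
        predicted = naive_forecast(train, horizon)
        actual = series[cutoff:cutoff + horizon]
        errors.append(sum(abs(p - a) for p, a in zip(predicted, actual)) / horizon)
        cutoff += step
    return errors

series = list(range(20))  # steadily growing storage, for illustration
expanding = backtest(series, initial=5, horizon=3)
sliding = backtest(series, initial=5, horizon=3, window=4)
```

With a real model, the two variants diverge: the sliding window forgets old behavior on purpose, which is exactly what you want when the generating process drifts.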
Marcello Tomasini: 00:23:39.026 This procedure has to be run for every time series, for every model, and for every set of hyperparameters we want to test. So it’s a very expensive procedure, but if we run it, we obtain something like what is shown here. We can see that for each partition we have a horizon, which is how far in the future we forecast, and we can generate a set of error measures. Here, there are links to papers that describe each error metric and its pros and cons. Here, I used MAPE, which stands for mean absolute percentage error. It’s the easiest to understand: it’s just the difference between the forecast data and the actual data, divided by the value of the actual data and expressed as a percentage.
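As a quick illustration, the MAPE definition given above is only a couple of lines:

```python
# MAPE as described: mean of |forecast - actual| / |actual|, as a percentage.
# It is undefined when an actual value is zero, so real pipelines guard for
# that or switch to a variant such as sMAPE.

def mape(actual, forecast):
    assert len(actual) == len(forecast) and all(a != 0 for a in actual)
    return 100.0 * sum(
        abs(f - a) / abs(a) for a, f in zip(actual, forecast)
    ) / len(actual)

error = mape([100, 200, 400], [110, 180, 400])  # errors of 10%, 10%, 0%
```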
Marcello Tomasini: 00:24:43.787 So as I said, there are thousands of series, and we have to run this procedure for each and every series. And then we might want to try different models. We settled on Prophet, but you might have an ARIMA model, exponential smoothing, an LSTM deep neural network, and so on. So it becomes very expensive to compute all this validation data. And since it’s computationally expensive, it might be run just as a one-off pass once a month to check the status of the system. Now, the problem is that if we run it as a batch job, the accuracy results are always outdated: they always lag behind what we have in production. And that might be a problem because we aim for the best accuracy. So the solution is to leverage InfluxDB in this case. What we do is compute the error online and save it as a time series. That’s why we use InfluxDB.
Marcello Tomasini: 00:26:00.898 So how do we do this online computation of the accuracy data? First of all, we have to save all the previous forecast data along with the historical data, because it is used to compute the error metrics. In the previous procedure, we were just iterating over each time series, but in this case, we have to store the data and then retrieve it at a later point in time to compute those accuracy metrics. And we save forecast data for each appliance and for each storage type. Also, since this data gets queried in the future, we have to take care of doing a multi-horizon forecast. That’s because with the sliding window or the expanding window validation process, we can repeat the process for multiple horizons. In this case, once the forecast has been computed, there’s no way to go back in the past and update the forecast with a different horizon, unless we actually rerun the forecast on the past data, and that would defeat the purpose of avoiding all the repeated computation.
Marcello Tomasini: 00:27:30.163 So then what we do is, as we compute the new forecasts, we take care of computing accuracy data based on the past forecasts and the past historical data. And that has a lot of advantages. It’s very computationally efficient because the computation is distributed over time. It enables A/B testing: you can run two models at the same time and keep track of their error time series to see which is performing better. And it also enables real-time monitoring: you can keep checking the error, and if the error, for some reason, spikes up, then you know that something changed in the underlying process generating the data, and you want to take a look at what’s happening.
Marcello Tomasini: 00:28:22.915 So how do we save the data in Influx? One solution is to have one measurement for all the historical data and one measurement for all the forecasts. As the tags, we set the type of the storage and the serial, which is the appliance identifier. And then as fields, we have the values of the data: Y for the historical data, and Yhat with its lower and upper bounds for the forecast. And you might also see here that we put both history_cutoff as a tag and Cutoff as a field; they both have the same value. The history_cutoff represents the timestamp of the last point in the history of the data used to train that specific forecast. So when we group by history_cutoff, we pretty much group by a specific forecast. And that is why I added this value.
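In InfluxDB line protocol, this first layout might look like the sketch below. The measurement and field names are illustrative guesses that mirror the names mentioned in the talk, not the actual Veritas schema:

```python
# First layout sketch: one shared "history" measurement and one shared
# "forecast" measurement, with type / serial (and, for forecasts,
# history_cutoff) as tags.

def history_point(storage_type, serial, value, ts_ns):
    return f"history,type={storage_type},serial={serial} y={value} {ts_ns}"

def forecast_point(storage_type, serial, cutoff_ns, yhat, lower, upper, ts_ns):
    # history_cutoff tags every point belonging to one forecast run, so
    # GROUP BY "history_cutoff" recovers individual forecasts; the same
    # value is duplicated into the Cutoff field, as described in the talk.
    return (
        f"forecast,type={storage_type},serial={serial},history_cutoff={cutoff_ns} "
        f"yhat={yhat},yhat_lower={lower},yhat_upper={upper},cutoff={cutoff_ns}i "
        f"{ts_ns}"
    )

line = forecast_point("data", "APP-0001", 1557500000000000000,
                      2.5e12, 2.4e12, 2.6e12, 1557600000000000000)
```

Note that because `history_cutoff` is a tag, every new forecast run creates a new series, which is precisely the cardinality problem discussed next.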
Marcello Tomasini: 00:29:45.703 So this approach is simple and very effective; it lets you bring up your system very quickly. The problem is, every time we need to compute the accuracy data, we have to query both the historical data measurement and the forecast measurement. And we also have a problem of cardinality in the measurements: the serial has a cardinality of about 10,000, the type has a cardinality of around 10, and the history_cutoff might grow to a very large cardinality, because if, for example, we run the forecast every day, then we will have a new value of history_cutoff every day.
Marcello Tomasini: 00:30:38.347 So in order to actually scale this approach, we moved to a solution where we have one measurement per appliance. Each measurement, in this case, contains both historical and forecast data. And here, we had to add a Boolean history flag as a tag, which is used to identify what is historical data and what is forecast data. This approach scales well; we don’t have the issue with cardinality anymore. The problem is that we might not want to retain all the forecast data, or we might want to retain forecast data for a short amount of time compared to historical data. In that case, we need to set up a retention policy specifically for forecast data, or we need to have a delete query running as a background cleanup process.
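A retention policy of the kind mentioned is a one-line InfluxQL statement. The database name and durations below are illustrative only; writes to forecast measurements would then target this policy while historical data stays in a longer-lived one:

```sql
-- Keep forecast points for 90 days; points older than that are dropped
-- automatically, replacing a hand-rolled background delete query.
CREATE RETENTION POLICY "forecasts_90d" ON "telemetry" DURATION 90d REPLICATION 1
```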
Marcello Tomasini: 00:31:48.186 Now, how do we lay out the accuracy data in Influx? Again, we start from a simple solution: all the accuracy data in one measurement. And we set as tags the storage type, again the serial of the appliance, and then the horizon. As I said before, we forecast for multiple horizons so that we can compute the accuracy metrics for multiple horizons. And the fields are the numerical values of the errors. This setup is very useful if we need to run cross-analysis. If you want to build the table I showed before, with all the error metrics and the error distributions across the appliances, this layout of data is very easy to deal with. However, the issue of cardinality that was already pretty bad in the forecast data is even worse here. And Influx has this limitation where the memory consumption does not scale very well with the series cardinality, so you might end up being limited or not even being able to query the measurement.
Marcello Tomasini: 00:33:25.082 So again, we moved to a solution where we split the accuracy data per appliance, so one measurement per appliance. And this scales very well. The problem is, if we now want to run a cross-analysis, it becomes a tedious process because we have to fetch data from all of the 10,000 measurements. Now, those types of cross-analysis are probably the kind of queries that are run once in a while and are used to show those values to some business people, so I would assume that type of workload can be run as a batch process offline. This is a trade-off we can accept. However, we still have one limitation, which is that we cannot track the model updates and the configuration.
Marcello Tomasini: 00:34:25.621 From this type of data, we don’t see what model generated those accuracy numbers and with what configuration, that is, the hyperparameters of the model that were used to generate the accuracy data. So before providing a solution for this last issue, let’s take a look at how we actually do the online tuning. As I said, we could run the sliding window or expanding window cross-validation for the tuning. That is a good one-off run to get the initial values of the hyperparameters of the model, but it is not enough to keep updating the model while it’s running in production so that it always has the best performance.
Marcello Tomasini: 00:35:25.649 So how do we do that? Here, we rely on a mathematical tool that helps us solve a host of problems. We have one model per time series, which means that we need to tune a model per time series — again, more than 70,000 models. I was saying in the beginning that we assume the process generating the data is not changing over time; that’s why we can use past historical data to forecast the future. However, this assumption is not really true. Usually, the process slightly changes over time, and we have to take that into account; otherwise, the performance of our model will drift. And then again, the backtesting procedure works well if it’s run one-off. It doesn’t really work well if we need to run it often.
Marcello Tomasini: 00:36:32.636 So we use sequential model-based optimization, a mathematical tool that works very well. It is made of three components. It has an objective score, which is our error. It could be a [inaudible] error; it could be the MAPE. This is what we want to minimize. Then we have a surrogate function. This function tries to model, given your model F and a set of hyperparameters X, what the performance of the model will be with that set of parameters, so that we can use it in combination with what we call a selection function. The selection function is nothing but a function that gives you the next set of hyperparameters X to try, and we repeat this procedure in iterations, trying to minimize the objective score.
Marcello Tomasini: 00:37:51.793 And this is what it looks like. So we have this surrogate function. At the beginning, all the hyperparameters are equally probable, because we don’t have any prior knowledge of the error. There is some function of the error — that is the one shown with the yellow dashed line. And what we’re trying to do is find the set of hyperparameters that minimizes our error. So here, we sample: we start from an initial set of parameters, we check the error, and we update the surrogate function. The surrogate function tends to converge towards the actual error function, and after some iterations, we end up selecting the set of parameters that has the smallest error. We can run this procedure online while we’re doing the forecast, because we have the past forecast data and the past historical data, so we can compute the error, update the surrogate function after every run, and then get a new set of hyperparameters.
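The objective / surrogate / selection loop can be shown with a deliberately toy implementation. Real systems use Gaussian processes or tree-structured Parzen estimators (e.g. via the skopt or hyperopt libraries); here the surrogate is just a nearest-neighbor estimate of the error surface with a small exploration bonus, and the objective is a made-up one-hyperparameter error curve:

```python
# Toy sequential model-based optimization over a single hyperparameter.
import random

def objective(x):
    # Stand-in for the observed forecast error as a function of one
    # hyperparameter; the true minimum sits at x = 0.3.
    return (x - 0.3) ** 2

def smbo(n_iter=30, seed=7):
    rng = random.Random(seed)
    x0 = rng.random()
    evaluated = [(x0, objective(x0))]          # (hyperparameter, error) history

    def surrogate(x):
        # Crude model of the error surface: the score of the nearest
        # evaluated point, minus a bonus that rewards far-away regions.
        dist, score = min((abs(x - px), ps) for px, ps in evaluated)
        return score - 0.1 * dist

    for _ in range(n_iter):
        candidates = [rng.random() for _ in range(20)]
        x = min(candidates, key=surrogate)     # selection function
        evaluated.append((x, objective(x)))    # observe the true error once
    return min(evaluated, key=lambda t: t[1])

best_x, best_err = smbo()
```

The point mirrored from the talk is that each iteration needs only one fresh error observation, which is exactly what the online accuracy time series in Influx provides after every forecast run.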
Marcello Tomasini: 00:39:22.350 So how do we actually do that from a data point of view? How do we lay out the data in the Influx database to enable this sequential model-based optimization? Here, we go back to the accuracy data. We add two or more fields. One field is the model type, just to identify what model was used to generate that accuracy data. And then we could add the whole set of hyperparameters used to generate that accuracy data. Now, this has the same problems as before for the cross-analysis, but there are more problems introduced by the fact that the hyperparameters are specific to each model, which means that if we are switching models often, we have an explosion of the cardinality of the accuracy data. Also, some models have a lot of hyperparameters. For example, neural networks have tons of parameters, and it’s not realistically feasible to store all of them as tags. So there is no simple solution for that; we can’t rely completely on Influx to solve these issues. Instead, what we did was to externalize the hyperparameters.
Marcello Tomasini: 00:41:10.988 So we replaced the model type and the set of hyperparameters with a sort of hyperparameters identifier. This identifier is nothing but a foreign key to a document-based store. In our case, we used ArangoDB; it could be MongoDB, it doesn’t matter. And in each document, we have the model type and all the parameters of that model. So when we query this data, the hyperparameters document ID is actually identifying both the model and the set of parameters that were used to generate that accuracy data. This works very well. And whenever our sequential model-based optimization process retunes a model, we can just pass the new settings to the forecast algorithm, and it will automatically save the hyperparameters document ID associated with that model and set of parameters as part of the accuracy data.
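The externalization pattern can be sketched as follows. A plain dict stands in for the ArangoDB collection, and all field and measurement names are illustrative; the one load-bearing idea from the talk is that only an opaque document ID reaches Influx, so accuracy-data cardinality is bounded by the number of distinct configurations rather than by the hyperparameters themselves:

```python
# Externalized hyperparameters: the full parameter document lives in a
# document store; only its ID is written to Influx with each accuracy point.
import hashlib
import json

param_store = {}  # stand-in for the ArangoDB collection

def register_params(model_type, params):
    """Store the document once; a content hash makes the ID stable, so
    re-registering identical settings reuses the same document."""
    doc = {"model": model_type, "params": params}
    doc_id = hashlib.sha1(
        json.dumps(doc, sort_keys=True).encode()
    ).hexdigest()[:12]
    param_store[doc_id] = doc
    return doc_id

def accuracy_point(serial, horizon_days, mape_value, doc_id, ts_ns):
    # One accuracy sample in line protocol; hyperparams_id is a foreign key
    # stored as a string field, not one tag per hyperparameter.
    return (
        f"accuracy_{serial},horizon={horizon_days}d "
        f"mape={mape_value},hyperparams_id=\"{doc_id}\" {ts_ns}"
    )

doc_id = register_params("prophet", {"changepoint_prior_scale": 0.05})
point = accuracy_point("APP-0001", 30, 4.2, doc_id, 1557600000000000000)
```

Resolving an accuracy series back to its model and settings is then a single key lookup in the document store.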
Marcello Tomasini: 00:42:35.636 So in conclusion, we saw that InfluxDB is a very powerful tool to enable forecasting at scale. We carefully chose a setup that optimizes IOPS and memory, and we increased at least the read throughput by using an HA setup that is freely available and open source online. InfluxData has its own high-availability setup with the Enterprise offering; that is the one that is probably recommended if you’re using an HA setup for actual resiliency and not just for throughput. We also took care of the issue of series cardinality; at the moment, it’s probably the main factor to take into account when dealing with time series data and InfluxDB. And we provided a basic data layout that can be used to support online accuracy evaluation and algorithm tuning for time series forecasting.
Thom Crowe: 00:43:55.211 Thank you so much, Marcello. Thank you for attending. And we’ll see you in a couple of weeks.
Senior Data Scientist
Marcello Tomasini is a computer engineer and scientist interested in Machine Learning, Computer Security, Complex Networks, and Biology, with a Think Different lifestyle. Marcello holds a B.S. and an M.S. in Computer Engineering from the University of Modena and Reggio Emilia, Italy, and a Ph.D. in Computer Science from the Florida Institute of Technology, USA. He has several papers published in international peer-reviewed conferences and journals in the areas of mobile sensor networks, human mobility modeling, and machine learning. He currently works as a Sr. Data Scientist at Veritas Technologies, where he designed and developed the system reliability score and the storage forecasting algorithms implemented in Veritas Predictive Insights. His free time is a mix of gym/bootcamps, machine learning meetups, and traveling.