Webinar Date: 2018-09-18 08:00:00 (Pacific Time)
Go-Jek is a startup specializing in ride-hailing, logistics, and digital payments in Indonesia, with recent expansion into Vietnam, Singapore, and Thailand. They use InfluxDB for collecting and storing metrics from systems and applications. They use these infrastructure and business metrics for monitoring and alerting, gathering 55,153 points per second during peak times, all written into an InfluxDB instance. Under such a heavy load, they faced high memory and disk space utilization, and instead of scaling the InfluxDB cluster horizontally, they solved the disk space problem by downsampling their metrics data.
In this webinar, Aishwarya and Anugrah from Go-Jek share how they used downsampling in InfluxDB to solve their disk space problem. They also share how they used InfluxDB and Grafana to build their monitoring solution, a solution that saved them from downtime, rising machine costs, and countless pagers buzzing in the night forcing them to burn the midnight oil to address performance issues. They also talk about how they automated this solution using Chef and Terraform for all the InfluxDB and Grafana instances.
Watch the webinar “InfluxDB Downsampling to Avoid Burning the Midnight Oil” by filling out the form and clicking the download button on the right; this will open the recording.
Here is an unedited transcript of the webinar “InfluxDB Downsampling to Avoid Burning the Midnight Oil”. This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw; we apologize for any transcription errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Aishwarya Kaneri: Product Engineer, Go-Jek
• Anugrah S.: Product Engineer, Go-Jek
Aishwarya Kaneri 00:00:00.000 Thank you, Chris. Hello, this is Aishwarya. I work as a product engineer at Go-Jek in infrastructure dev.
Anugrah S. 00:00:08.176 Hi, I’m Anugrah. I work as a product engineer, mainly concentrating on the food delivery backend at Go-Jek. At Go-Jek, we have multiple services like ride hailing, food, and under that, Go-Pay, the payment service. There are a lot of services under one app, and we have a lot of microservices, so collecting metrics becomes an issue. We use InfluxDB mainly for the collection of metrics: business metrics, application metrics, and system metrics. So here we will be discussing mainly how we used InfluxDB downsampling to avoid burning the midnight oil. These are the contents, the order in which we are going to go through the topic. First we will start with how we use InfluxDB at Go-Jek. Then we will discuss our monitoring and alerting architecture, followed by the problems we encountered with that architecture. After that, we will discuss how we tried fixing the issues through minor and temporary hard fixes, how those did not actually solve the problem, and how we had to do a major fix to solve it. Then we will discuss what exactly data downsampling is, followed by the procedure for data downsampling and the automation of data downsampling, since automation is needed for any solution at our scale. Without that, it’s not possible [inaudible].
Anugrah S. 00:01:49.362 Then we will discuss the issues faced during data downsampling. Afterward, we will show you the Grafana dashboard, and then we will conclude the talk. So yeah.
Aishwarya Kaneri 00:02:05.486 So let’s see how we use InfluxDB at Go-Jek. This is the monitoring and alerting architecture we are using. Telegraf runs on all the VM instances, and it sends system, application, and business metrics to InfluxDB. Kapacitor queries InfluxDB and checks whether the alerts we have set are crossing their thresholds, and it sends those to the Beacon Service, which we have written. The Beacon Service decides which team each alert should go to, finds that team’s Slack channel, and sends the alerts over Slack, and also via PagerDuty. Then we use Grafana as our dashboard, and Grafana queries InfluxDB to get all the dashboard data. In this architecture, all the read and write queries first hit the load balancer; the load balancer sends all the write queries to InfluxDB relay, and all the read queries are sent directly to InfluxDB. We are using two InfluxDBs here for reliability, and every team has been given this entire setup: the load balancer, the two relays, and the two InfluxDBs. So all the write queries are sent to the InfluxDB relay, and the relay forwards them to the two InfluxDBs.
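On the Telegraf side, the write path just described (Telegraf → load balancer → relay → the two InfluxDBs) would be wired up roughly like this; the hostname, port, and database name below are illustrative, not Go-Jek’s actual values:

```toml
# telegraf.conf (sketch): each VM's Telegraf writes to its team's
# load balancer, which forwards writes through influxdb-relay to
# both InfluxDB instances. Hostname, port, and database are hypothetical.
[[outputs.influxdb]]
  urls = ["http://team-metrics-lb.internal:8086"]
  database = "telegraf"
```

Reads (Grafana, Kapacitor) would point at the same load balancer, which routes them straight to one of the InfluxDB instances.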
Anugrah S. 00:03:47.143 Going forward, we’ll discuss the problems we encountered with that architecture. We had shown the architecture diagram. One of the main issues we faced was high memory usage. As we discussed earlier, at Go-Jek we have multiple services, and in operation each service has a bunch of microservices. Each and every service has its own set of system metrics: CPU usage, process checkers, disk I/O, network I/O, bytes out and bytes in. And many people actually use it for collecting business metrics: we push out a feature to a bunch of users and we have to see where the feature is actually being adopted, how many people are facing issues with it, how much [inaudible] they are getting, whether it’s a successful feature or not, whether the user actually likes it or not. We base our decision making on that. So all these metrics come in from multiple services, we send them over to InfluxDB, and then we show them in [inaudible] Grafana.
Anugrah S. 00:05:07.817 This becomes an issue, because as more and more series and measurements keep getting created, reads keep getting slower. And at that point, we were using an older version of InfluxDB [inaudible] on TSM.
Aishwarya Kaneri 00:05:33.244 So whenever new data is entered into InfluxDB, it gets added to the disk as well as to the in-memory structure, which is TSM.
Anugrah S. 00:05:46.618 So as more and more series data came in, we started realizing we had a lot of issues with memory and with disk usage as well. We monitor our own monitoring architecture too, and we were getting a lot of PagerDuty alerts during the night, and yes, the burning of the midnight oil started for us. So we started fixing it, because we realized it was going to be a concern. We started putting a limit on the number of series and measurements, which was met with a lot of skepticism, because it had become a successful product for a lot of people to dump their business metrics into, and people were not happy with this approach. Then we started giving each team its own InfluxDB and its own InfluxDB credentials, along with Grafana, Kapacitor, and the other components, so that teams could set things up and scale on their own and we would not have operational issues.
Anugrah S. 00:06:50.851 And we started increasing disk sizes so that people did not face any issues with disk space, and we also did a lot of increasing of memory. We tried both vertical and horizontal scaling approaches, but we realized after a point that neither actually solved the problem; it was not helping much. So now we come to the major fix. After facing this issue and reading a lot of blog posts about the kind of problem we were solving, we realized that the main solution we could go with was data downsampling, which would solve the problem for us. So we’ll discuss what exactly data downsampling is. As you can see over here, the yellow line is the actual waveform, the actual collection of metrics. You can see HAProxy metrics being collected here: how many requests HAProxy is actually getting. That’s what the yellow line represents, and the green one is the downsampled data.
Anugrah S. 00:08:12.226 So I’ll explain what exactly downsampling is using this graph. If you look at the interval from 16:50 to 17:00, a 10-minute interval, you can see there are close to 10 points collected here, each representing a different end of a spectrum: one of them is 7.2k requests, and the other end is 7k requests. What we realized is that between 16:50 and 17:00, all of these points could be represented by the mean of the metrics collected in that 10-minute period; we’re not losing much information when we do this. This is what’s called downsampling. Downsampling is the process of changing the waveform or structure of the wave without losing much information in the process.
Anugrah S. 00:09:19.656 So you can see that the number of points we are keeping here has been reduced from 10 to two: one over here and one over here. We aggregate over a time period of five minutes, represent that window by its mean, and then write that into the downsampled [inaudible].
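The aggregation described above can be sketched as a plain InfluxQL query; the measurement and field names (haproxy, requests) are illustrative, not the exact ones from this setup:

```sql
-- Collapse raw 10-second points into one mean per 5-minute window
-- (measurement and field names are hypothetical)
SELECT mean("requests")
FROM "haproxy"
WHERE time >= '2018-09-18T16:50:00Z' AND time < '2018-09-18T17:00:00Z'
GROUP BY time(5m)
```

Each row of the result stands in for all the raw points that fell into its five-minute bucket.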
Aishwarya Kaneri 00:09:40.703 The procedure for downsampling includes two steps. The first is the retention policy. We have to create a retention policy, which says how long you want data to stay in InfluxDB. Our normal data had a retention policy of one month, and we were collecting this data at a rate of one data point per 10 seconds. We wanted to reduce the retention policy to two weeks, but the issue was that the teams wanted to visualize the data over a timespan of at least one month, so they didn’t want us to reduce the retention policy of the normal data to two weeks. What we did was downsample the data and keep a retention policy of one month (that is, four weeks) on the downsampled data, but reduce the normal data’s retention policy to two weeks. That way the teams can view the normal data at 10-second resolution, and they can also get an idea of how the data looks over one month by visualizing the downsampled data.
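As a rough sketch, the two retention-policy changes described here look like this in InfluxQL; the database name mydb and policy name downsampled are hypothetical:

```sql
-- Shrink the default policy holding the raw 10-second data to two weeks
ALTER RETENTION POLICY "autogen" ON "mydb" DURATION 2w

-- Create a four-week policy to hold the downsampled data
CREATE RETENTION POLICY "downsampled" ON "mydb" DURATION 4w REPLICATION 1
```

`SHOW RETENTION POLICIES ON "mydb"` would then list both policies with their durations.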
Aishwarya Kaneri 00:10:57.449 The second step in data downsampling is the Continuous Query. This is a periodic query that runs inside InfluxDB; it does the data downsampling and inserts the results into the measurement under the downsampled retention policy. We automated the data downsampling as well, using Chef cookbooks. We did this automation because we install Telegraf and the entire infrastructure for every team, and there are multiple instances of InfluxDB. We want to do data downsampling in each InfluxDB of every team, and we can’t do it manually one by one because there are so many instances of InfluxDB; that’s why we needed to automate this, and we used Chef cookbooks for it. The first step is that we wrote a data downsampling recipe, which uses the data downsampling resource I will show you in the next slide. Its first step was to reduce the retention policy of the normal data, which for us is called autogen, from one month to two weeks. The second step was to add a new retention policy, to be attached to the downsampled data, of four weeks.
Aishwarya Kaneri 00:12:34.847 Secondly, because we have two InfluxDBs running for every team, we want to start this data downsampling at the same time on both, so that the downsampled data remains in sync in both the DBs. That’s why we created a [inaudible] for this and scheduled it to run at midnight when this recipe runs for the first time on every InfluxDB box. There is also a Continuous Query script that runs the continuous query recipe. In the data downsampling resource, we first create the retention policies, for which we are using the InfluxDB Ruby library: the first one is for the autogen retention policy, and the second one is for the downsampled retention policy. The second recipe is for the continuous query. In this query, we aggregate the data using mean: we are doing SELECT mean(*) from all the measurements of the database under the autogen retention policy, which represents the normal data.
Aishwarya Kaneri 00:14:02.922 And we are grouping by a time of five minutes, so the result of this query is dumped into the same database, but under a different retention policy, the downsampled retention policy. All the measurement names remain the same, so after downsampling, for a given measurement, a new retention policy is created and the downsampled data is dumped under that policy. The only change is that the column names of that measurement get prepended with mean_. This was one major drawback; this issue has been faced by multiple people, and there is a very old open issue in InfluxDB for it. So this is how we ran the Continuous Query. Now, the issues we faced during data downsampling: the first one was that as soon as the Continuous Query was [inaudible], it started increasing the load on InfluxDB, because it is a periodic query that runs on the given time interval of five minutes. This increasing load made InfluxDB quite slow.
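The continuous query described here might look like the following in InfluxQL; the database and policy names are hypothetical, and mean(*) is what produces the mean_-prefixed field names just mentioned:

```sql
-- Downsample every measurement under "autogen" into the "downsampled"
-- policy, keeping measurement names via :MEASUREMENT and tags via GROUP BY *
-- (database and policy names are hypothetical)
CREATE CONTINUOUS QUERY "downsample_all" ON "mydb"
BEGIN
  SELECT mean(*)
  INTO "mydb"."downsampled".:MEASUREMENT
  FROM "mydb"."autogen"./.*/
  GROUP BY time(5m), *
END
```

The `/.*/' regex on the FROM clause is what lets one CQ cover all measurements in the database.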
Aishwarya Kaneri 00:15:32.809 So that was one issue, which we fixed by adding a config for this five-minute continuous query, decreasing the frequency at which it runs. The second one was that we had database names containing hyphens, and we didn’t want to rename them across all the InfluxDB instances because that would change our legacy infrastructure. Hyphens weren’t supported in the InfluxDB Ruby client, so we added a PR for the support of hyphenated database names, and after it got [inaudible] we were able to use hyphens in our database names. The third issue was the downsampled data of a measurement having mean_ prepended to the field names. We had all the existing Grafana dashboards, which used the original column names, but for the downsampled data the column names had changed, with mean_ prepended, so we couldn’t use those dashboards for it. And we didn’t want all the teams to create new dashboards for this. That’s why we chose the regex approach: we added a regex so that a team can just select whether they want to view the downsampled data or the normal data, and depending on what they select, they will see the graph with the corresponding data.
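A sketch of the regex workaround: a single Grafana panel query can match both the raw field name and its mean_-prefixed counterpart, so the same dashboard works for either retention policy. The field, measurement, and variable names below are illustrative; $timeFilter and $__interval are standard Grafana macros for the InfluxDB data source:

```sql
-- The field-key regex matches "requests" under autogen and
-- "mean_requests" under the downsampled policy.
-- "$retention_policy" is a hypothetical dashboard template variable.
SELECT mean(/^(mean_)?requests$/)
FROM "$retention_policy"."haproxy"
WHERE $timeFilter
GROUP BY time($__interval)
```

For the load issue mentioned first, InfluxDB 1.x also lets a continuous query run less often via the RESAMPLE EVERY clause, or globally via the run-interval setting in the [continuous_queries] section of influxdb.conf.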
Anugrah S. 00:17:15.532 After this, we’ll show the downsampled data on the Grafana dashboard along with the original data. This was the original data, so you can see HAProxy requests continuously being collected over here.
Aishwarya Kaneri 00:17:36.783 And we have added the regex over here for the mean_ prepended to the column name, and here the retention policy selected by the team is autogen, so right now they are seeing the normal data in the dashboard.
Anugrah S. 00:17:50.159 Now if you go from here and take a look, this is actually after the retention policy change, so what is happening is that this is not live data, this is the older data, and it has a retention of the last four weeks. You can see this is actually the last four weeks, and we have downsampled the data, so people see it at five-minute intervals.
Aishwarya Kaneri 00:18:13.559 So without asking all the teams to make major changes in their dashboards, we just changed the regex, and depending on which retention policy the team chooses in the Grafana dashboard, they see the corresponding data in the same dashboard.
Anugrah S. 00:18:32.664 So now we come to the conclusion. We saw a significant reduction in memory usage and disk usage, as explained earlier. And automation actually helped us scale the solution to multiple InfluxDB clusters without dealing with a lot of errors and other issues, which is one of the plus points of having automation. And, yeah. No more sleepless nights, so that’s pretty much it.
Aishwarya Kaneri 00:19:00.439 Thank you.
Anugrah S. 00:19:01.687 Thank you. Thank you for having us.
Chris Churilo 00:19:03.903 So I’m going to open up to questions, so if anybody on the call has a question please put it in the Q&A in the chat panel. So let’s take a few steps back and talk about what was happening before you implemented this. What solutions did you guys use, what wasn’t working, what was actually keeping you guys up at night?
Aishwarya Kaneri 00:19:28.201 Most of the problems we faced were that a lot of business metrics, as well as application and system metrics, were being pumped into InfluxDB, so the disk utilization of the DB was increasing continuously. And as new data enters InfluxDB, it also enters InfluxDB’s in-memory structure, the TSM model, so the memory usage also increases. We used to get paged for this, and the disk was never sufficient for all the business metrics. So whenever we got a pager, we would increase the disk size, resize the disk, or have to increase the memory size as well.
Chris Churilo 00:20:15.737 Which I guess is just the easy way out, right? That’s just like, “Oh, I’ll just—”
Aishwarya Kaneri 00:20:19.242 [crosstalk], best way to fix this—
Chris Churilo 00:20:22.149 So then you’ve been using InfluxDB for quite some time then.
Anugrah S. 00:20:26.579 Yeah.
Chris Churilo 00:20:28.477 And then it sounds like—so did you also, and this is something that we’ve all done, considered just turning off the alerts? Which is not ideal, but—
Anugrah S. 00:20:37.779 What would happen if we turned off the alerts is that we would not actually be able to react. A lot of people depend on InfluxDB for business metrics, and we have put our alerting stack on top of that, so if one service goes down [inaudible]; there are a lot of microservices over there, and a lot of VMs as well. So if one of the services goes down and we’re not getting alerts, and it’s not actionable in the first place, then it becomes an issue, so we cannot turn off the alerts. And a lot of the business decision making actually depends on the business metrics we send here: if you need to put [inaudible] for a bunch of segmentation, how it actually looks, whether the [inaudible] is giving the results we require, or whether people are actually happy with the feature that was churned out last week. These kinds of metrics aid us in decision making, so we cannot turn off the alerts. That’s not possible for us.
Chris Churilo 00:21:39.377 Okay. So we have a couple of questions that have come in, so we can we switch over to the architecture slide? So Gregory Fass asks: “Is your Influx architecture with the two relays and two Influx servers replicated, is that a homemade solution, or is that an enterprise cluster of two?”
Aishwarya Kaneri 00:21:56.002 No, we have written a cookbook for this, and this is a homemade solution. Whenever a team requests this architecture, we have a tool, like a command; we run that command and give the team name, and just through that command all of this architecture gets created. We are also using Terraform for automatically creating the architecture: when that command runs, one load balancer, the two relays, and the two InfluxDBs are created. Telegraf also gets installed, along with the other plugins we use with Telegraf, and then that setup is ready to use. And whenever a VM instance is created, depending on the team name, Telegraf is again installed inside that VM instance and its endpoint is decided based on the team name. That logic again goes into the cookbook, so the instances belonging to one team send all their metrics to the corresponding InfluxDB [inaudible] of that team.
Chris Churilo 00:23:16.066 Cool. And then I think the question tied to that from Gregory is, “Have you switched to the TSI storage engine, and if so, what have you found in regard to InfluxDB’s memory, CPU, and disk usage?”
Aishwarya Kaneri 00:23:31.830 Yeah. When we were implementing this data downsampling, the version of InfluxDB we were using was 4.1, and we saw that the newer versions, 5.0 and above it, the sixth version, have switched to the TSI model. What happens in the TSI model of InfluxDB is that whenever new data comes in, it gets written to the disk, but it is not written into the in-memory structure of the database. So as the number of series and measurements on disk increases, memory utilization still remains low. But there is one drawback: when the dashboards are loading, it takes more time, because there is no in-memory cache present and it has to query the disk again and again to get the data. So we saw that the dashboards were quite slow, and we have some very old InfluxDBs for which, if we switched to the TSI model, we would have to make a lot of changes and there was a risk of losing data, so we didn’t go with the newer versions. But we are planning to spike that out again in some other user stories in the future.
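For reference, switching a 1.5+ instance to the TSI index is a config change plus a shard conversion; this is only a sketch, and migrating old shards carries exactly the risk described above:

```toml
# influxdb.conf: use the disk-based TSI index instead of the
# default in-memory index
[data]
  index-version = "tsi1"
```

Existing shards then need to be converted offline with the `influx_inspect buildtsi` tool before they pick up the new index.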
Chris Churilo 00:25:08.278 So let’s just make sure we understand the version numbers. So you’re saying that you guys are still using 1.4 InfluxDB or—because you said 4.1, and I just want to make sure we get the version numbers correct.
Aishwarya Kaneri 00:25:21.260 Yeah, it’s 4.1.
Chris Churilo 00:25:24.764 Okay. Actually, we’re currently at 1.6, so maybe the numbers are just transposed.
Aishwarya Kaneri 00:25:38.008 Yeah, then it’s 1.4. I got confused with that. But yeah, it is the older version, 1.4, and I think for InfluxDB the latest version is 1.6 then, and in 1.5 and 1.6 the TSI model is supported.
Chris Churilo 00:25:57.047 So Gregory says, “Thanks for answering my questions. I really appreciate it, super helpful presentation.” Thanks Gregory. So Georgio asks, “Hello. I didn’t understand the method you used in Grafana to use the choice of retention policy transparent to the end user.”
Aishwarya Kaneri 00:26:12.896 So in Grafana, we have added templates on top of the dashboard.
Anugrah S. 00:26:18.570 [inaudible].
Aishwarya Kaneri 00:26:20.289 Yeah. So on top of the dashboard, there are templates; Grafana provides a way to add templates so you need not hard-code any values inside your query. For example, there’s a server variable over here; this is actually coming from the template. So above the dashboard, you create the template—
Anugrah S. 00:26:41.852 [crosstalk]
Aishwarya Kaneri 00:26:44.335 Yeah, so we can show you the actual dashboard as well. But here in the template, you select whichever server you want, and that value goes into the server variable in the query. So the same graph can be used to visualize the HTTP requests for different servers. Similarly, we created a template for the retention policy, and we gave two options: one was autogen, and one was last four weeks. In the dashboard, the person just has to select which view they want to see, and depending on that, the values will come in here.
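As a sketch of what the templated query behind a panel might look like, with $server and $retention_policy as the dashboard template variables just described (all concrete names are illustrative):

```sql
-- Grafana substitutes the template selections into the query;
-- the field-key regex keeps it working whether the chosen policy
-- holds raw fields ("value") or downsampled ones ("mean_value")
SELECT mean(/^(mean_)?value$/)
FROM "$retention_policy"."http_requests"
WHERE "host" = '$server' AND $timeFilter
GROUP BY time($__interval)
```

Switching the retention-policy dropdown re-renders the same panel against the other policy without any dashboard edits.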
Chris Churilo 00:27:34.240 So did that answer your question Georgio? Okay, next question is, Georgio says, “thanks”. Okay, great. So [inaudible] asks, “Is this open source or Enterprise edition?”
Aishwarya Kaneri 00:27:53.909 Are you talking about the automation part or—
Chris Churilo 00:27:57.598 The InfluxDB. So this is definitely the open source version that they’re talking about. They are not using InfluxEnterprise.
Aishwarya Kaneri 00:28:05.413 Yes.
Chris Churilo 00:28:07.267 Okay, any other questions? Looks like you guys did a pretty good job of answering the questions, so when you guys started working on this project, it sounds like InfluxDB was already present. Do you know how or who decided to use InfluxDB, and what were some of the things that they were interested in?
Anugrah S. 00:28:31.680 It was mainly around time series. When we started using it, the main application was around system metrics and application metrics, but then we actually started adding more stuff, like Grafana and the alerts, on top of it. Then people started using it for other business metrics as well, which we [inaudible] on.
Chris Churilo 00:28:56.813 So can you give me some examples of some business metrics that are really good indicators that something’s wrong with the infrastructure?
Anugrah S. 00:29:04.130 No, the business metrics indicate whether a person actually likes a feature or not. Let me give you an example. I think last month, or a couple of months back, we had a feature adding catalog management to our restaurant-facing interface. We do food delivery, and the external-facing interface has catalog management. The person in charge uses the catalog management for multiple things, like uploading [inaudible]. They may just send [inaudible], and they may face network issues, backend issues, or something similar, or there may be a lot of requests coming in. For these kinds of features, we have pushed out the feature, but we need to make sure it’s actually useful for the end user, because we are delivering it to them. So we have to see the number of people who are actually using it, how many of them are using it correctly, how many of them were successful, and how many were unsuccessful, so that we can make sure the end user is actually happy with what we are delivering.
Anugrah S. 00:30:04.557 Then again, we know there are actually a lot of people who suffered because of timeouts, or because they were uploading a huge image, say something like 2GB or 3GB, and there might be cases where people abuse the ability to upload images by uploading 2GB of something that is not actually an image. All these things become valuable; these are business indicators. They say whether a feature is actually usable for the end user or not. So right now, we use the [inaudible] architecture for collecting the metrics around this as well, and making this [inaudible] stack. If a feature is not being used, say a few thousand users have it turned on but people are only using it once or twice a day (which is not the case, but hypothetically speaking, if it’s not useful), then either we don’t need any more active development on the feature, or we might actually need to add more features so that people use it. We need to [inaudible] what would make a person use it: is the feature actually useful, or should we turn it off? That’s what we mean by delivering the business metrics.
Chris Churilo 00:31:17.593 Yeah, and it makes you smart in your—
Anugrah S. 00:31:20.219 [crosstalk].
Chris Churilo 00:31:20.978 Yeah, in your engineering planning, right? Because there’s nothing worse than making a very emotional decision versus saying, “Oh, here’s some data to prove that nobody’s using it because it’s too hard to use, or nobody’s using it because it’s useless.” Actually seeing the data to help push that, I think, is really important.
Anugrah S. 00:31:41.535 Which is crucial for us, because we are a fast-moving organization; the decision making needs to be extremely fast and backed with data. We cannot rely on hunches anymore. We can’t. So, yeah.
Chris Churilo 00:31:53.029 And then your application is a mixture of a smartphone app and a web application, right?
Anugrah S. 00:31:59.857 No, we don’t use a web application, [crosstalk].
Chris Churilo 00:32:01.681 Oh, okay.
Anugrah S. 00:32:02.922 We have a web application, but that is specifically for the enterprise users; it’s not for the customers. We have a kind of B2B application that’s web-based, and for discovery, of course, there are web applications, but it’s mainly an app. It’s not really a web app.
Chris Churilo 00:32:25.348 Okay, cool. So we’ll keep the lines open for a few more minutes if anybody has any questions. In the meantime, I just want to remind everybody that this is being recorded, and after I do a quick edit I will post it. And we’ll just continue to hang out here and just kind of chat a little bit more until we hear any other questions. So have you guys taken a look at any of the other new features that have come along with InfluxDB, or Telegraf? Have you started looking at Kapacitor or have you looked at the new Telegraf syslog plugin? Are there any other things you think might be useful that you might consider far down the road?
Aishwarya Kaneri 00:33:12.002 So we are using the syslog plugin of Telegraf, and Kapacitor was already—we were using it for quite a long time. Grafana, Kapacitor, and InfluxDB, but after we started facing the issues of high memory and disk utilization, that’s when we explored the solution of data downsampling, so yeah, that’s it.
Chris Churilo 00:33:37.895 And so with the Telegraf syslog plugin, what do you guys use it primarily for, and have you had any success in finding issues in your environment? This is a fairly new project; I’m just always curious to hear from our users.
Anugrah S. 00:33:55.649 Telegraf is—
Chris Churilo 00:33:57.465 [crosstalk]. The syslog Telegraf plugin.
Anugrah S. 00:34:01.506 Syslog…
Chris Churilo 00:34:05.622 Oh, we got a response back from Sarvana. He says, “Thanks for taking the time for explaining problems and solutions, much appreciated.” So maybe you guys aren’t using the syslog plugin but you’re just using a bunch of the standard Telegraf plugins, which is completely fine.
Aishwarya Kaneri 00:34:25.397 All applications log; it is like a standard [inaudible]. And we have added a Telegraf plugin to get all these syslogs and push them to the corresponding InfluxDB.
Chris Churilo 00:34:45.129 Okay.
Anugrah S. 00:34:45.326 [inaudible] plugin is—Okay—[inaudible].
Chris Churilo 00:34:52.044 So do you guys have any advice for anyone who might be new to InfluxDB? You went through quite a journey, and you’re now in a happy spot with InfluxDB, but if you were to start all over again, what kind of advice would you give yourself that might be useful to our audience today?
Aishwarya Kaneri 00:35:14.165 One piece of advice: if I were starting a project now, I would switch to a newer version of InfluxDB using the TSI model, because right now, since we have so much data in our database, it’s becoming risky as well as difficult to migrate to the TSI model from the TSM model. That is one major change I would make if I were starting a new project right now.
Chris Churilo 00:35:46.775 So Georgio actually asked, since you mentioned that you have so much data, how many series are you currently managing in your infrastructure?
Aishwarya Kaneri 00:35:56.241 So every second we were getting 55,000 data points. We have since put in a config setting a maximum limit on the number of series, which is right now 20,000, but there were a few teams sending a lot of series—the highest was one team sending more than 40,000 series. We had to put that limit in place because the DB was breaking constantly for that team. After we put the series [inaudible] in place, they went back and checked what kind of logs they were sending. A team shouldn't be sending all sorts of logs—they should not just dump all their logs in the database. The logs should be useful.
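[Editor's note: the per-database series cap described here corresponds to the `max-series-per-database` setting in InfluxDB 1.x; once the limit is reached, writes that would create new series are rejected with an error. The exact value below mirrors the limit mentioned in the talk; treat it as illustrative.]

```toml
# influxdb.conf — cap series cardinality to protect the database,
# as described above. Writes creating new series beyond the cap fail.
[data]
  max-series-per-database = 20000
```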
Chris Churilo 00:36:50.034 Yeah, that’s actually a really good point that you bring up. It’s really easy for anybody to say, “Oh, let me just throw everything over there,” but there’s no point, right, if you’re never going to look at some of that stuff, why do that?
Aishwarya Kaneri 00:37:03.349 Yes.
Chris Churilo 00:37:04.305 So Georgio, hopefully that answered your question. And I think—oh, here we go. So he just wants to make sure that he understands, so 55,000 data points per second? Did I understand that well? Georgio asks.
Aishwarya Kaneri 00:37:21.159 Yes.
Chris Churilo 00:37:22.465 Okay. Georgio, do you want—he says thanks. If there’s any more clarification that you want, now is the chance to ask that…And then Anugrah, what advice would you give to the audience or to your former self if you were starting over?
Anugrah S. 00:37:44.287 I would say that you need to do a lot of research first, before actually going for hard fixes. A hard fix is not the best approach to deal with a problem. That's the only advice, I think. I've been happy with InfluxDB. [inaudible] that's a valid [inaudible], whatever I should have said makes much more sense. But the only advice I would like to give is to read well and decide on an actual solution rather than doing a hard fix. It never works in the long run.
Chris Churilo 00:38:20.467 Yeah, no, that’s really good advice. All right, looks like we don’t have any other questions. Do you guys have any other last thoughts before we end today’s webinar?
Anugrah S. 00:38:33.898 Yeah, any feedback for us? This is our first webinar. We have never done this before.
Chris Churilo 00:38:38.851 You guys did just fine. And if you guys do have questions afterwards, feel free to send me an email. I’d be more than happy to send it to our speakers today, and as I mentioned, I will do a quick edit of the recording and post it so you can take another listen to it. Georgio says, “Thanks guys, very nice webinar,” and I agree. You guys did a very nice job today.
Aishwarya Kaneri 00:39:04.113 Thank you.
Chris Churilo 00:39:04.113 And with that, I think we'll end our session a little bit early—short but very rich content today—and we really appreciate your time. So thanks for sharing your use case with us today, guys.
Aishwarya Kaneri 00:39:19.117 Thank you.
Chris Churilo 00:39:20.668 Thank you, and everybody have a pleasant day and we’ll see you again next time. Bye-bye.
Anugrah S. 00:39:27.080 Thank you. Bye-bye.