How and Why Gravitational Uses InfluxDB to Monitor Kubernetes

In this webinar, Sasha Klizhentas from Gravitational.io will share how they chose and use InfluxData to help them monitor the Kubernetes clusters that they offer in their SaaS offering.

[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]

Here is an unedited transcript of the webinar “How and Why Gravitational Uses InfluxDB to Monitor Kubernetes.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.

Speakers:
- Chris Churilo: Director Product Marketing, InfluxData
- Sasha Klizhentas: CTO, Gravitational

Chris Churilo 00:00:08.819 Okay. Let’s go ahead and get started. Thank you for joining us today at our webinar. Today, we have Sasha from Gravitational, who’s going to be describing what they do at Gravitational as well as their experience with InfluxData, and how they use it to help them monitor Kubernetes. So with that, I will just go ahead and turn it over to Sasha.

Sasha Klizhentas 00:00:31.693 Thanks, Chris. And thanks, everyone, for joining. My name’s Sasha. I’m the CTO at a company called Gravitational. And today, I’m going to tell you how InfluxData helps us to monitor Kubernetes, both its system components and the applications running on top of it. I want to tell you a little bit more about Gravitational. We’re a two-year-old infrastructure startup. We’re a Y Combinator 2015 company, and we help companies to deploy and manage complex applications across distributed infrastructure. I want to tell you a little bit more about that, so let’s take an example. Imagine a company, a SaaS company, a really popular one or one just starting. SaaS companies usually have a lot of microservices nowadays. They use a lot of databases, and as they gain popularity, there is always this moment in time when someone comes to them and says, “Hey, I really like your startup. I really like your company and service, but unfortunately, I cannot use you on the shared cloud infrastructure due to regulations or other security measures. So can you deploy this system-can you deploy your service in our data center or on our private AWS or Azure account?” So now, the company faces a choice. What do they have to do? If they choose to ignore it, they’re basically losing money, right? But if they accept the offer, they effectively end up with two different versions of their infrastructure deployed, right? And one of them, they don’t actually control, which is a really complex problem.

Sasha Klizhentas 00:02:09.812 Imagine if they agreed, right? There are more and more companies coming in, and suddenly, they’re faced with dozens of different deployments of their microservice architecture across the globe that they don’t have access to. So that’s the problem we help to solve, right? We help companies to solve this problem by onboarding them to Kubernetes, and then, by helping them to deliver this Kubernetes to all these different remote environments. So that is what Telekube does, right? And I will explain how InfluxData helps with monitoring Telekube and different components of it. Telekube is our multi-regional Kubernetes for deploying and managing those applications. So once we have deployed all these remote infrastructures on AWS or Azure, our customers can use our tool called Ops Center to remotely access them, if their customers give them access, or just to distribute updates if customers just want updates and don’t want anyone in their private data center, which sometimes doesn’t even have internet access in the first place. So if you think about what problem InfluxData helps us to solve, right, we end up with about 100 production clusters. These clusters are not really big, but there are a lot of them. The average footprint, I would say, varies from 3 to 20 nodes per cluster. We have a few one-node clusters just for demo purposes, but we don’t [inaudible] have to support them in any way except just pulling updates and stuff like that. So that’s the problem that we’re having. The number is approximate because, as I mentioned about the nature of our business, we don’t have a lot of access to those data centers, to those clusters. So actually, it takes a lot of effort from our side and from our customers’ side to keep those clusters running. That’s why the clusters should be really resilient, highly available, and generally engineered correctly both from the customers’ side and our side.

Sasha Klizhentas 00:04:04.079 So I’ll split this webinar into three parts, right? In the first part, I’m going to tell you about our journey in choosing InfluxData in the first place, the tech diligence steps we have taken and the databases we have evaluated. In the second and third parts, I’m going to show how we monitor Kubernetes with TICK Stack. We’ll try and do a live demo of breaking the cluster and getting an alert. And I will try to write an application, deploy it on top of Kubernetes, and then see how TICK Stack can help us to troubleshoot this application. So let’s go to the first part of the presentation, the Time Series Databases that we have been looking at. We have filtered out all SaaS applications, right, just because we deploy to these air-gapped environments where they cannot use SaaS, or they cannot use it in the first place due to industry regulations. So we have looked at primarily five or six different databases: Open Time Series Database, Prometheus, InfluxData, and a couple of others. So I wanted to highlight our pros and cons for all of those databases, to show you our thinking, and why we picked InfluxData after all these evaluations. The first one we looked at was Prometheus, right? And it’s not a coincidence because Prometheus is really popular in the Kubernetes ecosystem. It is also a really performant one. The benchmarks show that it supports up to 800,000 writes per second per node. And by the way, all the benchmarks I’m sharing are available at this link from this third-party company that was doing their evaluation. Obviously, the benchmarks can vary from infrastructure to infrastructure and footprint to footprint, but that’s kind of a nice view of what Prometheus and all these databases can do as well. It also has first-class Kubernetes integration. And the community puts a lot of effort into integrating Prometheus exporters and generally providing a good experience for Prometheus. And it has a low operational footprint, which is really important for us because we cannot deploy databases that consume a lot of RAM or a lot of CPU or a really high [inaudible] rate, right, due to the fact that we have the distributed components.

Sasha Klizhentas 00:06:28.095 We ultimately didn’t pick Prometheus for one simple reason. There were many reasons, but this was the most important one and to highlight it, I quote Brian Brazil, Prometheus core developer. “Prometheus is not intended as durable long-term storage. It is fundamentally limited to the size of your machine. You should also design your monitor to be able to tolerate completely losing the data of a Prometheus.” That is not the case for us because we have to guarantee to customers, for a variety of reasons, security is one of them, audit purposes, and just resiliency, that we retain the data for a certain amount of time, and we can configure retention. For some customers, we store it for several years. For some customers, it is several months. But none of them is okay with losing this data if you lose the machine. Right? And that was the big problem for us. That’s why we didn’t pick Prometheus. The second problem is the single machine part, right? So the absence of data retention and the absence of high availability storage. The second class of databases we have looked at is highlighted by OpenTSDB. OpenTSDB is a Time Series Database, right? It’s based on either Cassandra or HBase as a storage backend. It is also really performant. In the benchmarks that I shared with you, and I will send you the link after the webinar with this presentation and all the links and material so that you can share it with the audience as well, it supports 80,000 writes per second, right? And I think you can actually go beyond that because my former colleagues at Rackspace were working on a similar system called Blueflood, also based on Cassandra, and they were able to reach a million writes per second or more for their infrastructure monitoring all of Rackspace cloud, right?

Sasha Klizhentas 00:08:17.212 We didn’t pick this one either, mostly because of its huge operational footprint, right. It’s based on HBase, it brings in ZooKeeper, it brings in Java, and if it breaks, it’s really hard to troubleshoot just because it’s a really complex system. So it will work well, I think, if you have a really big cluster that you have to monitor, and you have a team dedicated to monitoring it. But it won’t work well if you have dozens or hundreds of clusters of OpenTSDB distributed everywhere. And last but not least is InfluxData. It has a lot of pros, right, that we were attracted to. It’s also very performant. It supports 350,000 writes per second and it also has a very low operational footprint. It’s written in Go. It has low memory usage even on a bigger footprint and higher scale, as we noticed. It has a very important feature for us. It’s called Retention Policies and Rollups. And that’s what we were able to roll out for our customers as well: guaranteeing that we retain certain data for a certain amount of time and providing smart rollups that our customers can control as well. And it has a really good upgrade story, right? It has both an Enterprise on-prem offering and a managed cloud version. Both come with clustering support, which means that our data center customers can upgrade to the Enterprise version if they want high availability, and they can go to the managed cloud version if they are on their private AWS account. And obviously InfluxData’s not perfect, right? It has cons as well that we have evaluated. Some of them I would like to highlight. Right now, it has slightly weaker Kubernetes integration than Prometheus. The other one is that Kapacitor is actually a really hard tool for us to use and our field team constantly tells us, “Hey. Kapacitor’s a really powerful tool, but can you explain to us how to use it properly?” Right? And it has no open source clustering built in. So that was the con.
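
To make the retention and rollup idea concrete, here is a minimal Python sketch. It is not InfluxDB's implementation and does not use its API; the function names, the `(timestamp, value)` point format, and the windowing by mean are all assumptions made for illustration.

```python
from collections import defaultdict


def rollup_mean(points, window_seconds):
    """Downsample (timestamp, value) points into fixed windows by mean.

    Each point is assigned to the window that starts at the largest
    multiple of `window_seconds` not exceeding its timestamp.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def apply_retention(points, now, max_age_seconds):
    """Keep only points that are at most `max_age_seconds` old."""
    return [(ts, value) for ts, value in points if now - ts <= max_age_seconds]


raw = [(0, 1.0), (30, 3.0), (60, 5.0)]
one_minute = rollup_mean(raw, 60)          # coarser series kept for longer
recent_raw = apply_retention(raw, 90, 60)  # raw series kept only briefly
```

The design point from the talk is the combination: raw data can expire quickly while the rolled-up series is retained for the months or years a customer asks for.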

Sasha Klizhentas 00:10:24.126 So in spite of all those problems and cons with InfluxData, we still picked it because it’s a unique product that provides these features like clustering, commercial support, and data retention that matter a lot for us. That’s why we picked InfluxData and that’s why several years ago, I think in 2015, we rolled out our first integration with InfluxData for the Kubernetes clusters that we deploy. Now we’re going to the second part, which tells a little bit about monitoring Kubernetes with TICK Stack. And I will go through the different components of Kubernetes that we monitor. And before we do that, actually, I wanted to break the cluster as I mentioned. So what I’m going to do now-I’m going to spin up this little program called Busyloop and the only thing that it does actually-I run Python and it just starts a couple of threads of Busyloop just to consume CPU, and I wanted to show how we get an alert after that happens. So let me wire this up. All right, so while it’s working it will get us some data in our cluster monitoring tool that I’ll show you here. But I want to give you a little bit of an overview of how exactly we monitor Kubernetes with TICK Stack. So when we deploy Kubernetes, right, we roll out several monitoring tools. The first one is actually Kubernetes itself. It collects the data and it has a lot of checkers internally that can provide quite interesting information about the cluster state.
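
The Busyloop program itself is not shown in the transcript. A minimal sketch of a similar CPU burner might look like the following. The function names and the time limit are assumptions, and the sketch uses processes rather than the threads mentioned in the talk, because CPython threads share the global interpreter lock and would not keep more than one core busy.

```python
import multiprocessing
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop for `seconds` and return the iteration count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def run_busyloop(workers: int = 2, seconds: float = 60.0) -> list:
    """Start `workers` processes that each burn CPU for `seconds`."""
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(burn, [seconds] * workers)
```

Calling `run_busyloop(workers=2, seconds=60.0)` from a script's main entry point should keep roughly two cores busy for a minute, which is enough time for a CPU alert to evaluate.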

Sasha Klizhentas 00:12:34.298 For example, we have filesystem monitoring, we have actual internal monitoring inside Kubernetes, and a variety of different checks for whether the node is ready or not. On top of that, we have a distributed checker called Satellite. So this Satellite is a stable system. It is deployed in the cluster. And it is designed to combine checks from Kubernetes, from the operating system, from all these various components. And it communicates through a peer-to-peer protocol. That’s an open source system that we have. And its only purpose is to send the measurements to InfluxData, and alert us if something goes wrong with these low-level components. For example, if InfluxData itself goes down, or Kubernetes goes down, we will still get some information out of Satellite, which uses the peer-to-peer protocol and propagates this information across the cluster. So as long as we’ve still got one of them running, we still get some information about what went wrong. But for high-level analysis, for metrics, measurements, we send this all to InfluxData after the fact. So, as I mentioned, I’m going to break the cluster. And while waiting for the alert to arrive, we’ll look at the filesystem and different aspects of the stack that we’re monitoring. Kubernetes is a distributed system. However, it runs on pretty standard Linux setups. That’s why we have to monitor the operating system itself, right? And that’s why we monitor the filesystem, the network, and other components of Kubernetes, like Etcd and Docker itself.

Sasha Klizhentas 00:14:21.880 For example, when we monitor Kubernetes-when we monitor the filesystem, Docker uses inodes, right. With one of Docker’s filesystem storage layers, it tends to consume a lot of inodes. And we’ve seen clusters go down a lot because of that. To monitor that, we simply collect filesystem inode measurements, and we also have the overall inodes available, right? So whenever it goes through a certain threshold, we will get the alert. We’re also monitoring different components of the networking stack. An example of the networking components that we monitor is bridge netfilter. Kubernetes uses the bridge netfilter plugin, which is a general plugin that allows it to filter the packets in one of the Linux bridges, right, for different configurations. So let’s imagine that security teams turn this off, right, and that actually happens in on-premise infrastructures quite a lot. The side effect of that is that the cluster goes down because it doesn’t have the overlay network functioning. But it’s actually really hard to troubleshoot because it’s not clear what’s going on. If you, let’s say, start tcpdump and start looking at the packets, you are not always seeing it really fast. For example, one of our engineering teams, the implementation services team, once spent four hours trying to troubleshoot. And it turned out that it was just this little internal plugin turned off. That’s why we wired up this deadman in Kapacitor, which is, if you are not familiar with it, a special checker that alerts on the absence of certain data. So when we don’t get an alert-when we don’t get this information that bridge netfilter is on, the deadman fires in Kapacitor and we get this back.
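
The deadman idea, alerting on the absence of data rather than on a bad value, can be sketched in a few lines of Python. This illustrates the concept only and is not Kapacitor code; the function name, arguments, and the 60-second timeout are hypothetical.

```python
from typing import Optional


def deadman_fired(last_seen: Optional[float], now: float, timeout: float) -> bool:
    """Return True when no data point has arrived within `timeout` seconds.

    `last_seen` is the timestamp of the most recent data point, or None if
    nothing has ever been received for this check.
    """
    if last_seen is None:
        return True
    return (now - last_seen) > timeout


# A node last reported its bridge netfilter status at t=10; at t=100 with
# a 60-second timeout the check counts as silent and the deadman fires.
silent = deadman_fired(last_seen=10.0, now=100.0, timeout=60.0)
```

The useful property is that a misconfigured or dead reporter produces an alert even though it never sends a "bad" measurement.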

Sasha Klizhentas 00:16:12.756 We did the same for Etcd. Etcd is the database that is used by Kubernetes itself. It’s a distributed, highly available database. It’s usually deployed in clustering mode. However, it’s also pretty hard to operate, because it is really sensitive to certain parameters of the filesystem. And that’s why we monitor Etcd in the first place, so when an Etcd cluster doesn’t function properly, it also has an internal self-check. Right? And it tells itself, “Hey, I’m not available,” and we will alert both ourselves and the customer that something is wrong. You can see that it will send an email through Kapacitor or we’ll just output to the log, which customers can look at and forward to their own monitoring system. We do monitor different aspects of Etcd. Not only do we monitor the liveness of the cluster, we also monitor different low-level components of Etcd like fsync latency. To give you a bit of an overview about that part, imagine an Etcd cluster doing fsyncs or something. So it writes something to disk and there’s a latency spike. For example, you’re using EBS, or a slow disk, or a network filesystem. Right? In this case, the Etcd cluster breaks because it is not able to flush its data, to fsync the data, in a certain amount of time. And you will notice the cluster breaking in all sorts of strange ways. It’s really easy to troubleshoot if you know what you’re looking for. That’s why we wrote the Kapacitor alert that will trigger this warning on high latency spikes. We monitor Kubernetes itself. As I mentioned Satellite, our checker, we’re thinking about Satellite as, if you are familiar with this tool called Monit, a distributed Monit that basically collects all those metrics and forwards them. We also take self-assessment information from Kubernetes itself, and whenever a Kubernetes node is not ready, it can also send an alert, and we display this in the dashboard as well. However, in some cases, it is not really viable to send an alert. If you have a 30-node cluster and 1 node went down, you probably don’t want to get an alert as you will want to look at this later.
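
The idea of not paging anyone for a single failed node in a large cluster can be sketched as a small severity function. This is an illustration in Python, not the Kapacitor rule used in the talk; the function name, the level names, and the strict-majority cutoff are assumptions.

```python
def alert_level(ready_nodes: int, total_nodes: int) -> str:
    """Map cluster node readiness to an alert level.

    "ok"       : every node is ready
    "warning"  : some nodes are down but a strict majority is still ready
    "critical" : half or more of the nodes are down
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    if ready_nodes >= total_nodes:
        return "ok"
    if ready_nodes * 2 > total_nodes:
        return "warning"
    return "critical"


# One node down out of thirty: worth an email, not a page.
level = alert_level(ready_nodes=29, total_nodes=30)
```

A "warning" could then be routed to email while only "critical" goes to a paging service.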

Sasha Klizhentas 00:18:22.561 So for this, we also use Kapacitor to say, “Hey, the majority of the cluster is functioning, so I’m not going to trigger an alert to trigger PagerDuty. But, I’m just going to send a warning that will go through other ways.” Right? For example, just send an email without triggering PagerDuty, just because it arrives at a different email address. And I wanted to show you how we ship these alerts. Right? With Kapacitor, you can write TICKscripts, right. You can send these tasks to Kapacitor. Well, we actually use Kubernetes itself to distribute those alerts. In Kubernetes, there is a configuration map object called ConfigMap, and all our alerts are structured as these ConfigMap objects. So I want to show you a little bit of what it looks like. If you notice, there is a ConfigMap called Kapacitor alerts. The way to think about ConfigMaps, if you are familiar with Chef or Puppet, is that they have these data bags. Right? This is very similar to them, so if you look at one of those alerts, you will see that we have a ConfigMap with all of the alerts in there, and our automation that runs inside Kubernetes scans for these new configuration maps created by us, by our implementation services teams, by customers. It automatically adds these alerts to the system, to Kapacitor, as tasks, right? It’s a very convenient way because then our implementation services team, we don’t need to think about Kapacitor itself, we can focus only on Kapacitor alerts. Right? Without thinking, “How do I interact with Kapacitor? Do I have permission to talk to Kapacitor? Do I have permission to create this alert?” So Kubernetes provides an interesting frontend for this, because I can now use role-based access control in Kubernetes to create these alerts. And if I’m not authorized to create these alerts, I won’t be able to create them. If I’m only authorized to view these alerts, I won’t be able to do this. So this is the [inaudible], the ConfigMap header. And all these alerts are in here. I will create another alert during this demo, for one of the applications running in the cluster, and we’ll show you how easy it is to create Kapacitor alerts in the third part of the presentation.
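
The exact ConfigMap used in the demo is not reproduced in the transcript. As a rough sketch of the shape such an object could take, the Python below builds a ConfigMap manifest as a dictionary and prints it as JSON. The label key and value, the data key `alert.tick`, the object name, and the placeholder script body are all hypothetical; only the top-level ConfigMap fields (`apiVersion`, `kind`, `metadata`, `data`) follow the Kubernetes API.

```python
import json


def alert_configmap(name: str, tick_script: str,
                    label_key: str = "monitoring",
                    label_value: str = "alert") -> dict:
    """Build a ConfigMap manifest that carries one alert definition.

    The label is what a watcher inside the cluster could select on to
    discover new alerts; the key and value here are made up.
    """
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "labels": {label_key: label_value}},
        "data": {"alert.tick": tick_script},
    }


manifest = alert_configmap("pod-high-cpu", "<TICKscript body goes here>")
print(json.dumps(manifest, indent=2))
```

Because the alert is an ordinary Kubernetes object, the cluster's own access control decides who may create or view it, which is the convenience described in the talk.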

Sasha Klizhentas 00:20:44.985 So let’s see if we’ve got an alert and check in for a quick-we haven’t got it yet. But, let’s check in a moment. Anyway, let’s look at the data that we have in the cluster. As I mentioned, we are using TICK Stack. Right? So we’re using InfluxData on the backend, we’re using Kapacitor and we’re using Telegraf. On the frontend, we’re not using Chronograf, as we basically integrate Grafana. And I want to show you a little bit of how it looks from outside. We have an overall cluster usage tab that shows you the liveness of the whole cluster, right? As I started this Python process that consumes two cores on my two-core machine, right-well, almost two cores. We can actually notice that we’re now getting 1.6 thousand millicores. In Kubernetes terms, one millicore is roughly one-1,000 millicores is roughly 1 core. So we can see that we are now consuming 1.6 thousand millicores, so roughly 1.6 cores, out of the 2-core cluster. And we can actually pinpoint the node that takes all this CPU consumption, right? Which is pretty helpful, and right now we’ve actually got a warning about the high CPU usage on those nodes. So it’s really helpful because we can now say, “Hey, this is exactly the server that we were having problems with,” so we don’t actually have to look through a 30-node cluster to see which one is currently misbehaving. So Kapacitor’s quite helpful for that.
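
The millicore arithmetic mentioned here is simple enough to write down. A small Python sketch, with hypothetical function names, using the numbers from the demo:

```python
def millicores_to_cores(millicores: float) -> float:
    """Convert Kubernetes millicores to cores (1000 millicores = 1 core)."""
    return millicores / 1000


def cpu_utilization(used_millicores: float, capacity_cores: float) -> float:
    """Fraction of the available CPU capacity currently in use."""
    return used_millicores / (capacity_cores * 1000)


cores = millicores_to_cores(1600)      # 1.6 cores
fraction = cpu_utilization(1600, 2)    # 0.8, i.e. 80% of a 2-core cluster
```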

Sasha Klizhentas 00:22:38.706 All right. I wanted to move to-before we move to the third part of the presentation, I want to show you a little bit about this overall cluster health tab we have. Another thing we monitor besides millicores is RAM consumption. So we have overall memory usage, and, thanks to InfluxData, we can say, “Here’s the overall cluster memory available to us,” through the aggregate measurements emitted by Telegraf on the nodes. We also monitor filesystem usage, network usage, and other different low-level components in different tabs, right? We also have not only a cluster-wide view but an application-centered view that is built around pods. And I will show you how we monitor pods in a moment. But before that, let me actually shut down this [inaudible] process that consumes all this CPU before moving on to the third part.

[silence]

Sasha Klizhentas 00:24:00.459 Let’s see, I probably actually don’t have anything running right here. Oh, no, here it is. All right. So we are done with this busy-looping process, and let’s move on to the third part of the presentation. And the third part focuses on monitoring applications with TICK Stack. The application, in Kubernetes terms, is basically a container. It’s a Docker container that is deployed in the cluster, and Kubernetes takes care of its replication factor through its configuration settings. And due to the nature of deployments inside Kubernetes, imagine you have a 100-node cluster. And somewhere on these nodes there is an application that doesn’t behave correctly, and you want to find it and understand what’s going on with it. But instead of jumping on all these 100 nodes, you have to have really powerful automation to show you where this application is running in the first place. And most likely, you want to spot an anomaly for this application, right? Not look at all these hundreds of containers deployed there on your hundred-node cluster, or even thousands of containers deployed there. So if you’re working with a system like Kubernetes, it really raises the bar for your monitoring game, right? And you have to spend a lot of time instrumenting this. So I want to show you how easy it is with Kapacitor to spot the application that doesn’t behave correctly. To do that, actually, I wrote another application called Loop. It is actually very similar to the previous one. It’s actually way simpler than that. It basically does nothing except looping in this “while True” Python loop. It takes roughly one CPU out of a 2-CPU node, right? And I also wrote this Dockerfile. You see it is very easy to deploy something to Kubernetes that will behave incorrectly. So let’s do just that.

Sasha Klizhentas 00:26:09.079 I’ve pre-built it, so the build is very fast. I’m going to push this to the registry. And if you’re not familiar with this concept, basically what I’m doing right now-I’m building the Docker container. I’ve built it, and I push this Docker container to the cloud registry hosted by quay.io. And now this container’s available there with my Python process and I’m going to instruct my Kubernetes cluster to deploy it through the deployment object. So that’s a little bit of Kubernetes 101. Let’s look at this object. It has a lot of different parameters, but we are mostly concerned about the fact that we are going to use this container. As you see, I’m referencing this container by full name in the quay.io registry service and I only need one replica of it, but I can also scale it up to many replicas after the fact. So let’s do this. When I’m creating the deployment in Kubernetes, it basically creates these containers for us. And it groups the containers in this concept, a very specific concept, called pods. So a pod, in Kubernetes, is a group of containers, right, that are running together. In our case we have only one container on one node, but you can easily imagine that there can be many different containers running in the system, both as a part of a single pod, or as many, many different pods distributed across your infrastructure. All right, we have deployed this, and now we can go to the application-centered view, right? We’re going to our built-in Kubernetes tab. We probably won’t be able to locate this, and we will go to the monitor tab. It will start to collect the data shortly and we’ll wait for it, but meanwhile, while we’re waiting for this data to start aggregating and collecting, I want to go back to the Kapacitor part of this again, to show how easy it is to create a Kapacitor alert inside the Kubernetes cluster, [inaudible] and understand what’s going on with individual pods running.
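
The deployment object from the demo is not included in the transcript. A minimal sketch of an equivalent manifest, built as a Python dictionary, might look like this. The image path `quay.io/example/loop:latest`, the object name, and the `app: loop` label are hypothetical; the field layout follows the Kubernetes `apps/v1` Deployment API, which is an assumption about the API version since the talk does not state it.

```python
import json


def loop_deployment(image: str, replicas: int = 1) -> dict:
    """Build a Deployment manifest that runs `replicas` copies of `image`."""
    labels = {"app": "loop"}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "loop", "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "loop", "image": image}]},
            },
        },
    }


deployment = loop_deployment("quay.io/example/loop:latest")
print(json.dumps(deployment, indent=2))
```

Scaling up after the fact, as mentioned in the talk, corresponds to changing `spec.replicas`.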

Sasha Klizhentas 00:28:22.443 So let’s do just that. I’m going to go to one of the nodes where I have prepared this alert already. And as I mentioned, that’s the ConfigMap object, although we have a little bit of a wrapper around it as well, to help us to create this alert. And let’s see, because we know that this pod will consume a lot of CPU, we can create something that will help us to catch those problems in the future, right? Let’s look at the Kapacitor measurements and Kapacitor streams that will help us to catch this problematic pod. For the first one, we will collect the usage rate from the overall cluster, right? And we will consume the node capacity that Kubernetes nodes emit in millicores, right? And they provide this InfluxData measurement tag where you can see the node name and the node capacity in CPUs, right? So that’s one data stream. The other data stream that we are going to use is the usage rate grouped by pod and node, right? And so every pod will emit how many millicores it uses at that particular moment in time, right? And we’re collecting the measurements of the type pod and grouping by those two tags, the pod name and node name. And at the end we’re basically calculating the usage rate, so we’re trying to catch any pods running inside the Kubernetes cluster that consume more than a certain percentage of the CPUs available on this node. And then we’re going to trigger a warning if it exceeds certain thresholds that we just set up there. So it’s a very simple tool, a very simple alert, and let me create it.
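
The computation described here joins two streams, per-node capacity and per-pod usage grouped by pod and node, and flags pods whose share of node CPU crosses a threshold. The Python sketch below shows that logic on plain data structures. It is not the TICKscript from the demo; the function name, the input formats, and the 0.4 threshold in the example are assumptions.

```python
def pods_over_threshold(pod_usage, node_capacity, threshold):
    """Return (pod, node, ratio) for pods using too much of their node's CPU.

    pod_usage     : iterable of (pod_name, node_name, used_millicores)
    node_capacity : dict mapping node_name to capacity in millicores
    threshold     : fraction of node capacity that triggers a warning
    """
    flagged = []
    for pod, node, used in pod_usage:
        capacity = node_capacity.get(node)
        if not capacity:
            continue  # no capacity reported for this node, cannot judge
        ratio = used / capacity
        if ratio >= threshold:
            flagged.append((pod, node, ratio))
    return flagged


usage = [("loop", "node-1", 999), ("web", "node-1", 50)]
capacity = {"node-1": 2000}
warnings = pods_over_threshold(usage, capacity, threshold=0.4)
```

With these sample numbers only the `loop` pod is flagged, since it uses roughly half of the node while `web` uses 2.5%.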

Sasha Klizhentas 00:30:24.004 So this is a tool called Gravity. This is just a little wrapper around Kubernetes that effectively creates this ConfigMap with the proper label that I’ve shown to you before. And our automation that runs in the cluster reconfigures Kapacitor. So I want to show you actually how to list Kapacitor tasks, so one of those Kapacitors is running in the cluster right now. And as you notice, here’s the Kapacitor that was automatically configured with our pod-high-cpu Kapacitor task.

And it basically takes this pod-high-cpu Kapacitor name, goes to Kapacitor, reconfigures this task and creates it. So it probably will-after a certain amount of time, we should get an alert about this pod consuming a certain amount of CPU. So let’s go to our monitor tab and see if we actually get the data by this time. Yeah. So the scale is a little bit off, but we can see that we’re using 999 millicores, which is roughly 1 core. And we’re not using a lot of memory at this particular moment, just around 2.4 megabytes in Kubernetes terms. And we’re not using a lot of network or anything else. So if we know which one of those pods behaves incorrectly, we’d figure out, “Oh, okay. So it uses 100% CPU.” However, on a large cluster, we should get something from an anomaly detection system that should alert us about the pod. Let’s see if we got an alert. And that’s what we’re getting. You see we’ve consumed half of the capacity of this node, that’s why we got this alert from the system.

Sasha Klizhentas 00:32:12.053 And that constitutes my demo. I think we just went through the monitoring-how easy it is to monitor applications with TICK, how easy it is to deploy alerts-Kapacitor alerts-inside the cluster. And I will move on to your questions.

Chris Churilo 00:32:31.224 Oh, thank you so much. That was really great. Very thorough and a very interesting use case for InfluxDB. So at this point, if anybody has any questions, feel free to put them in the chat or the Q & A panel. I don’t see any right now, but please don’t feel shy, we’d love to make sure that we get your questions answered here. So, Sasha, how long did the evaluation process take you guys?

Sasha Klizhentas 00:33:03.776 Well, I think in 2015 we spent several months just looking around at different systems. We were not in an immediate hurry. Right? So we built a bunch of prototypes. We had a bit of experience with InfluxData before. So before Gravitational we were working at a company called Mailgun, a startup later acquired by Rackspace. That’s how we know the Blueflood team, because we were working in the same office. So we did experiments with these different databases. We looked at them for a couple of months, yeah, before making the final decision. We looked at different benchmarks. It was helpful to look around and to look at this website-this helpful website-that has these big data sheets about their writes per second and different features. So that was helpful as well.

Chris Churilo 00:33:53.919 So if you were to do it all again, what advice would you give yourself?

Sasha Klizhentas 00:34:00.379 You mean the technical evaluation step?

Chris Churilo 00:34:01.768 And also the implementation as well.

Sasha Klizhentas 00:34:05.728 Right. So I wish we had done better Kubernetes integration of InfluxData from the start after the evaluation. Right? Because the integration that I showed you is really helpful, all right, both for us and our customers. That would be the advice I would give myself while doing this integration. And second, probably during the technical step, don’t worry much about the performance. Right? I was focused a lot on performance and writes per second, as you noticed from the tech diligence notes, right? Although for our use case, it was not really as important as the high availability and data retention that turned out to be really important for some customers. Unexpectedly actually, right? So people were asking, “Hey, we want this data to be retained for a year.” Wow. Why? Why would you want to do that? They’ll say, “During [inaudible] audits, right, we want to see what happened on the cluster at that time.” So that in part-that information is actually part of our audit system, that’s why we want to have it for a while. And that’s why we had to build these rollups that are configurable per measurement. There are also retentions. They’re also configurable, right? We wouldn’t be able to do this that easily with, I would say, pretty much any other database than Influx. So that was helpful. So don’t focus on performance, focus on actual customer requests, I would say.

Chris Churilo 00:35:25.838 That’s really good advice [laughter]. You have a question in the Q and A? Do you see it and-

Sasha Klizhentas 00:35:33.649 No. I actually don’t.

Chris Churilo 00:35:34.380 Okay. I’ll read it out loud for you then.

Sasha Klizhentas 00:35:35.670 Okay.

Chris Churilo 00:35:36.105 So as a relatively new company, do you have competitors and did InfluxData help you gain a technical competitive advantage over them?

Sasha Klizhentas 00:35:44.656 Yeah. I think you can say we have a couple of competitors. Because we’re Kubernetes, right, there’s probably a dozen companies that show up that also do a lot of Kubernetes stuff. However, we’re trying to explain to our customers that we’re not a [inaudible] or just a generic Kubernetes company, right? What we’re focusing on is actually distributing applications across this infrastructure, and we’re using Kubernetes as a tool, right? However, people compare us to many different companies out there. And there’s also a company called Replicated that has a somewhat similar product, although they’re not as focused on management; they’re focusing more on billing. InfluxDB helped us to get this operational angle, right? The competitive advantage, as I mentioned: number one, we appeal to companies that are really sensitive about audit logging [inaudible] security, right, with these retentions. And the second one is that because InfluxData’s highly available and we have a really good upgrade story, right, to the Influx product, we can always tell customers who care about their high availability and resilience, “Hey, just upgrade to cloud, right?” You just cut your operational cost because you don’t have to have a dedicated team supporting your OpenTSDB cluster, right? You can just completely off-load this to the Influx team, which is really helpful, right? That’s a really good angle, both for InfluxData and Gravitational.

Chris Churilo 00:37:19.625 So what served as the inspiration to start Gravitational?

Sasha Klizhentas 00:37:24.118 Is that one of those questions, or are you just curious?

Chris Churilo 00:37:26.140 No. I was just curious. Just listening just now I’m like, “How did this even get started?”

Sasha Klizhentas 00:37:32.173 Well, as I mentioned, we were working at Rackspace for a while, right, after the acquisition of Mailgun, and we noticed some of these companies coming in and saying, “Hey, we want this SaaS offering on our infrastructure.” And Rackspace was like, “Wow. Those are micro-services. How are we going to manage that?” And that was pretty much it, right? That is the inspiration for Gravitational and that’s how it actually came into existence. Just solving this problem, right? Having a hundred Kubernetes clusters all over the world that you don’t have access to, you have to monitor them, you have to make them resilient and highly available.

Chris Churilo 00:38:10.100 Let’s see. Raymon asks: “For Kapacitor alerting, if I heard correctly, you’re using the Kubernetes RBAC layer. Have you looked into using proxy authentication for Kapacitor and using the API interface for uploading TICKscripts?”

Sasha Klizhentas 00:38:24.461 No. We haven’t yet. It’s a very good idea. To be honest with you, we’re still learning a lot about Kapacitor itself, as I mentioned, right? We understand the Influx backend really well, but we still have a great deal to learn about Kapacitor itself. So we may look at this. I think the next step we want to take for the actual Kapacitor integration is to convert these ConfigMaps to third-party resources, in the previous notation of Kubernetes, or CRDs as they are called, I think, today. So it’s Custom Resource Definitions. Effectively, they allow you to build a domain-specific resource definition. So not just a random configuration map object stored in Kubernetes, but a Kapacitor-specific configuration object stored in Kubernetes. And for example, when we create this Kapacitor alert, we will get feedback from a controller of the whole system saying, “Hey, your alert isn’t correct.” Because right now, we don’t get this feedback. So that’s where we want to go. And maybe we will look at how to use a proxy for that purpose. But just to give everyone an overview of what RBAC in Kubernetes is: it helps to control access to those alerts. So role-based access control in Kubernetes-let me go back a couple slides here. So these different objects inside of Kubernetes are called ConfigMaps, deployments, and all different objects. So in Kubernetes, you can create roles and say, “This user or this group of users is authorized to create this object in the first place.” For the enterprise case, this is really good because you can deploy a cluster, authenticate a user to this cluster, and this user won’t be able to create it unless the user is authorized to do so. That’s why using configuration maps and Kubernetes objects is really helpful, right? Because when you’re using them, you can plug in this [inaudible] system. That’s what we want to do next with Kapacitor actually.
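
The access-control idea in this answer, alerts stored as ConfigMaps and guarded by Kubernetes role-based access control, can be sketched as three manifests: a ConfigMap carrying a TICKscript, a Role that allows managing ConfigMaps in one namespace, and a RoleBinding that grants that Role to a group. This is a hedged illustration, not Gravitational's actual layout: the namespace `monitoring`, the object names, the `alert.tick` key, and the group `alert-editors` are all assumptions. The manifests are built as plain dictionaries and printed as JSON, which `kubectl apply -f` accepts.

```python
import json

# Illustrative sketch only: the namespace, object names, data key, and group
# name are assumptions, not taken from the webinar.
NAMESPACE = "monitoring"
RBAC_GROUP = "rbac.authorization.k8s.io"


def alert_configmap(name: str, tickscript: str) -> dict:
    """Wrap a Kapacitor TICKscript in a Kubernetes ConfigMap manifest."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "data": {"alert.tick": tickscript},
    }


def alert_editor_role(name: str) -> dict:
    """Role that permits managing ConfigMaps in NAMESPACE and nothing else."""
    return {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group, where ConfigMaps live
                "resources": ["configmaps"],
                "verbs": ["get", "list", "create", "update", "delete"],
            }
        ],
    }


def bind_role_to_group(role_name: str, group: str) -> dict:
    """RoleBinding that grants `role_name` to every member of `group`."""
    return {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": NAMESPACE},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_GROUP}],
        "roleRef": {"kind": "Role", "name": role_name, "apiGroup": RBAC_GROUP},
    }


if __name__ == "__main__":
    manifests = [
        alert_configmap("cpu-alert", "stream\n    |from()\n        .measurement('cpu')\n"),
        alert_editor_role("alert-editor"),
        bind_role_to_group("alert-editor", "alert-editors"),
    ]
    for manifest in manifests:
        print(json.dumps(manifest, indent=2))
```

A user outside the bound group who tries to create the ConfigMap is rejected by the API server, which is the enterprise behavior described above. Note that this Role covers every ConfigMap in the namespace, so alerts would need a dedicated namespace to keep the permission narrow.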

Chris Churilo 00:40:26.684 All right. Any other questions feel free to put them in the chat or the Q and A. And then Sasha, if you have time, you should come over to our office on Wednesday. Michael, who’s one of the engineers on Kapacitor is going to be doing a workshop on Kapacitor. So maybe he can help you with some of this stuff.

Sasha Klizhentas 00:40:47.424 Awesome. Can I bring my whole team [laughter]?

Chris Churilo 00:40:50.033 I don’t know if we’ll have enough space. But-

Sasha Klizhentas 00:40:56.088 I’m just kidding. I’m just kidding. All right. Thanks for the offer.

Chris Churilo 00:41:00.914 All right. So we’ll hang out here for just a couple more minutes. Feel free to put your question either in the chat or the Q and A panel. And as I mentioned, maybe just to get everybody a little bit primed for some questions, we will post this video on our website. And we actually don’t have any webinars next week because of the Thanksgiving holiday week in the US. Just wasn’t able to get anything organized with our users. But we’ll start up again the week after that. But we do still have a training on Thursday. And those of you guys that are coming to InfluxDays this week, don’t forget there’s also the workshop on the following Wednesday. And Michael from the Kapacitor team is going to be there. He’s really excited to meet a lot of you guys. And he’s a fantastic trainer. If you guys have been on any of the trainings, he does a really great job. And he gets really, really geeked out when any of our users have any tough questions on Kapacitor. So please come and join us.

Sasha Klizhentas 00:42:17.649 We noticed that Kapacitor’s a really powerful tool, right, with those streams and the way it can join them and look at different windows. Although it’s actually really challenging too, right? You have to actually know what’s going on. So if they had training, it would be helpful. Do you host the webinar-style trainings too on Kapacitor?

Chris Churilo 00:42:36.218 Yeah. Yeah. So those are every Thursday we do a training. And so we go through a Getting Started series. It’s seven different trainings. And then on the 28th of this month, we actually have just all about Kapacitor, but it’s an advanced topic. So you do need to go through at least some of the previous trainings. But it’s all about, “Let’s help everyone be really efficient at this.”

Sasha Klizhentas 00:43:00.229 Nice. And do you have some sort of certifications that you plan to roll out for Kapacitor and just the Influx stack?

Chris Churilo 00:43:07.430 Not this year. But yeah. We’re hoping to do that next year. So that’s why we kind of came up with this series of trainings because we feel like these are the things that you need to go through to get a basic understanding and-

Sasha Klizhentas 00:43:20.521 Yeah. Yeah.

Chris Churilo 00:43:21.310 Yep. Okay. So let me also send everybody the link to the training. Going to my website. And then also, we’re going to be, of course, recording all the talks for InfluxDays. So we’ll make sure that everyone can get access to that. We’ll throw them up on YouTube so everyone can see that. Let me just throw in-so here is a link to all of our events including the trainings. So if you guys can join us, we would love to have you. All right. Well, with that, I think this was a really great very in-depth conversation about what you guys do and how you guys were also using InfluxDB and Kapacitor. And we really appreciate your time, Sasha.

Sasha Klizhentas 00:44:30.264 Thanks, guys, for having me.

Chris Churilo 00:44:31.465 And I will be posting this later and we’ll also do a nice little write-up. And if you guys do have any other questions for Sasha, just send me an email and I’ll just forward them on to him. And we’ll make sure that we get them answered. Awesome, everyone. Have a fantastic rest of your day.

Sasha Klizhentas 00:44:53.068 Yep. You too. Bye-bye.

Chris Churilo 00:44:54.143 Bye-bye.

[/et_pb_toggle]
