In this webinar, Jack Zampolin will go into depth on the advanced capabilities of Telegraf. In particular, he will go over how to
Watch the webinar “Telegraf Advanced Topics” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Telegraf Advanced Topics.” This is provided for those who prefer reading to watching the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Jack Zampolin: Developer Evangelist, InfluxData
Jack Zampolin 00:01.924 Chris, thank you very much. Hello, everyone. Welcome to the Advanced Telegraf Training. My name is Jack Zampolin. I’m going to go ahead and share my screen here. Okay. All right. Let’s get started. Oop, nope, not that one. Okay. Today we’ll be talking about Advanced Telegraf. So we’ve got three topics in today’s training. One is hooking Telegraf into a queuing system. As many of you may or may not know, Telegraf interacts data [inaudible] with a number of different queuing systems. Another topic is the recently merged Telegraf internal plugin to get statistics about Telegraf itself and then running Telegraf in Docker, a topic that a lot of people end up asking questions about. But before we start that, just a brief overview of what is Telegraf, what have we here? Telegraf is part of the InfluxData TICK stack. Each of that T-I-C-K stands for one of our products. It’s a full stack of products for working with time series data, so anything with a timestamp. As you can imagine, in a case where you’re monitoring a lot of servers, or you have a lot of sensors that you’re recording data from, all that data’s timestamped, and the way you’re going to want to query it is by time and see some nice graphs on something like a Grafana or Chronograf.
Jack Zampolin 01:47.038 Each one of the TICK Stack components does something different. Telegraf is a collector agent, so it’s how you get your data out of each different place. It’s kind of like the Swiss army knife in the stack. It’s for anywhere you need to pull data from: if you have a little Python script that’s writing something to standard out, if you’ve got an API that exposes data at an endpoint, pretty much anything: Java, kernel metrics, any of those kinds of things. So, awesome. You’ve probably been to some of our prior Telegraf webinars or used Telegraf before, so let’s just dive right in.
Jack Zampolin 02:28.839 One of the very cool things about Telegraf is the ability to hook into different queuing architectures. And this diagram right here explains how it works. In this example, it says Kafka broker in the middle. It could say NSQ, or Rabbit, or a couple of other different queuing providers, but this is the rough architecture diagram. So on the left side, you’ve got Telegraf_producers. So these are Telegraf instances that are collecting data from a number of different servers. They’re probably widely deployed on a bunch of different devices or servers and they’re writing data directly to a Kafka broker. And then on the other side of that, you’d need to get the data out of the Kafka broker and into InfluxDB for graphing and storage. And you’ve got the Telegraf_consumer doing that. In the way that we’ve written these different consumer plug-ins, they are safe to run multiple copies, consuming off the same topic to distribute the work. So a classic queuing pattern of fanout or fanin, you can do either of those here. The advantages of this architecture are that, normally when you’ve got a lot of different Telegraf producers, so Telegraf instances that are producing data, they’re writing batches of maybe a couple of hundred at most on every 10-second interval. That is a sub-optimal batch size for insertion into Influx. Optimal batch sizes are between 5,000 and 10,000 values per batch. This really significantly increases ingest rate into Influx.
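To put rough numbers on the batching argument, here is a minimal sketch in plain Python. It is not Telegraf code: `regroup` is a hypothetical helper, and the figures (400 producers, 200 points per flush, a 5,000-point target batch) are illustrative values in the ranges mentioned in the talk.

```python
from typing import Iterable, Iterator, List


def regroup(points: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Collect points from many small writes into batches of batch_size."""
    batch: List[str] = []
    for point in points:
        batch.append(point)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch


# Illustrative numbers only: 400 producers, each flushing 200 points per interval.
producers = 400
points_per_flush = 200
all_points = [
    f"cpu,host=h{p} usage={i}"
    for p in range(producers)
    for i in range(points_per_flush)
]

# Writing directly: one small HTTP request per producer per interval.
direct_writes = producers
# Going through a queue and consumer: the same points regrouped into large batches.
batched_writes = len(list(regroup(all_points, 5000)))

print(direct_writes, batched_writes)
```

Under these assumed numbers the database would see 16 large writes per interval instead of 400 small ones, which is the effect the consumer side of the queue is meant to achieve.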
Jack Zampolin 04:19.015 So let’s say you’ve got around 400 servers and you’re pointing all of your Telegraf instances directly at Influx. There are a couple of ways that you can handle that. One is changing the flush jitter, so when those Telegraf instances flush within the flush interval. Or maybe you’re flushing them less frequently, because the real problem here is HTTP overhead coming into the database. The database has to handle a lot of those HTTP connections, and that ends up taking memory and resources on the database box. It doesn’t leave it as many resources for persisting that data to disk. So giving it larger batches relieves it of some of that pressure and allows it to be more performant. So batching, in this case, is important. Another nice feature: a lot of these queuing systems have persistence of their own, some sort of log, and the ability to replay that log into other programs is very important. That’s another use case. In the same way, InfluxDB is a great events database. So if you’re keeping track of all your events, maybe you want to count the number of purchase events or promotion click events that your customers are generating. You’re storing those in Influx for BI reasons or otherwise, and you might want to actually do something with those events. So you’re putting them into Kafka and you’re reading them with multiple different consumers into different areas.
Jack Zampolin 05:56.726 So one’s going to your application for processing. The other one’s going to InfluxDB for storage. These are all reasons that you could be using a queuing system with Influx. The supported queuing systems that we have are, first, AMQP. That’s RabbitMQ. Big in the Ruby community, written in Erlang. It’s the most widely deployed open source queue according to their website. I’ve used Rabbit a couple of times and I actually wrote the RabbitMQ integration for Telegraf, so, well, the AMQP consumer as well as the producer. If you’re interested in that, I’d be happy to talk about it. Then Kafka, which is a business queuing system, a message bus primarily for very high-powered real-time data applications. Kafka is unique in that it’s backed by ZooKeeper, and ZooKeeper is able to run multi-data-center, multi-AZ. Kafka is really meant as the super-high-featured message bus to power applications that run across the globe. Very cool project if you’re not familiar with it. It’s extremely complicated. By the way, I also just want to go ahead and say, if you have any questions while I’m doing this presentation, please ask them. It helps me to answer your questions. So please just drop those in the Q and A, and Chris will help me out with that.
Jack Zampolin 07:34.488 So we’ve also got MQTT. This is a lightweight messaging protocol. It’s common in IoT applications and makes it very easy to send data over low-bandwidth connections. NATS and NSQ are both sort of more modern takes on queuing systems, and we provide native hooks into those. Now, what are the reasons to use queuing with Influx? I mentioned earlier: batching, more efficient writes, durability. So the ability to replay the record. Maybe you want to write with Telegraf and read directly into a Kapacitor instance as well as InfluxDB at the same time. That’s very easy to do. Maybe you have an existing data pipeline using one of the streaming technologies and you want Influx to be an additional sink or source along that data pipeline. This is a great application, and we do have many folks using Influx for that. And then, again, it allows data to be used in real time by multiple applications. So maybe you’ve got events that you’re kicking over to your application to do event processing, as well as storing them in Influx to do event analysis. So if we look at a sample configuration here, these are the two plugins for NSQ, the consumer and the producer. These are pretty typical of the different queuing plugins. And I’m going to pop out of this presentation and show you the Kafka example live here in a second as well. You just need to give it a server, a topic, and a channel. Those topic and channel abstractions differ depending on which queuing system you’re using. Max in flight is specific to NSQ again, and that’s the queue depth. So that’s setting the maximum queue depth before we start returning errors, and then there’s the data format. That’s what I do when I highlight.
Jack Zampolin 09:42.558 Telegraf supports a number of different data formats: Influx line protocol, Graphite, JSON, and I believe a couple of others as well. So you can be writing data into that queue in any one of the supported data formats and read it out with that consumer as well, just by switching the data format and adding a little bit more configuration. That makes it very powerful if you have, say, existing applications that are emitting Graphite. Being able to codelessly consume that data into Influx is very, very nice. The same goes for many applications that emit JSON: if it’s relatively flat JSON, parsing it using the Telegraf plugin is pretty easy. And then in the outputs, you just give it a topic to publish to and a data format. Again, Telegraf supports a number of different data formats. If, for some reason, you want to use Telegraf to output that data in a different format, let’s say StatsD format or whatever it is, you can do that as well. So what are the advantages of this? There’s a lightweight setup. You saw in the previous slide the configuration that you would need to write to do this. It is extremely lightweight. So this is the additional configuration you would need to add if you’re writing directly to InfluxDB right now. Let’s say you’ve got some other inputs and then an InfluxDB output. You would break your inputs out into one file, put the NSQ output with them, and run one Telegraf instance like that. And then the NSQ input you would put with your InfluxDB output configuration, and it would write at the end like that. It is extremely easy to set up. There are no additional pieces of infrastructure. You’re not writing your own producer and consumer to do this. There’s less code to maintain, and it’s very easy to scale.
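As a concrete picture of one of those data formats, the following is a minimal sketch of serializing a point to Influx line protocol (`measurement,tags fields timestamp`). It is an illustration in plain Python, not Telegraf's serializer; the function names are hypothetical, and edge cases such as NaN values or empty field sets are deliberately not handled.

```python
from typing import Dict, Optional, Union

FieldValue = Union[int, float, bool, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape every character listed in chars."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value: FieldValue) -> str:
    """Render one field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslash is escaped before the quote.
    return '"' + _escape(value, '\\"') + '"'


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: Optional[int] = None,
) -> str:
    """Build a single line: measurement,tag=v field=v [timestamp]."""
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # sorted tag keys give a stable series key
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(name, ",= ") + "=" + format_field(value)
        for name, value in fields.items()
    )
    line = head + " " + body
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


print(to_line_protocol(
    "cpu",
    {"region": "us-west", "host": "web 01"},
    {"usage_idle": 97.5, "cores": 4},
    1500000000000000000,
))
```

A producer that writes lines like this into a topic can be read by a consumer whose data format is set to the same format, which is all the pairing between the two sides requires.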
Jack Zampolin 11:45.603 As far as on the producer side, obviously, queuing systems are designed so that they could handle large numbers of producers. You just add more producers if you need to. And then as your consumers become overloaded, let’s say you’re running your consumers on one CPU boxes, let’s say you start using all that CPU, adding more consumers is as easy as just adding another box and running that same configuration for Telegraf on it. So it is very easy to scale. The main disadvantage of that is that all metrics are stored in memory. And again, this is on a topic of using Telegraf as a queue. So Telegraf has the exact same API as InfluxDB available as the HTTP listener, and it can output that same format as well. So you can chain it together in sort of a tree structure. This is extremely useful in the case where you’ve got metrics in these isolated environments and you want to feed them up into a single Telegraf instance, write them over the Internet to your database. So you can see that diagram there. And this is how you would use Telegraf as a queue. I messed that up earlier. I’m sorry. If anyone’s got a misunderstanding, please ask some questions. So on the left, again, we would have Telegraf instances, writing. There would be a load balancer in front of your Telegrafs that you’re using as a queue. And then, those are going to write into InfluxDB.
Jack Zampolin 13:32.201 And this is the configuration for that there. It’s essentially just a pass through. If you’re not using one of these queuing technologies and you’re in a case where you’re writing too many points to Influx, this is a great way to pick up the slack and do some of that batching yourself. Obviously, because, again, Telegraf does hold this in memory. If one of these boxes fails while you’re writing, you are going to lose a few metrics. Generally, it’s going to be a very, very low loss rate though. So let’s go ahead and see what the queuing configuration with Telegraf looks like using Kafka. So I have an InfluxDB instance running on my local machine. I have two Telegraf configurations, and I’m running Kafka on my local machine using Homebrew. This is really easy to do. It’s sort of a local development, Kafka. Think of it that way. So I’m going to start up the… so the producer looks like this. That’s not good. I messed that one up. Anyway, you get the idea there. I do not have this demo right now, it appears. We’ll do the Docker one next. Okay. Telegraf inputs internal. This was a recently merged plugin for Telegraf. It gives you statistics about how Telegraf works internally. It allows you to monitor the Telegraf collection by plugin type and alert on any issues. So if you think about maybe you want to alert if you’re getting way less points than normal. So with a plugin like Docker, the number of values you’re getting depends on the number of containers you’re running above it. There’s a number of other plugins that operate like this. That’s a very easy alert to set up. There’s a lot of data that this internal plugin gives you.
Jack Zampolin 16:09.423 There’s a few different measurements, and we’re going to run through all of them and talk about which stats we have in there, so we can think about what kind of alerts we would write against this. So the internal agent measurement has gather errors, dropped metrics, gathered metrics, and written metrics. So you can think of a number of different alerts you could write on this whenever there are gather errors. So on the gathering side of things, maybe we’re pulling an API and that API is down, we would want to know that. In the case where we’re just collecting metrics off the host, if we’re getting any gather errors, there’s something drastically wrong so we would want to alert on that. Dropped metrics, obviously, that’s in the case where Telegraf’s collecting metrics. It’s not able to flush those metrics, and they accumulate in Telegraf until the memory buffer is full. And then Telegraf will start dropping metrics. So obviously, that’s something that we would want to keep track of. And then, making sure that metrics gathered and metrics written matches up so that we’re writing roughly as many metrics as we’ve gathered. Those are all things you might want to alert on there. So under normal operation, obviously, gather errors and dropped metrics, you shouldn’t see any of those. That all indicates problems. A sample query that you might want to run on this is max difference gather errors, so figuring out how many new gather errors there were in the last collection interval. You could also use derivative from that.
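The "difference of gather errors" idea can be sketched in a few lines of Python. This assumes the counter is cumulative and resets to zero when Telegraf restarts; both the reset handling and the function names are assumptions for illustration, not part of Telegraf or InfluxQL.

```python
from typing import List, Sequence


def new_errors_per_interval(samples: Sequence[int]) -> List[int]:
    """Difference between consecutive readings of a cumulative counter.

    Assumption: a reading lower than the previous one means the process
    restarted, so the new reading itself is taken as the increase.
    """
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        deltas.append(curr - prev if curr >= prev else curr)
    return deltas


def should_alert(samples: Sequence[int], threshold: int = 0) -> bool:
    """Alert if any interval saw more than `threshold` new errors."""
    return any(delta > threshold for delta in new_errors_per_interval(samples))


# Five consecutive readings of a gather-errors counter (made-up values).
readings = [0, 0, 2, 2, 5]
print(new_errors_per_interval(readings))
print(should_alert(readings))
```

Since any gather error is abnormal in steady state, a threshold of zero is a reasonable starting point; a noisier input could use a higher one.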
Jack Zampolin 18:01.855 And then there’s an internal gather measurement. This has data about the gather time, so how long it takes for each gather. And then the number of metrics gathered. And this is going to be tagged with the Telegraf host as well as the plugin name. So for me personally, I found this extremely useful in debugging Docker issues. I end up doing a lot of Docker monitoring with Telegraf, and sometimes there’s unpredictable issues with the Docker daemon running on the host. Sometimes it’s overloaded. Maybe I’m running way too many containers because I’m doing something stupid. Or maybe there are permissions issues, gathering some data from some containers—I’ve seen this happen in Kubernetes—and that gather hangs. Being able to alert on that would be very, very nice. And then, obviously, the number of metrics gathered per input plugin, a nice thing to know and be able to track. So for internal gather, that’s for input plugins. Internal write is for output plugins. So buffer limit, buffer size, these are—pardon me. We were talking about the internal Telegraf memory buffer that’s buffering those metrics until it can flush them to the configured outputs. This is keeping track of that on a per-output plugin basis and making sure you’re not overflowing that buffer. Metrics written, obviously, we know what that is. And then write, time, nanoseconds. So keeping track of the performance of your writes, very important. Alerting if your database is taking too long to persist writes, timing out.
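To make the buffer statistics concrete, here is a toy model of a bounded in-memory buffer that tracks its size, limit, written count, and dropped count. It is a sketch only: the class is hypothetical, and the choice to discard the oldest metric when the buffer is full is an assumption rather than a statement about Telegraf's implementation.

```python
from collections import deque
from typing import Callable, Deque, List


class MetricBuffer:
    """Toy bounded buffer tracking size, limit, written, and dropped."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._items: Deque[str] = deque()
        self.dropped = 0
        self.written = 0

    @property
    def size(self) -> int:
        return len(self._items)

    def add(self, metric: str) -> None:
        if len(self._items) == self.limit:
            self._items.popleft()  # assumption: the oldest metric is discarded
            self.dropped += 1
        self._items.append(metric)

    def flush(self, write: Callable[[List[str]], bool]) -> None:
        """Try to write everything; keep the metrics if the write fails."""
        batch = list(self._items)
        if write(batch):
            self.written += len(batch)
            self._items.clear()


buf = MetricBuffer(limit=3)
for i in range(5):
    buf.add(f"m{i}")
    buf.flush(lambda batch: False)  # simulate an unreachable output

print(buf.size, buf.dropped, buf.written)
```

An alert on the ratio of size to limit fires before anything is lost, while an alert on the dropped count fires only after the fact, so watching both is useful.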
Jack Zampolin 20:11.347 The one I do want to mention here is metrics filtered. Telegraf has a number of options, tag pass, tag drop, field pass, field drop, things like that, that allow you to filter out measurements based on certain rules. This will tell you how many metrics get filtered for each output plugin. So if you, say, have different filters for each output, it is easy to keep track of that. Any questions? I’m just going to pause here real quick.
Jack Zampolin 21:00.556 Okay. So internal mem-stats, this is another measurement that that internal Telegraf plugin gives you. So it’s one of the options within the internal plugin [inaudible] stats. This is turned on by default and this gives you information about Telegraf’s memory usage. This is useful for performance monitoring. I wouldn’t necessarily have it turned on in your production instances, but if you’re looking to benchmark how much memory Telegraf takes up in different use cases, this is very useful. Heap GC data, amount of memory allocated to Telegraf, all kinds of fun stuff like that all in there. And again, this is useful for—let’s say you’ve got 1000 hosts and you’re going to run Telegraf on each one of them, you want to figure out exactly what that overhead is that Telegraf is taking up. This is going to give you a lot of that data. All right, and our final topic today is running Telegraf in Docker. This is an area where I do see a lot of people having issues. I’m going to go over this pretty thoroughly and then walk through it in the terminal. If you have any questions, please ask. And if you have any specific questions, please ask those as well and I’ll try to answer them in full at the end. If they’re shorter, I’ll answer them mid-stream as well.
Jack Zampolin 22:38.349 So what are the challenges of running Telegraf in Docker? As you can imagine, for something that’s designed to gather metrics from the host that it’s running on, if you’re running it in a virtual host, there’s some trickery that you’re going to need to do to get all the data from the actual host. In Docker, the best practice is to run one process per container. So you’re not going to be, say, running a Rails app in a Docker container and then just installing Telegraf inside that same container. That’s not Dockery; that’s not the right way to do it. The right way to do it is to run Telegraf in its own container on that host, mount some host volumes, and give that container a little bit of context so that it can monitor everything else on that host. So the challenges in this environment are, first, that Telegraf requires access to some files on the host to collect those system stats. We also want to monitor the Docker daemon on the host, so pulling those Docker statistics from the Docker API: memory, CPU, image stats, how large the images are, how many images you have. There’s a lot of great information at that API and we do need to gather that. Sometimes that can be difficult to get to from within the container. And then whenever you set up a Docker container, Docker will give it an arbitrary host name. Keeping the real host name retains context, so you’re going to want to make sure you get that host name inside the Telegraf container.
Jack Zampolin 24:16.211 You’re also going to need to configure Telegraf. What is the best way to do that? Mount the config into the container. I see some people pre-baking their Docker images with a Telegraf config in there. I personally don’t think that’s the right way to do it. It makes those images much more difficult to change and much more static. Config files are very small, easy to shoot around, and much easier to replace than a full Docker image. So just mounting your config file and being able to change that dynamically is definitely the preferred method. And I’ll talk about that in a bit. So as far as environmental variables, there are a couple you need to [inaudible] set: HOST_PROC and HOST_SYS. We will be mounting the host proc directory and the host sys directory into our container. And we’re going to put them in a place called rootfs/proc. If we put them in the proc directory, we would override a bunch of our container files. We wouldn’t get accurate statistics for our container CPU from the Docker daemon. And there would be a lot of other issues because Linux [laughter], you don’t want to mess with it too much. So in order to tell Telegraf and the Go runtime the new location for those files, we need to set HOST_PROC and HOST_SYS. Another thing we need to do is assign a hostname. So here I’ve just shown setting the hostname environmental variable explicitly; you’re probably not going to want to do that. In every different containerized environment that you’re in, you’re going to be able to get that hostname from the host a different way. In Docker or Docker Swarm, maybe you have an etc hostname file on each…
Jack Zampolin 26:24.327 Are used from into environmental variables. It’s called the downward API. And that’s what I’m doing in the Kubernetes stuff: I use the downward API to set this host name. And again, that’s extremely important for maintaining context.
Chris Churilo 26:43.622 Hey, Jack, I’m sorry to say that your audio cut out a little bit. Can you just go over that flag one more time?
Jack Zampolin 26:52.979 Yes, the environmental variable form? Can you hear me?
Chris Churilo 26:56.215 Yep.
Jack Zampolin 26:56.719 Okay, good. Sorry about that. So in Docker, you’re going to need to set some environmental variables. We’re going to be mounting the proc and the sys directories, so /proc and /sys, from the host into the container. We don’t want to override the container’s own proc and sys directories. So we’re going to mount them in a folder called rootfs to avoid those conflicts. And we need to tell Telegraf and the Go runtime, when they’re looking for these files, the right place to look. So instead of pulling the container data from the container operating system at /sys or /proc, we’re pulling it out of /rootfs/proc and /rootfs/sys. So those two environmental variables, you do need to be sure you set. The other thing we need to do is assign a host name to the container. Docker will automatically assign that container a host name. And in fact, you can set that host name with Docker itself. So just giving the container a name will set that host name. That’s one way to do it. Personally, I like to use the host name of the underlying host. You can maybe pull it out of the etc hostname file. That is one way I’ve seen some folks do it. In Kubernetes and some of these other container orchestrators, you have a way to set environmental variables via an API. So in Kubernetes, you can use the downward API to set this to be the node name. That’s what I’ve done in the Kubernetes integration that we do. But just be sure you set the host name variable in that container to something that’s identifiable.
Jack Zampolin 28:51.903 The next thing we need to do is mount some volumes. So the following volumes we need to mount from the host to the container var/run/utmp and then the sys and then as I said, we’re mounting that into rootfs. This is where most of the CPU memory kernel statistics come from. And then there’s the proc directory. This is going to give you some of your user information and process information. So different plugins need each of those. But the most common system monitoring plugins you’re just going to want to go ahead and mount both of those directories. The other thing that we’re going to want to mount is the Docker socket. So that’s at var/run/docker.sock. And you’re just going to mount that in the exact same place in the Telegraf container. And again, this is so that you can get those Docker statistics from the host. And then the last thing that you’re going to need to do is mount your config file. Here we’re just going to set that config file at etc/telegraf/telegraf.conf. And mounting that whole Telegraf directory into the container will bring that config file. Obviously, this is going to depend on what you’re using for your underlying configuration management as to where those mount points are. But mounting it into the container at etc/telegraf is the right way to do it. So now that we’ve done that, let me just go through and peek at some configuration here. So in the Docker case, if we’ve got stuff like inputs CPU, inputs Docker, inputs internal, inputs kernel, those kinds of things, we’re going to need to mount all those directories. I have a running instance of this.
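Pulling the environment variables and mounts together, a sketch of the resulting `docker run` invocation might look like the following. It is built in Python purely so that each piece is listed explicitly. The `:ro` read-only suffixes, the container name, and the example host directory `/opt/telegraf` are assumptions that do not come from the talk; the paths and variable names follow what is described above.

```python
import shlex
from typing import List


def telegraf_docker_args(hostname: str, config_dir: str,
                         image: str = "telegraf") -> List[str]:
    """Assemble a docker run command for a host-monitoring Telegraf."""
    return [
        "docker", "run", "-d", "--name", "telegraf",
        # Tell Telegraf where the host's /proc and /sys are mounted.
        "-e", "HOST_PROC=/rootfs/proc",
        "-e", "HOST_SYS=/rootfs/sys",
        # Use the underlying host's name, not Docker's generated one.
        "-e", f"HOSTNAME={hostname}",
        # Host filesystems go under /rootfs so the container's own
        # /proc and /sys are left untouched.
        "-v", "/proc:/rootfs/proc:ro",
        "-v", "/sys:/rootfs/sys:ro",
        "-v", "/var/run/utmp:/var/run/utmp:ro",
        # Docker socket at the same path, for the Docker input.
        "-v", "/var/run/docker.sock:/var/run/docker.sock",
        # Directory containing telegraf.conf, mounted rather than baked in.
        "-v", f"{config_dir}:/etc/telegraf:ro",
        image,
    ]


print(shlex.join(telegraf_docker_args("node-1", "/opt/telegraf")))
```

Keeping the config in a mounted directory means a configuration change needs only a container restart, not an image rebuild.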
Jack Zampolin 30:59.457 I have a running cluster with Telegraf doing some Docker monitoring. I’ve just gone ahead and logged into one of those. It’s on Google Compute. So if we look at the environment, we can see that I’ve set the HOST_SYS and HOST_PROC environmental variables as well as the host name. And this is Kubernetes; I got that from the downward API. If we look at /etc/telegraf, you can see I’ve also given that host name here. If you are setting that hostname environment variable, Telegraf will use os.Hostname to pull it. I like to just set it there. I’ve had some issues with that in the past and this makes sure that that’s not the case. And you can see we’re just pulling from /var/run/docker.sock, and we’ve got all of those standard inputs enabled. Okay. There are a few questions here and I’ll go ahead and answer those. Is it compulsory? So Shrinivas, hello. InfluxDB has an internal DB. Is it the same or different? So the underscore internal DB in Influx contains data about the InfluxDB process. This internal plugin will write data about the Telegraf process. So different processes, different internal statistics, but same idea. Does that make sense?
Chris Churilo 33:06.087 I’m actually going to let Shrinivas talk.
Jack Zampolin 33:09.824 Okay.
Chris Churilo 33:12.204 Because he always has lots of good questions.
Jack Zampolin 33:14.788 Okay. Good. And then the next one, Telegraf config has hostname (What does it do?). I’m using a tag on every node using Ansible to auto populate with the actual host name. So Shrinivas, you saw that—here, let me see—this hostname equals host name Telegraf in the config file. You can actually natively reference environmental variables just by their normal syntax. So putting that host name in there will interpolate that host name out of the environment. So that’s how I’ve done it and that should work for you. Does that help as well?
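The interpolation idea can be illustrated with Python's `string.Template`, which uses the same `$NAME` syntax. This is only an analogy for how an environment variable reference in a config file gets replaced with its value; it is not Telegraf's config loader, and the exact config line shown on screen is not recoverable from the transcript, so the `hostname = "$HOSTNAME"` line below is an assumed example.

```python
from string import Template
from typing import Mapping

# Assumed example of a config fragment that references an environment variable.
config_template = Template('[agent]\n  hostname = "$HOSTNAME"\n')


def render(env: Mapping[str, str]) -> str:
    """Replace $NAME references with values taken from env."""
    return config_template.substitute(env)


# In a container this mapping would typically be os.environ.
print(render({"HOSTNAME": "node-1"}))
```

With this approach one config file can be shipped unchanged to every node, and the per-host value comes from the environment instead of from a templating step in configuration management.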
Shrinivas 33:58.602 Hey, Jack, are you able to hear me?
Jack Zampolin 34:00.283 Yes, absolutely.
Shrinivas 34:01.226 Thank you very much, man. Thank you very much helping me in the forums also.
Jack Zampolin 34:04.170 Yeah, absolutely. Any time, man.
Shrinivas 34:05.828 I appreciate it. It really helps so far. Yes. Currently, I’m using this specific hostname feature because I’m using Ansible to auto-populate on every configuration file. But if that Telegraf natively has Kapacitor available, I just [inaudible] configuration file so that I can remove this empty file in my configuration file, I think. I’ll give it a try. That’s really helpful for me.
Jack Zampolin 34:31.535 Okay. Here, can you see that hostname there?
Shrinivas 34:35.646 Yes, sir. So this hostname is basically you’re talking environmental variable, or you’re talking about every host variable, right? Environmental variable. So which one is populating here?
Jack Zampolin 34:46.047 So this is going to populate the environmental variable. This will force populating the environmental variable.
Shrinivas 34:54.454 Oh, for that Docker environmental variable, not the hostname. Every physical host also has a hostname environmental variable.
Jack Zampolin 35:02.619 So in Docker, in this case, I’m explicitly setting the hostname in the container to the hostname from the physical host.
Shrinivas 35:12.563 Okay. But I’m just wondering because I’m using DaemonSet feature, I cannot search a host name on every container. There’s a single concrete file. I need to take a look into that.
Jack Zampolin 35:21.087 Oh, the DaemonSet feature in Kubernetes?
Shrinivas 35:23.354 Yeah.
Jack Zampolin 35:24.305 So in that case, you would use—here, I’ve actually got an excellent example of that. In that case, you would use the Kubernetes Downward API. You would use the Kubernetes Downward API and just set that host name variable inside the container to spec.nodeName.
Shrinivas 35:53.114 Oh, but again, that’s nodeName probably a random generated Docker hostname, not the actual physical hostname, right?
Jack Zampolin 36:02.309 spec.nodeName is the node name from the kubelet, so the kubelet will pull that from the host, which is presumably the correct one.
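For reference, the downward API wiring described here can be sketched as the `env` entry of a container spec. It is written below as a Python dictionary and printed as JSON so the structure is easy to see. The `valueFrom.fieldRef.fieldPath` structure and `spec.nodeName` are standard Kubernetes fields; the container name, image, and helper function are illustrative assumptions, not the contents of the example shown in the webinar.

```python
import json
from typing import Any, Dict


def hostname_env_from_node_name(var_name: str = "HOSTNAME") -> Dict[str, Any]:
    """Env entry that fills var_name with the node's name via the downward API."""
    return {
        "name": var_name,
        "valueFrom": {"fieldRef": {"fieldPath": "spec.nodeName"}},
    }


# Hypothetical container entry for a DaemonSet pod template.
container = {
    "name": "telegraf",
    "image": "telegraf",
    "env": [
        hostname_env_from_node_name(),
        {"name": "HOST_PROC", "value": "/rootfs/proc"},
        {"name": "HOST_SYS", "value": "/rootfs/sys"},
    ],
}

print(json.dumps(container, indent=2))
```

Because the value is resolved per pod at scheduling time, the same DaemonSet template yields a different hostname on each node without any per-node configuration.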
Shrinivas 36:15.482 Perfect. Can you paste that link in a chat, please?
Jack Zampolin 36:18.598 Yeah, absolutely. And if you’re running in Kubernetes and haven’t checked out the TICK charts repo, I would highly suggest it.
Shrinivas 36:26.912 Thank you very much.
Jack Zampolin 36:27.980 Okay. Next one. Basically, is there a way for Telegraf to automatically [inaudible] hostname. So we talked about that. Not using utmp, is it compulsory? No.
Shrinivas 36:38.125 Yeah. You helped me on the forum because I ran into the same issue because I was unable to monitor the host level files, all the file systems. So you helped me the other day and I was able to export this and run variables from there. I’m not sure I was able to get all the data from the host, but I think you didn’t really give me this utmp so I was wondering whether that is required or not?
Jack Zampolin 36:58.235 Okay. No, there’s not. All right. Mark has another question. Sorry if I missed this, I was only able to join later. What are the benefits, advantages, or arguments for developing a custom plugin for a Telegraf over a custom piece of software that uses the InfluxDB client to inject data points directly into the InfluxDB? So the main benefit of writing a Telegraf plugin is that you don’t actually have to worry about any of the writing to the database stuff. The only thing that you have to worry about is where you’re getting your data from and how you get it into tags fields in a measurement name. And Telegraf will take care of all of the batching, retry logic, common stuff like that, for you. Now, the disadvantage is that you’re going to need to get it merged into Telegraf, or you’re going to need to maintain your own private fork of Telegraf, and both of those things can be difficult. So there are trade-offs there.
Jack Zampolin 38:03.391 My rule is if I’m doing something a little bit more ad hoc, a little bit less repeatable, maybe something that I don’t see a lot of other people doing, I’ve got some standard Go classes that I use to—or not Go classes. I’ve got some Go code that I’ve written to do some of that retry logic with the InfluxDB client, and that’s really easy to do. But if it’s something that’s repeatable that other people might want to do that could be more widely useful, I will write it as a Telegraf plugin and try to get that merged. So does that make sense, Mark? Okay. If you only have a single need for it, i.e. you have a point data source that you need to collect the writing, using one of the client libraries is going to be the right way to go. But the more and more you use Influx, I found that Telegraf is kind of the Swiss army knife in the stack. If there’s some data you need to get from somewhere, it can kind of get in there and pry it out. So thank you very much. Any other questions?
Shrinivas 39:18.711 Yes. I want to give other folks also a chance, but definitely I want to talk to you about my issues with the Docker, man. My InfluxDB is basically crashing every week, and only due to the Docker plugin. And I want to understand what exactly is happening and what tuning is needed. I’m talking to you in the forum. I’m not getting a satisfying answer for what needs to be done to reduce the amount of data flooding into InfluxDB. Currently, I have only 14 nodes in each cluster. We have three Kubernetes clusters, and each cluster is only 14 nodes. We haven’t even been to the live [inaudible]. We are planning next month. So even with this small load, only the InfluxDB is crashing. I was [inaudible] I think about a few—
Jack Zampolin 40:04.711 How many containers are you running on top of each one of those clusters? And how often are you churning them?
Shrinivas 40:10.197 I mean, that I don’t have any control because at the [crosstalk]—
Jack Zampolin 40:13.149 Yeah. Absolutely.
Shrinivas 40:14.583 —compared to the node level, we put the 250 as a max container and anytime can burn. I mean, 250 pods basically in the Kubernetes language. But how many times they’re creating this thing is completely not in my control. There are also deployments happening [inaudible] perspective. But not too much of—at this point of time, only few clients have updated the platform. Also, with that a lot. When they generate the InfluxDB, a [inaudible], Docker itself is taking up about 60 to 80 percent of the space. I even increased the InfluxDB size to 30 GB, and yesterday I again increased to 35 GB, but still my Kubernetes shows that all of the 30 GB is already being used. So I was worrying about the memory [inaudible]. And I even completely excluded all the tags which you gave me on the forum. So I [inaudible] excluding all those tags [inaudible] why I’m getting that much of data?
Jack Zampolin 41:06.406 So it’s collecting a lot of data for each one of those containers. There’s the Docker memory, Docker CPU measurement. It generates a lot of data. If you look at the gross number of points that’s coming out, it is quite a lot. It’s just a standard number for each Docker container over each polling interval. And as you churn those containers it’s going to start creating more series. With the way the index is currently designed, it is difficult to handle those sort of [inaudible] series like that. The next release that we’ll be pushing out in the next week or so—don’t quote me on that, probably two weeks. Within the next two weeks, we’ll have a way to keep just the recently written series of that index in memory. It will really significantly increase performance for exactly the use case you’re talking about. So it’s an issue we’re aware of, and we do have a fix on the way soon. But long story short, the only way to fix that issue is to reduce the amount of data coming to Influx.
Shrinivas 42:25.174 But how to reduce, because I cannot individually disable the containers because of the huge amount of work. [crosstalk]
Jack Zampolin 42:33.266 One way to reduce the amount of data is to turn the sample rate down on Telegraf. So currently the interval is at 10 seconds, move it to 20 seconds. That’s an easy way to reduce the data.
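As a rough illustration of that arithmetic, the sketch below estimates points written per day for one host at different collection intervals. The function name and the figure of 4 points per container per interval are assumptions made for illustration, not numbers from the webinar; the 250 containers is the per-node maximum Shrinivas mentions.

```go
package main

// PointsPerDay estimates how many points one host writes per day, given the
// number of containers, the points emitted per container on each collection
// (an assumed figure), and the collection interval in seconds.
func PointsPerDay(containers, pointsPerContainer, intervalSeconds int) int {
	if intervalSeconds <= 0 {
		return 0
	}
	const secondsPerDay = 24 * 60 * 60
	collections := secondsPerDay / intervalSeconds
	return containers * pointsPerContainer * collections
}
```

Under those assumptions, `PointsPerDay(250, 4, 10)` gives 8,640,000 points and `PointsPerDay(250, 4, 60)` gives 1,440,000. Note that a longer interval lowers the number of points but not the number of distinct series, so it does not by itself shrink an index that grows with container churn.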
Shrinivas 42:45.435 Oh, I increased to one minute. I know that. I increased to one minute actually because I don’t need the 10-second metrics, so I increased to 60 seconds, but still I’m seeing the same issue.
Jack Zampolin 42:57.750 Yeah, I don’t know. Let me peek over at that. Community post.
Shrinivas 43:12.084 So that is my top concern. Every week it’s going down. Because before I go for the Enterprise edition, I want to just make sure that there is a stable platform at least too. [inaudible] all the benefits [inaudible]. That’s one concern for me.
Jack Zampolin 43:25.176 Yeah. So from what I’m hearing from you and from what we’ve talked about on the forum, it is just a lot of these tags being pulled in. You’ve excluded most of them, but that container ID is still generating a lot of uniqueness. The answer that I have for you is the high series cardinality index that we’ve entered for the next release will fix these issues. It was designed for exactly this use case, container IDs coming and going, and just keeping that most recent data in memory to optimize performance. So the answer is right now, if you wanted to really prevent that issue, you would need to scale out. You would need to add more hosts to split that in-memory index over more hosts.
Shrinivas 44:25.053 [crosstalk] InfluxDB, right?
Jack Zampolin 44:26.363 Yes.
Shrinivas 44:28.392 But that’s not possible without the Enterprise cluster support, right?
Jack Zampolin 44:31.515 Yeah. And then with the next release, it’s coming in two weeks. We’ve got the time series index, and let me drop a link about that in the notes here. That will really reduce resource consumption for uses like this and prevent your instance from crashing, essentially. The way we have the index structured right now, every single tag is in memory as a series key. So every unique combination of measurement and tag set is held as a string in memory, and that does take up quite a bit of space. And especially with these Docker container IDs being very long, these strings end up getting very long and it does take up quite a bit of space in memory. The work is to put that index down on disk and only keep in memory series that have been written to recently. Those, your living Docker containers, we will return data on very fast. And the old ones, we’ll keep down on disk for historical archive reasons. So short answer, right now we don’t have a great answer for you. The next release should have something that will significantly help this use case.
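To make the series key idea concrete, here is a toy sketch of an in-memory index that stores one string per unique combination of measurement and tag set. It is a simplification for illustration and not InfluxDB's actual implementation: `SeriesKey` and `SeriesIndex` are hypothetical names, the tag names in the usage note are examples, and real series keys also handle escaping of special characters.

```go
package main

import (
	"sort"
	"strings"
)

// SeriesKey builds a canonical string from a measurement name and its tags.
// Tags are sorted by key so the same tag set always produces the same key.
func SeriesKey(measurement string, tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var sb strings.Builder
	sb.WriteString(measurement)
	for _, k := range keys {
		sb.WriteString(",")
		sb.WriteString(k)
		sb.WriteString("=")
		sb.WriteString(tags[k])
	}
	return sb.String()
}

// SeriesIndex is a toy in-memory index: the set of series keys seen so far.
type SeriesIndex map[string]struct{}

// Observe records the series a point belongs to and returns the number of
// distinct series now held in the index.
func (idx SeriesIndex) Observe(measurement string, tags map[string]string) int {
	idx[SeriesKey(measurement, tags)] = struct{}{}
	return len(idx)
}

// Bytes sums the lengths of all keys, a rough lower bound on the memory the
// key strings alone occupy.
func (idx SeriesIndex) Bytes() int {
	total := 0
	for k := range idx {
		total += len(k)
	}
	return total
}
```

Writing the same measurement and tags twice leaves the index at one entry, but every new container ID value adds another key per measurement. With containers coming and going, the set only grows, and long IDs make each entry larger, which matches the memory pressure described above.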
Shrinivas 45:48.897 Perfect. But what is the release? Is it 1.34?
Jack Zampolin 45:52.156 1.3.
Shrinivas 45:53.362 1.3 only. Okay. 1.3. I’m already using 1.3. Probably you’re talking about a minor release, right?
Jack Zampolin 45:59.122 1.3 for InfluxDB.
Shrinivas 46:02.421 Oh, for InfluxDB. Okay, perfect.
Jack Zampolin 46:04.274 Yes. And down in the Q&A, I have—yeah, I just dropped that link down in the Q&A as well.
Shrinivas 46:21.382 Perfect. And one more—
Chris Churilo 46:26.408 Okay—
Shrinivas 46:25.637 And one more follow-up question if you guys have some time. Should I go ahead? Hello?
Chris Churilo 46:39.356 With that, I’ll let you go ahead.
Shrinivas 46:41.850 Okay. So the Telegraf and InfluxDB is basically a very perfect combination in the container world and the Kubernetes world. I think I’m pretty much happy and satisfied with the results and the metrics and accuracy. Do you guys have any roadmap—I know you guys had the Docker plugin and you guys had the Kubernetes plugin, but the Kubernetes is only the kubelet, not the Kubernetes API. So we would like to get the metrics on all of the cluster: how it actually is performing, how it reads and writes, and how is the throughput. Like a Prometheus Kubernetes API, because I don’t want to use Prometheus again because that’s another component of my stack. I want a native solution from you guys for the Kubernetes API to get the cluster metrics, like quotas or everything. So are there any thoughts on that?
Jack Zampolin 47:33.375 Yeah. Absolutely. So Prometheus is written in Go; we’re written in Go. We actually exported their service discovery module and imported it into Kapacitor, so we can natively now do the Prometheus style scraping through Kapacitor. I’ve got a webinar on that. We’ve got demos for it. And that’s going to be the right way to do that kind of thing. Does that make sense?
Shrinivas 47:59.462 Yeah. But I’m not using Kapacitor because it’s so complex for me. I completely eliminated Kapacitor from my stack, so probably need to take a look into that.
Jack Zampolin 48:07.305 Yeah. And that’s how we’ve decided to offer support for that use case within our stack.
Shrinivas 48:12.289 Okay. But you don’t have plans for Kubernetes API or input plugin, right?
Jack Zampolin 48:20.606 So we’re going to do service discovery in the InfluxData stack through Kapacitor. Now, I’ve built Helm Charts for Kapacitor and full tutorials for doing this. It is really not that bad. And this is the way that this is going to be supported through Influx. Now, if you need those Kubernetes API stats, you can always pull them using the Prometheus input on Telegraf as well. There is obviously the issue of discovering those. If you want to do the Prometheus-style service discovery, you’re going to need to use Kapacitor. If not, you can use the Prometheus input plugin in Telegraf.
Shrinivas 49:01.255 Okay. I’m good. Thank you.
Jack Zampolin 49:06.371 Thank you, Trudy.
Chris Churilo 49:08.951 Awesome. So thanks to everybody for joining us today. If you have any other questions, make sure you drop them into community where Jack and the team can get to them. And we hope to see you at next week’s webinar. We’re actually going to go into some advanced topics with Kapacitor, specifically on alert handlers. So thanks everybody for your participation and we’ll see you next week.
Shrinivas 49:34.323 Thank you very much guys.