How Not to Fail at Data Visualization by Using Icinga and InfluxDB
Webinar Date: 2020-08-04 08:00:00 (Pacific Time)
Icinga provides an out-of-the-box integration for InfluxDB. It allows you to leverage the data that is collected for monitoring purposes and store the metrics in your time series database. While collecting data that way is pretty easy, proper visualization requires more attention. Misinterpretation of data is one of the most common causes of wrong conclusions. It makes us hunt ghosts during debugging sessions. There are many common pitfalls which we can avoid if we follow some rules. In this webinar, we will show some of the most common mistakes in visualizations and how to avoid them.
Here is an unedited transcript of the webinar “How Not to Fail at Data Visualization by Using Icinga and InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Blerim Sheqa: CPO, Icinga
Caitlin Croft: 00:00:04.581 Welcome to today’s webinar. My name is Caitlin Croft. I’m super-excited today to have Blerim who’s the chief product officer at Icinga here to present on how not to fail at data visualization by using Icinga and InfluxDB. And before I hand it off to him, I just want to give everyone a friendly reminder. Please feel free to use the chat feature or the Q&A feature in Zoom to post any questions that you may have for Blerim and we will answer them at the end. So without further ado, I’m going to hand it off to Blerim.
Blerim Sheqa: 00:00:47.327 Thank you, Caitlin. And welcome, everyone. Thanks for having me. I’m super-excited to do this today. And just to showcase what I have learned regarding data visualization with you, so hopefully it will help you as well. So who am I, and what do I do? My name is Blerim, and I’m working for Icinga. And I started my career initially as a systems engineer. I’ve been working in that field for a couple of years. So from that perspective, I kind of know how it works from the user perspective when it comes to monitoring metrics and visualizations and all of these things. I moved to development first and created a couple of integrations for Icinga back then, before I transitioned to product and finally becoming a product manager for Icinga. Right now, I take care of the whole product line of Icinga which includes different daemons, modules, and keeping everything together so we move forward in a decent way. And I also take care about our partnerships. So we have a global network where we work together with different companies. You can reach me out on Twitter and of course via email if you want to follow up or have any questions that come into your mind after this webinar.
Blerim Sheqa: 00:02:25.239 Before I get started with visualizations and InfluxDB in combination with Icinga, I just want to give a brief introduction into what Icinga is, who we are, and what we do. So Icinga is a monitoring solution. We, the Icinga company, are based in Germany in Nuremberg with the same name like our product. So Icinga monitoring is something that we usually call the Icinga stack because it consists of multiple components that work together to monitor your servers, your applications, your networks, and whatever you have in your data center, be it virtual or an actual data center with physical devices in there. So one basic thing that Icinga does is monitoring the availability and the utilization of everything that you have in your infrastructure. So starting from your physical servers to your virtual machines to your public or private cloud, whatever you have, you can monitor it with Icinga. It includes a couple of automation things. So it has a very dynamic configuration language where you apply rules instead of creating static configurations. So the Icinga way of monitoring would be to create rules like monitor the SSH service on each Linux machine, or monitor the MySQL database on every database machine. So we have this rule-based approach. Of course, we also provide modules and cookbooks and roles and however they are called in different config management tools. We have a pretty good API, which we also use for our own products. But it’s a public API from Icinga which can be used in your custom tools as well. And we also provide automation mechanisms to import data from your existing databases periodically. So when using Icinga, you can use for example, your existing CMDB and you don’t have to re-define all your infrastructure in your monitoring tool. It will keep up to date also automatically.
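The rule-based approach described above can be illustrated with a small sketch. Icinga uses its own configuration language for this; the Python below is not that language and only mimics the idea of matching rules against host attributes. The host names, attribute names, and rule set are all hypothetical.

```python
# Hypothetical inventory; in practice this could come from an imported CMDB.
hosts = [
    {"name": "web01", "os": "Linux", "roles": ["web"]},
    {"name": "db01", "os": "Linux", "roles": ["database"]},
    {"name": "win01", "os": "Windows", "roles": []},
]

# Each rule is a service name plus a condition on the host attributes.
rules = [
    ("ssh", lambda host: host["os"] == "Linux"),
    ("mysql", lambda host: "database" in host["roles"]),
]


def apply_rules(hosts, rules):
    """Return the services that each host receives from the matching rules."""
    return {
        host["name"]: [service for service, matches in rules if matches(host)]
        for host in hosts
    }


monitored = apply_rules(hosts, rules)
print(monitored)
```

Adding a new Linux host to the inventory would give it the SSH check without touching the rules, which is the point of defining rules instead of static per-host configuration.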
Blerim Sheqa: 00:05:05.453 At Icinga, we also integrate with other tools, and this is where log management and metrics comes into play. So for log management, Icinga can be integrated with the Elastic stack in different ways, but also with Graylog. So there is an Icinga [inaudible]. There’s a Logstash output. There’s an Elasticsearch writer that writes the performance data. Regarding metrics, Icinga can be integrated and combined with different tools like Graphite and Grafana. But also, most importantly, today with InfluxDB. So the integration for InfluxDB was initially created from one of our community members. So huge shout out to Simon for creating that initial InfluxDB support. It’s been there for, I think, five or six years now. And it’s been working great. So just to give you a quick look how Icinga would look like in your infrastructure, this is the basic web interface of Icinga. So we keep an eye on displaying it clearly what is happening in your infrastructure right now. We believe that a simple view will help you more than sophisticated dashboards. So you can see clearly what is broken before you dig deeper and drive your analysis on your outages and try to figure out what the root cause is.
Blerim Sheqa: 00:06:44.060 To give an overview over the key aspects of Icinga. So I mentioned it already. Icinga is very dynamic. So flexibility is one of our most important things. We have a pretty sophisticated configuration language. So you can create your rules that are highly dynamic that will extend your monitoring automatically, but we also have some other automation features included like importing and configuration management tools. Icinga has a built-in clustering mechanism. So scalability is another key aspect of Icinga because Icinga usually is used by large enterprises with lots of devices. So we need to make sure that it actually scales to multiple locations through the globe. So we have a built-in cluster, which you can use to create high available setups, but also to distribute your load across multiple Icinga nodes, connect nodes from different locations and have a single overview over everything. And the third key aspect of Icinga is that it’s extendable. So we made sure when we initially drafted Icinga 2 that everything that we create is open and can be used by other developers as well. We created some of the integrations by ourselves. Other integrations are contributed by our community members. So extending Icinga, and building your custom monitoring tool, is one of the goals that we have with Icinga, so it can be integrated with your ticketing system, with your log management, with your time series database. So our goal is that you shouldn’t have to replace your existing tools only because you want to put another monitoring tool on top of your stack. And this is where Icinga with InfluxDB comes in place. So looking at it from the top, it works like this that Icinga collects the information of your operating systems independently of what kind of operating system you’re using. 
It collects the information from your network resources independently from which hardware vendor you’re using, but also from your applications, be it web servers or databases or your custom written applications that run on multiple servers, Icinga is the place that collects all of this data. And in the first place, it tells you if the state of these things and the utilization of these applications, servers, and network hardware is in the state that you want it to be. And it will alert you, of course, if something does not match the states that you define as not problematic. In the same run when Icinga collects the information about the state, it also collects some metrics about your applications and your servers and everything. So it’s not separated in any way. It’s one execution. It will check the state of your application. And it will, at the same time, also collect the metrics for those applications, for example. Icinga will not store the metrics. Of course, we will store the history of the states that you have, but we do not store the metrics because we believe that there are so many fantastic time series databases out there that there is not really a need for us to create one more time series database. So we go the way that we want to integrate with other time series databases, in this case with InfluxDB. So there’s a native integration in Icinga that pushes all of these collected metrics directly into InfluxDB, and it can be started and enabled pretty easy. It’s just one CLI command. And depending on your configuration, there are also some other things that you can do with that integration. So we leverage the HTTP API of InfluxDB, which also enabled us to make sure that we can use authorization of InfluxDB, but also the SSL encryption if required.
Blerim Sheqa: 00:11:36.649 When using the InfluxDB writer in Icinga, you can fully configure the names and tags that you want to use. So Icinga comes with a default configuration for that. But if that’s something that you do not want to use, then if you need to add more information to it like the location of the single part of your infrastructure, the environment if it’s production or testing or even add data about the operating system or whatever else you have on information in Icinga, you can send those information as tags directly to InfluxDB, which makes it easier later when visualizing the data because it’s easier for you to actually filter on the things that are important to you. One thing that is special about the integration with InfluxDB is that you can also enable to send the thresholds to your time series database. So for example, if you’re checking for if a host is reachable, you can also add the data besides of the latency, what the ping takes, for example. You can also add the data about what are my thresholds that I set at which point this check is critical or in a warning state. This is nice if you later visualize these metrics, you can see in your graphs directly, the current value of something, but you can also see the thresholds that you configured in your Icinga, and you can immediately recognize if thresholds have been exceeded at some point. Additionally to that, you can also add some metadata of your checks. For example, state changes, if your service check changed from an okay state into a problematic state, you will see it in your graphs immediately as well. The execution time of something or the latency of a check, this is interesting when especially monitoring the monitor. So what you can do with that is monitoring your Icinga monitor by using InfluxDB to see if there are any checks that have a high execution time where you need some optimizations in perhaps your custom scripts or you need to exchange the plugins that you’re using. 
Some other key aspects for the InfluxDB integrations are that it has a high availability mode included. So this means that you can enable the InfluxDB writer on multiple nodes of Icinga, and it will automatically take care that a second node will take over if the first Icinga node fails for whatever reason. So in this case, you can make sure that you don’t lose any data even if one of your Icinga nodes breaks.
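As a rough illustration of what tags and thresholds look like on the wire, here is a minimal sketch that formats one check result as an InfluxDB line protocol record. It is not Icinga's implementation. The measurement name, the tag names (`hostname`, `metric`, `location`), and the field names (`value`, `warn`, `crit`) are assumptions chosen for the example; in Icinga the templates are configurable.

```python
def escape(text: str, special: str) -> str:
    """Backslash-escape the characters that are special in line protocol."""
    for ch in special:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement: str, tags: dict, fields: dict, timestamp_s: int) -> str:
    """Format one point as: measurement,tag=value field=value timestamp."""
    # Measurements escape commas and spaces; tags also escape equals signs.
    tag_part = "".join(
        f",{escape(key, ',= ')}={escape(value, ',= ')}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape(key, ',= ')}={float(value)}" for key, value in fields.items()
    )
    # The timestamp is in seconds, which assumes the write request asks for
    # second precision; the line protocol default is nanoseconds.
    return f"{escape(measurement, ', ')}{tag_part} {field_part} {timestamp_s}"


line = to_line(
    measurement="ping4",
    tags={"hostname": "web01", "metric": "rta", "location": "eu west"},
    fields={"value": 0.012, "warn": 3.0, "crit": 5.0},  # thresholds as fields
    timestamp_s=1596553200,
)
print(line)
```

Because the thresholds travel with each point as extra fields, a graph can draw the warning and critical lines next to the measured value, and the extra `location` tag can be used for filtering later.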
Blerim Sheqa: 00:14:43.034 I mentioned already, we support authentication and SSL, but we also have a buffering mechanism included in the InfluxDB feature. Of course, the buffering will not take so much time that you could stop your InfluxDB server for days. But at least it will help you when you restart anything on your InfluxDB side. So Icinga will take care that the data in the meantime, while your InfluxDB is down, is not lost, but buffered on the Icinga node. How large this buffer can be highly depends on your server specs, of course. In the end, once configured, so the basic configuration is fairly easy because you just configure your InfluxDB host. If it’s local, then you can just use the defaults that we provide. You create a database on your InfluxDB that you use for data coming from Icinga. If required, you add a user and the password for that database in your InfluxDB and then you’re good to go basically, and in your Chronograf (this is just a sample picture I took) you will have the data immediately available for filtering and for creating your graphs and dashboards. So here, this is a pretty simple graph, where I just filter for the load of one system. In addition to that, you could, of course, do much more complex things like filtering for servers in a specific location or filtering for operating systems, host groups, or whatever else you have included in your tags that are stored in InfluxDB.
Blerim Sheqa: 00:16:46.115 There are some common things that you should take care of when using InfluxDB with Icinga. And one of them is data retention. So by default, Icinga will write all that data to InfluxDB and InfluxDB will just keep it forever. Maybe not forever, because at some point your disk will be full. But besides that, by default the data is kept forever. And yeah, so data retention is definitely a topic that you have to keep in mind when using this integration. So the first thing that you would need is a retention policy. So throwing away data that is older than a year, for example, or than three years, depending on your requirements, totally makes sense here. But it also makes sense to have continuous queries for downsampling the data, because InfluxDB will keep the raw data as long as possible, as long as your retention policy allows you to keep it. To have less data and use less space on your disks, it’s recommended to add some continuous queries within your InfluxDB that select the raw data and downsample it to a lower frequency so that you use less space on your time series database. We have pretty nice documentation from one of our community members in our community forum about this topic, where it’s explained in detail how retention policies and continuous queries can work together with Icinga and InfluxDB to have the best results and to still keep your data as long as you need it.
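To make the downsampling idea concrete, here is a minimal sketch of the arithmetic involved: raw points are grouped into fixed time buckets and each bucket is replaced by its mean. In InfluxDB 1.x a continuous query that groups by time and selects a mean performs this kind of aggregation inside the database; the Python below only illustrates the calculation, and the sample data and the 5-minute bucket size are made up.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) pairs into fixed-size time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its bucket.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical raw samples: (seconds since start, load value).
raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 4.0), (360, 6.0)]
downsampled = downsample(raw, bucket_seconds=300)
print(downsampled)
```

Five raw points become two stored points here. The trade-off is that short spikes inside a bucket are flattened, which is worth keeping in mind when reading graphs built on downsampled data.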
Blerim Sheqa: 00:18:54.025 Another thing that you need to keep in mind is the high availability mode that I mentioned earlier. So by default, the high availability mode is disabled. So if you only have one Icinga node, then you can leave everything like it is, by default. At the time where you add a second Icinga node to your setup, you need to take care of the high availability mode because by default, it is set to false which means both nodes, so both Icinga nodes will write data into InfluxDB. Enabling the high availability mode changes that behavior where only one of the two nodes will write actively into the InfluxDB database. But in case that this node fails for some reason, the second node will take over automatically.
Blerim Sheqa: 00:19:56.738 Next, I would like to talk about some common pitfalls when it comes to data visualization. So if you follow these steps that I went through here, you would have an Icinga node that monitors all of your infrastructure and the metrics are stored in your InfluxDB. And afterwards, you will use either your Chronograf or Grafana, or whatever other tools are out there to visualize your data. So writing that data into the Influx database is the easiest part, I would say. Visualizing this data can be pretty tricky, not because of the kind of data we have, but just because of some common things that sometimes you just don’t have in mind. And I would like to start with something that I found on the internet, and it’s marked as one of the worst visualizations out there. Here you can see a graph that displays the gun deaths in Florida over time, from the ’90s to the late 2010s. And according to this graph, if you would just look at it like it is, you would probably think that in the mid-2000s, in 2005, during the enhanced stand-your-ground laws, the gun deaths decreased in Florida and that the laws actually helped to decrease these numbers. But if you take a closer look at the y-axis here, you can recognize that they actually just flipped it around. So the minimum value is at the top and the maximum value here is at the bottom. So, yeah, the gun deaths are not actually decreasing, but they are increasing, which you cannot see at first sight here. So this is something that is done intentionally. Obviously, these kinds of techniques are usually used by political parties and by marketers to make a statement that may not fit the actual numbers. So you will usually not see these in your technical dashboards, because these kinds of graphs are, yeah, I would say wrong by intention. 
And, yeah, we as operators, we usually don’t want to create graphs wrong by intention, but we sometimes create graphs that allow wrong conclusions without any intentions. And there are some things that we can take care of to prevent that, to help ourselves create better graphs, but also help our co-workers to understand the graphs that we created.
Blerim Sheqa: 00:23:12.997 And the first thing, the first rule, I would call it maybe, is to stick with conventions. Conventions in graphing are usually those boring things that you always have set by default, for example, in Chronograf or Grafana. And sometimes we don’t apply those conventions because either we don’t know about it or either it looks nicer if we do it differently, and it’s more appealing to our eyes, and it looks nicer on our TV screen on the wall, but it totally makes sense to keep up with the conventions. I will go through different, yeah, recommendations, rules by showing some examples and then showing also how it can be done better.
Blerim Sheqa: 00:24:06.839 So my first example regarding conventions is the following graph, which I will briefly analyze here. So this is a graph about the load of a server, a very common thing that most of you probably have in your dashboard somewhere. And in this graph, you would come either to the conclusion that the load decreased at some point, or you would come to the conclusion that it’s actually not very high because everything that happens is somewhere in the center of the graph. No very big changes happen, so it seems like everything is okay here. The load changes over time, but if you have a closer look at the y-axis here again, then you would probably see that the numbers start pretty high here. They start at 85 and go up to 120. So this does not allow us to actually conclude what is happening here because it gives us the impression that the load is behaving pretty normally because there are no large changes. Nor do the numbers look very high, because everything happens somewhere in the center of the graph. So the first convention you should follow is definitely setting proper minimum and maximum values because they help to actually understand the actual value of, in this case, load. So setting the minimum value to zero, how it would be properly done, gives us a very different picture of the whole graph because now we see that the load is not somewhere in the middle. In fact, it’s pretty high during the time, and it’s also not decreasing so much. It’s staying up there for a long time. So setting the proper minimum and maximum values helps you understand what the actual value of your graphs is. 
And it also makes sure that your co-workers who did not create those graphs also understand at first sight what is happening here, because in the end, creating graphs, of course, is important for debugging sessions, but when you don’t see the changes at first sight, it becomes pretty hard to conclude what is happening there.
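The effect of the y-axis minimum can be shown with a little arithmetic. The sketch below compares where a value lands on the axis when the range is auto-scaled to the data versus anchored at zero. The load values are invented to match the 85 to 120 range mentioned above, and the helper names are hypothetical; real tools such as Chronograf or Grafana expose this as an axis setting rather than as code.

```python
def autoscaled_limits(values):
    """Axis range that hugs the data, as many tools do by default."""
    return (min(values), max(values))


def zero_based_limits(values):
    """Axis range anchored at zero so heights reflect absolute values."""
    return (0.0, max(values))


def relative_height(value, limits):
    """Position of a value on the axis: 0.0 is the bottom, 1.0 is the top."""
    low, high = limits
    return (value - low) / (high - low)


load = [85.0, 95.0, 120.0, 100.0, 90.0]

# With auto-scaling, a load of 90 sits near the bottom and looks harmless.
print(round(relative_height(90.0, autoscaled_limits(load)), 2))
# With a zero baseline, the same value sits three quarters of the way up.
print(round(relative_height(90.0, zero_based_limits(load)), 2))
```

The data is identical in both cases; only the axis range changes how alarming the line looks.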
Blerim Sheqa: 00:26:59.028 The next thing I would like to talk about is the comparability of graphs, so comparing values to each other. And the first example I want to show is a graph that displays the memory usage of a host. This is something that is very commonly used. To give a short analysis of this graph: what we see here is the memory utilization of a host, and the values are stacked. So the sum at the top is the total memory available. But this kind of stacking, how it’s used here, is kind of useless. So the first wrong thing here is that you cannot actually see how much memory you have left. Usually, you would look for the largest portion of that stacked graph, which is wrong in this case, because the free space, in fact, is the light blue line. And you also cannot see at first sight that the free memory is actually getting lower, because usually, as humans, we interpret decreasing numbers as something good. So for most people, in most cases, lower numbers are better than higher numbers when it comes to graphs in our infrastructure environment. But in this case, we cannot actually see how much memory we have left. Is there enough space left? Is it decreasing? Is it increasing? So we’re perhaps having a hard time figuring out what is actually happening here. It’s not impossible, of course, because after a couple of seconds, at the latest after one minute, you would figure out what is happening here. But you would have to take a very close look. So the better approach for that example would be to still stack those values, but do it in a different way. So what I did differently here is that I put the free memory into the background because, in a stacked graph, it’s some background information that we need. But the more important information is the actually used memory space. 
So everything regarding buffered, cached, and used space is in the foreground, and we can clearly see that memory consumption is increasing, and we can also clearly see how much space is left. We can see a trend here, and we can also see at first sight, without thinking a lot and without having a closer look, within seconds, that there is enough free space left and that there are approximately 8 gigabytes of total space here.
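One way to read the improved stacking is that the order of the layers decides which boundary the eye follows. The sketch below computes the upper edge of each layer for a chosen order. The memory figures (in GiB, summing to 8) are invented, and placing free memory last is my interpretation of the description above, not a reproduction of the slide.

```python
def stack_layers(series, order):
    """Return the cumulative upper edge of each layer, from bottom to top."""
    length = len(next(iter(series.values())))
    edge = [0.0] * length
    layers = {}
    for name in order:
        # Each layer sits on top of the previous layer's upper edge.
        edge = [below + value for below, value in zip(edge, series[name])]
        layers[name] = edge
    return layers


# Hypothetical memory samples in GiB at three points in time.
memory = {
    "used": [2.0, 3.0, 4.0],
    "buffered": [0.5, 0.5, 0.5],
    "cached": [1.0, 1.0, 1.5],
    "free": [4.5, 3.5, 2.0],
}

layers = stack_layers(memory, order=["used", "buffered", "cached", "free"])
print(layers["cached"])  # top edge of everything that is consumed
print(layers["free"])    # top of the stack, the total memory
```

With consumption at the bottom, the top edge of the `cached` layer rises as memory fills up, and the gap between that edge and the flat total line is the free memory.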
Blerim Sheqa: 00:30:23.635 So misleading compared graphs are also something used often by different marketers. So in this example, we have an Apple keynote, where they compare the GPU performance of one iPhone model to different other mobile phones. So usually, it’s a trick. It’s by intention. But in our case, we don’t want to trick ourselves, and we also don’t want to trick our co-workers but sometimes we do even without any intention. And one example when we do it is a graph that does not show anything but the line. It is nice looking. It looks pretty neat. If we put it on a TV screen somewhere, our managers would like it, I assume. But the thing is that you only can see changes here. You do not see how big the changes are, and you also do not see during what time this is changing. In this case, we have requests, for example, HTTP requests, we count them. We see a value. It looks good. But we don’t have an actual value that helps us to figure out what is actually happening there. So what is missing here is something that helps us to understand, “Are those many requests, or are they not? Are they more than before? And what time-frame are we looking at?” So clearly, there’s a y-axis missing and x-axis missing, and also the grid is missing. To improve these kinds of graphs, it’s pretty obvious. By adding a grid, by adding an x-axis, by adding a y-axis, it becomes much clearer how many requests we have during which time. And we also can compare easily with the grid, how it can be compared to our — how we can compare the numbers to each other. And we also can see how much the difference between the most bottom versus the most upper value is. It’s kind of boring. I admit that. But it matches common conventions, and it actually helps you and your co-workers to understand what is happening there with your request in this case.
Blerim Sheqa: 00:33:08.865 The last example regarding comparability is a CPU graph. So what we see here is that there is a peak somewhere around 11, but nothing more. So what happens here is that the CPU data is merged and averaged, and the values that we see help us only to see that there is a difference compared to previously, but we don’t have any details here. And the way to improve these kinds of graphs, where multiple values are merged into one graph, is clearly to separate them properly from each other. So especially in the case where you have multiple CPUs or multiple network devices, multiple anything, it’s really helpful to separate them properly because in this case, as you can see, there’s lots of difference between each CPU. So they are not all behaving in the same way. And it’s clearly visible that the peak is not as high as the first graph suggests. And it also shows us that some of the CPUs are basically just idling. So separation is key when it comes to multiple components that are merged.
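A small numeric sketch shows how much detail an averaged series discards. The per-core utilization values below are invented for illustration and are not taken from the graph in the webinar; the core names are hypothetical as well.

```python
from statistics import mean

# Hypothetical CPU utilization in percent, four samples per core.
per_core = {
    "cpu0": [10.0, 12.0, 95.0, 11.0],
    "cpu1": [5.0, 5.0, 6.0, 5.0],
    "cpu2": [8.0, 9.0, 90.0, 9.0],
    "cpu3": [1.0, 1.0, 1.0, 1.0],
}

# Merging all cores into one line: a single averaged value per sample.
averaged = [mean(sample) for sample in zip(*per_core.values())]
print(averaged)

# Keeping the cores separate shows which ones are busy and which ones idle.
peak_per_core = {name: max(values) for name, values in per_core.items()}
print(peak_per_core)
```

The averaged line shows that something happened at the third sample, but only the per-core view shows that two cores were nearly saturated while the other two stayed idle.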
Blerim Sheqa: 00:34:47.031 To summarize the comparability things, so proper stacking of values in the way that you can read from them and understand what the values actually mean. Adding grids, y-axis and x-axis, even if it looks boring, but in the long-term, it will be more helpful than the nice-looking graphs, and also splitting into multiple graphs where it’s applicable.
Blerim Sheqa: 00:35:17.739 The last thing about data visualization I want to talk about is readability. The first example I have here is, again, a load graph. But it’s different from the first one that we saw. So here, we can see that the load decreases from time to time for a short amount of time. But what I did here is I added some context to the graph. So first of all, it’s not only showing me the load of the last minute, but it’s showing me the load of the last 1, the last 5 and the last 15 minutes. And the graph is also showing me context in the way that I added information from my monitoring system here, because with the annotations that I added, I can clearly see that at the point where the load is decreasing, Dave, in this case, was notified by my monitoring system that the web application is broken. This is information that helps us understand the context of graphs compared to information that we have in our monitoring system, to draw better conclusions and to avoid misinterpretation of the data that we are looking at. So here, I use the annotations just to add some more information, because load graphs usually don’t tell you that the web application is broken. But in this case, you would clearly see within a few seconds that it has something to do with the web application.
Blerim Sheqa: 00:36:58.019 Another example, and this is the last one I have for today, is huge dashboards. So this is a real-world example that I copied from my co-workers. Actually, this is one dashboard, but I had to split it and merge it again, so I can fit it on one screen. There are many different colors, many different shapes in here. The dashboard is about a huge [inaudible] setup. So it’s used to understand how the [inaudible] environment is working. And so it looks pretty nice, but it only helps if you’re really, really trained in using these kinds of dashboards. So if you’re not used to exactly this dashboard, you will just be overwhelmed. So these kinds of things only help those people in your organization that are trained to use exactly this dashboard because their eyes will go immediately to the single places that are important to them. But you will have to train those people to actually use a dashboard, which you can avoid by just splitting it into multiple dashboards, linking dashboards to each other, and avoiding all those tasks of training people to actually use dashboards, because I think that it is not the intention of dashboards to have people trained for that.
Blerim Sheqa: 00:38:37.148 How can you improve that? So the first thing I would recommend is to actually ask yourself, “What questions do you want to answer with your dashboards?” So using single-purpose dashboards, you will need to put more work into creating all those single dashboards and linking them with each other, but in the long term, it will definitely help you to get started faster, especially when there is a problem in your infrastructure. Additionally, of course, it improves the performance of dashboards because those large dashboards usually also take a long time to load, time that you perhaps don’t want to spend when something is broken, and it will decrease the complexity of the overall thing. Infrastructure itself is complex. Your dashboards shouldn’t be. And also add related information like, “What is the state of my application at the time where I look at the graph? Are the thresholds that I configured exceeded? Are there alerts sent out at a certain time?” And yeah, just use the information that you already have in your databases and add it to your dashboards to add some context, to draw better conclusions, and to avoid misinterpretation of the data. And in the end, make it readable for everyone in your organization because at some point the creators of the dashboards won’t be available either, so it should be readable by everyone else in your team.
Blerim Sheqa: 00:40:35.887 One last thing I want to recommend here in this webinar is that having knowledge about your data is, in my opinion, the most important thing when it comes to collecting metrics and also visualizing or especially when it comes to visualizing those data. It’s absolutely necessary that you know what data you are collecting and understand each and every metric that you are collecting so that you can actually use it later for debugging because if you don’t understand what you collect there, you’re going to have a hard time to use the data for debugging anything. So quality over quantity. I believe that it makes more sense to pick data that you actually understand instead of just collecting everything that any tool can give you. This includes Icinga, but also any other monitoring tool or collector out there. You need to learn to trust the data. So actually understanding where it’s coming from, what it means, how your collector works, and also know details about the data regarding at what pace you collect that? How long you store it? If you downsample it, what does that mean to your graphs later? And also answer the questions about who actually needs the data because if you just create a data graveyard, it will be exactly that in the long term.
Blerim Sheqa: 00:42:22.596 In the end, the best dashboards cannot help you to find an issue if you do not understand the underlying data that you collected before. When we think we know the data, but we don’t actually know them, it leads us usually to building wrong graphs, and this again leads us to building wrong dashboards and this again leads us to misinterpretation of the data and, in fact, just wasting our time by hunting ghosts somewhere in our dashboards and graphs. So invest more time in the beginning, and it will help you in the long term. Thank you very much. I will be here available for some questions now.
Caitlin Croft: 00:43:14.935 Thank you. That was a great presentation. I loved your shout-outs to your community members. It’s always fun when you find those developers who just love contributing to the open community. It looks like we already have a few questions. So the first question is: Is Icinga similar to Nagios?
Blerim Sheqa: 00:43:37.164 So in fact, where do I start? It is similar in some way. So initially, when we started the Icinga project in 2009, it was a fork of Nagios. We moved forward in 2012 by rewriting everything and releasing Icinga 2 as the successor. So basically, Icinga 2 is compatible with the plugins of Nagios. But it has a different configuration language which is more flexible, and it has a built-in cluster, and it’s extensible in a different way than Nagios is, but yes we can use and we do use plugins that also can be used by Nagios.
Caitlin Croft: 00:44:30.012 With Icinga buffers, can we intentionally buffer for compression when sending the data to InfluxDB? For instance, time series data will have decent run length compression and maybe a 1-minute buffer of 600 samples, may be a bit larger than sending a 10-second buffer when both are compressed.
Blerim Sheqa: 00:44:58.244 I’m not 100% sure if I understood the question, but basically what you can configure regarding buffering is the flush interval, not independently for each metric. It’s a global setting and also the amount of metrics after which it will be flushed.
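The two global settings mentioned in this answer, a flush interval and a number of buffered metrics after which a flush happens, can be sketched roughly as follows. This is a simplified Python illustration and not Icinga's actual implementation. The class name, parameter names, and default values are assumptions, and unlike a real writer this sketch only checks the interval when a new point arrives instead of running a timer.

```python
import time


class MetricBuffer:
    """Buffer points and hand them to `send` in batches.

    A flush happens when either the number of buffered points reaches
    `flush_threshold` or `flush_interval` seconds have passed since the
    last flush. Both limits apply to all metrics alike.
    """

    def __init__(self, send, flush_interval=10.0, flush_threshold=1024,
                 clock=time.monotonic):
        self._send = send
        self._flush_interval = flush_interval    # illustrative default
        self._flush_threshold = flush_threshold  # illustrative default
        self._clock = clock
        self._points = []
        self._last_flush = clock()

    def add(self, point):
        self._points.append(point)
        too_many = len(self._points) >= self._flush_threshold
        too_old = self._clock() - self._last_flush >= self._flush_interval
        if too_many or too_old:
            self.flush()

    def flush(self):
        if self._points:
            self._send(self._points)
            self._points = []
        self._last_flush = self._clock()


# Demo with a fake clock so the behaviour is deterministic.
sent = []
now = [0.0]
buf = MetricBuffer(sent.append, flush_interval=10.0, flush_threshold=3,
                   clock=lambda: now[0])

buf.add("load=0.4")
buf.add("load=0.5")  # still buffered
buf.add("load=0.6")  # third point reaches the threshold, batch is sent
now[0] = 12.0
buf.add("load=0.7")  # interval has elapsed, batch is sent
```

In the demo the first batch is sent because the count limit is hit, and the second because the interval has passed, which matches the two triggers described in the answer.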
Caitlin Croft: 00:45:20.466 Okay. So Tony asked that question. And Tony, I just allowed you to talk if you want to un-mute yourself and maybe expand upon your question. All right. We’ll move along. But Tony, just let us know if you want us to go back to your question. Can you show — he had a phone call come in. Can you show a demo about comparability — like how to start in the beginning and how to do the comparison?
Blerim Sheqa: 00:46:00.592 I did not get that, sorry.
Caitlin Croft: 00:46:03.571 It’s okay. I guess they’re just more interested in learning a little bit more about comparability and how to start with it, just understanding the process a little bit more.
Blerim Sheqa: 00:46:17.326 Yeah, let me go back to the slides. So it highly depends on the types of graphs that you want to create. So in my examples, I used some very common things like CPU, load, and memory consumption. But in the end, it highly depends on what you actually want to visualize because depending on the data and what you want to see, or what questions you want to answer, you either use techniques like stacking or just a plain graph with y and x axes in a grid or you split the graphs into multiple graphs or you even use a completely different type of graph like a bar chart. But it really highly depends on what you actually want to visualize and what you want to get out of the visualization because even stacking the memory, in some cases, may be not the right thing for your case, because you want to get different information out of it.
Caitlin Croft: 00:47:35.059 And for the person who asked that question, you should probably have my email address and if you want me to connect you with Blerim after the webinar, I’m happy to do so, if there’s a little bit more that you wanted him to go over. What would be the best approach to having graphs according to what everyone could need, for example, a DBA or a sysadmin? So just kind of wondering best practices on making sure that everyone has the graphs that pertain to them, that are most interesting to them.
Blerim Sheqa: 00:48:10.329 Yeah, so in my experience, the thing that makes the most sense is by talking to the people who actually use those applications. So if you’re, for example, monitoring databases, having a call or a meeting with your DBAs and understanding their needs first before you start building anything regarding monitoring is the best approach that you can do. It’s of course time consuming. Sometimes, people don’t take the time to tell you what they need. But in the end, it’s the most helpful thing to try to understand what the users of the applications need to know and what are the states that they think are problematic?
Caitlin Croft: 00:49:07.371 Perfect. Okay. Another question. What would be your recommendation, specifically around InfluxDB, for representing deltas in metrics that fluctuate over time? For example, the requests that you showed.
Blerim Sheqa: 00:49:30.507 Can you repeat that question?
Caitlin Croft: 00:49:34.563 Sure. So I think they’re asking the best way to show changes in your data as it fluctuates over time.
Blerim Sheqa: 00:49:45.623 Oh, got it. So the best way is, of course, setting correct minimums and correct maximums. So even in this graph, you can see that the minimums are wrong. In some cases, it also makes total sense if you have a maximum to also set the maximum to the proper value. So if you know, for example, that your web server will take only a certain amount of requests, then this certain amount should be your maximum. So you can clearly see if you are hitting the top or if you are somewhere on the bottom. What helps here, so if you’re using Icinga in combination with InfluxDB, is sending the thresholds from Icinga directly to InfluxDB as well. In that case, you would have to set the thresholds that you think make sense for your amount of requests in Icinga. So Icinga will alert you once the thresholds are exceeded. And at the same time, you will have that horizontal line in your graph as well where you can see where the maximum is and if you’re getting closer to it.
Caitlin Croft: 00:51:06.252 Perfect. So if anyone has any further questions for Blerim, please feel free to post them in the chat or in the Q&A box. Thank you, Blerim, for answering all those questions. Oh, another question. Is Icinga similar to Telegraf?
Blerim Sheqa: 00:51:26.874 Well, no. So Telegraf is — so yeah, depends on how you compare it, of course. So both of them collect data. That’s true. But the goal of Telegraf is more like it’s focused on collecting metrics, whereas the focus of Icinga is figuring out if everything is working properly as you defined, but in the same turn, also collect the metrics. So Icinga, it’s not just only a metrics collector, where the focus of Telegraf is clearly on metrics only.
Caitlin Croft: 00:52:11.590 So perfect. And yes, another reminder of InfluxDays North America coming up like I mentioned. Even though it is technically our North America edition, it is open to everyone. So we’d love to see people from around the world join. It’ll be a really fun event. Call for papers is still open. So if you want to share how you’re using InfluxDB or any information in the DevOps or observability space, please feel free to join. Next Wednesday, we have our Community Office Hours, which will be as always a really great event. So I’m just putting that in the chat. And it looks like we got another question. Just a little bit more clarity. So since an Icinga plugin most of the time only has alerts for data that crosses certain static thresholds, perhaps hard-coded, not data that changes at a certain rate, is there a way to represent that fluctuation, maybe using Kapacitor, so that the monitor and operators are aware of data that is increasing or decreasing too fast?
Blerim Sheqa: 00:53:32.541 So, yes, in that case, you would have the possibility to instead of checking against only the current data, instead, you could use the data that you have stored in your InfluxDB and query InfluxDB and run the query and depending on the result, Icinga will alert you to — yeah, depending on what your query returns.
Caitlin Croft: 00:54:05.621 Doan, I just allowed you to talk. If there’s any more that you want to expand on your question, you should be able to un-mute yourself and speak if you would like.
Attendee: 00:54:20.309 Okay, can you hear me?
Caitlin Croft: 00:54:22.100 Yes.
Attendee: 00:54:24.909 Okay, thank you. Let me introduce myself first. I am a DevOps engineer, and we work with InfluxData and Icinga a lot for our telco monitoring system. So for the networking equipment, there’s a lot of data where the operator is only interested in the change of the data over time, like the bandwidth for example. They do not care if the bandwidth is over 800 megabytes or something like that. They want to know if the bandwidth is increasing too fast. You get the idea? And our current problem with InfluxDB and Grafana is that when we use a [inaudible] function in the InfluxDB, most of the time the operators do not understand the reason of the algorithm. And so I want to ask even from [inaudible], is there any good way or any best practice so that the NOC operators or the monitor can understand that the graph we are giving them is representing the change in the metric over time, not just the current value.
Blerim Sheqa: 00:55:45.843 Yeah, so if I understand the question correctly, it’s about that your network operators care more about changes in bandwidth than the absolute bandwidth. And what I would do there is instead of monitoring the current values of the bandwidth, instead, I would monitor by using the data stored in InfluxDB to query that data for, let’s say, the last 10 minutes or the last 5 minutes, whatever suits you. And depending on the result, you can use the InfluxDB query language to also calculate a little bit, so you can get the difference out of that, and depending on if the difference between the lowest and the highest value is too high, then you send an alert out. So it’s kind of a very minimalistic anomaly detection thing.
Attendee: 00:56:52.464 And normally to do that, we need Kapacitor, right?
Blerim Sheqa: 00:56:58.220 So either Kapacitor, of course, or by using Icinga with a check plugin that queries your InfluxDB.
Attendee: 00:57:09.528 Oh, okay, okay, I see. Thank you.
Blerim Sheqa: 00:57:12.382 You’re welcome.
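The approach from this exchange, querying a recent window from InfluxDB and alerting when the difference between the lowest and the highest value is too large, could look roughly like the following Python sketch of a check plugin's core logic. The measurement name `bandwidth`, the field name `value`, the thresholds, and the sample values are hypothetical. The sketch works on a list of values that is assumed to have been fetched already, so it does not depend on any particular InfluxDB client library. The exit codes follow the usual monitoring plugin convention.

```python
# Exit codes used by Nagios-compatible check plugins.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical InfluxQL query that lets InfluxDB compute the spread
# (maximum minus minimum) itself; measurement and field names are made up.
QUERY = 'SELECT SPREAD("value") FROM "bandwidth" WHERE time > now() - 10m'


def check_spread(values, warn, crit):
    """Return (exit_code, message) for the spread of `values`.

    `values` are the raw points of the time window. The spread is the
    difference between the highest and the lowest value in that window.
    """
    if not values:
        return UNKNOWN, "UNKNOWN - no data points in the time window"

    spread = max(values) - min(values)
    if spread >= crit:
        return CRITICAL, f"CRITICAL - value changed by {spread:.1f} in the window"
    if spread >= warn:
        return WARNING, f"WARNING - value changed by {spread:.1f} in the window"
    return OK, f"OK - value changed by {spread:.1f} in the window"


# Hypothetical bandwidth samples (Mbit/s) from the last 10 minutes.
window = [410.0, 415.0, 430.0, 620.0]
status, message = check_spread(window, warn=100.0, crit=200.0)
print(message)
# A real plugin would end with: sys.exit(status)
```

Here the absolute bandwidth never matters, only how far it moved inside the window. The same comparison could be done on the single number returned by the `SPREAD` query instead of on raw points.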
Caitlin Croft: 00:57:14.055 Great. Thank you so much. So this webinar has been recorded, and it will be available for replay afterwards. And the slides will also be made available on SlideShare. So thank you everyone for joining today’s webinar. Thank you Blerim for presenting. I thought it was a fantastic presentation. And I hope to see all of you next week at our next webinar as well as next week at our community office hours. And I threw a link to the Community Office Hours in the chat so everyone should be able to find it and register. Thank you, everyone, and I hope you have a good day.
Blerim Sheqa: 00:57:56.207 Thank you very much for having me.
Blerim is the CPO of Icinga, an open source monitoring solution. He spent years in systems engineering before moving to development and then to product management. He takes care of the Icinga product line in general. He's a committer and maintainer of various open source projects with a passion for automation.