Modern vSphere Monitoring and Dashboards Using InfluxDB, Telegraf and Grafana

Session Date: Apr 28, 2020
Time: 8:00am (PT) | 3:00pm (GMT) | 4:00pm (BST)

In a modern data center, we find ourselves surrounded by a lot of metrics and systems that require attention, monitoring, alarms and event management.

In this webinar, InfluxAce Jorge de la Cruz will demonstrate how to create beautiful and meaningful dashboards from vSphere’s most critical assets like hosts, VMs, clusters, and datastores. The best part is the whole monitoring system can be deployed in seconds using Docker, and it uses the vSphere SDK, which makes it non-intrusive and very efficient. Discover how to utilize this cost-effective monitoring and visualization solution for vSphere environments!

Watch the Webinar

Watch the webinar “Modern vSphere Monitoring and Dashboards Using InfluxDB, Telegraf and Grafana” by filling out the form and clicking on the Watch Webinar button on the right. This will open the recording.

[et_pb_toggle _builder_version=”3.17.6” title=”Transcript” title_font_size=”26” border_width_all=”0px” border_width_bottom=”1px” module_class=”transcript-toggle” closed_toggle_background_color=”rgba(255,255,255,0)”]

Here is an unedited transcript of the webinar “Modern vSphere Monitoring and Dashboards Using InfluxDB, Telegraf and Grafana”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.

Speakers:

- Caitlin Croft: Customer Marketing Manager, InfluxData
- Jorge de la Cruz: Systems Engineer, Veeam Software and InfluxAce

Caitlin Croft: 00:00:06.459 Alrighty. It’s three minutes after. Hello, everyone. My name is Caitlin Croft. I work here at InfluxData. I’m super excited to introduce Jorge de la Cruz. He is one of our fantastic InfluxAces, which means that he is very comfortable and very familiar with InfluxData’s technologies. And today’s webinar is on modern vSphere monitoring and dashboarding with InfluxDB, Telegraf, and Grafana. Just a little friendly reminder, please feel free to post any questions in the Q&A box or the chat window, whichever works best for you. We are monitoring both. So feel free to throw any questions you have in there. This webinar is recorded, and it will be available after the webinar. So without further ado, I will hand things off to Jorge.

Jorge de la Cruz: 00:01:04.947 Hi, guys. Just to confirm that - can you hear me properly? I guess -

Caitlin Croft: 00:01:12.833 Yeah.

Jorge de la Cruz: 00:01:12.909 - that’s yes. Okay. Perfect.

Attendee: 00:01:14.786 Can hear you. Yeah.

Jorge de la Cruz: 00:01:15.997 Really good. Really good. Good morning, good afternoon, good evening depending on where you are. So let me move to the next slide. So what are we going to learn today is just really a couple of points really, monitoring 101, vSphere monitoring, InfluxDB, Telegraf, and Grafana. And then the last point is about quickly mention about learn from the community and build for the community. So that’s me. And yes, quick introduction about myself. I’ve been blogging in jorgedelacruz.es and jorgedelacruz.uk. But in the dot es, because that’s my native language as you can see here by my accent, I’ve been blogging since 2011. So it’s a lot, a lot of years blogging and creating that content, mostly around vSphere, VMware, some Veeam, open source, a lot of open source there on the blog as well, mostly about Zimbra back in the days and now lately more and more about Grafana, InfluxDB and Telegraf because that is one of my favorite technologies out there for monitoring.

Jorge de la Cruz: 00:02:34.147 I actually have more than a thousand articles probably. So I just need to update the slide. But yeah. I’ve been blogging a lot of time, and I created a lot of content. Really good visits last year. And yeah. I’m married and I’m a father as well, so I’m really, really happy. And I guess everyone who is a father understands what really it’s - not having much time to do all these things anymore. So yeah. I guess I’m just thankful to my family. I’m a vExpert and recently an InfluxAce as well. The InfluxAce, it’s a program that Influx has for the most kind of - for the guys in the community that build stuff around the community and spend this extra time outside work or even certain things that they do at work with the Influx technologies. So I guess, I just encourage you to take a look into the InfluxAce website and see the different profiles, ping some of the InfluxAces because everyone has different skills. For example, today I’m here sharing about VMware, vSphere and Grafana, but they have a lot of other InfluxAces which they are really good in Perl - which they are really good at different technologies. So let me pass into the next slide.

Jorge de la Cruz: 00:04:00.157 And monitoring 101, there are two different kind of persons. We probably know these two different persons. So they are the ones that have monitoring. They have been really proactive, so they have some sort of monitoring or the ones that are attending today to get a better monitoring. And they are as well these other people. These other people, they’re surrounded by fire on the data center or whatever they need to monitor. And they just say, “I don’t really care much. It’s fine. Everything is fine.” But it is really not. I mean, it’s - the session that we’re going to see today is all based in open source technology, so it’s kind of no excuse really to start monitoring, in this case VMware vSphere today.

Jorge de la Cruz: 00:04:49.890 Okay. Let’s start with SNMP, right? So Simple Network Management Protocol. But again, this is a really old, old protocol with - as old as the internet is pretty much. It just really started long, long time ago. It still works, still perfect for some use cases, especially for some of that legacy hardware, for example, firewalls or some sort of applications that they cannot expose anything else than SNMP. SNMP, Simple Network Management Protocol, but I do not consider it really simple at all. We’re going to go through a couple of slides to prove to you that it is not simple at all and actually has a really, really high learning curve to start with SNMP, properly speaking.

Jorge de la Cruz: 00:05:41.498 So in the case that you keep using SNMP, which is okay, a couple of things to bear in mind, please. So while using it, please use v3, so the latest kind of version of SNMP, which is more secure. It has encryption and so on. So please do it. Please obtain as well the most up-to-date MIB from the different vendors that you’re using it because they will have like [new codes?]. It will have like new OIDs or new paths, new routes that are going to help you to monitor either new software or new piece of hardware that you have, or as well add some new functionalities itself. And please, please, please try to avoid public SNMP community. It’s a lot of people out there with strange intentions. And if you are exposing SNMP - I mean, you’re exposing public community without security, you will probably get, yeah, some problems in the future because you are exposing so much information. SNMP, you just throw out a lot of information out there. So it’s much better to keep it secure. And please do not use public community.

Jorge de la Cruz: 00:06:58.287 So let’s talk about the SNMP configuration, for example, for VMware VCSA, which is the vCenter Server Appliance. So if you start with VMware today and you try to get SNMP working with some legacy tool, and then, “Okay, yeah, I can do it with SNMP,” probably the first commands you will run are those words that - they’re on this slide at the bottom. Please don’t do them because they’re just using public community, and they’re using as well just kind of SNMP version 1.0. So please, please, please don’t do it. The right way to protect the SNMP of the VCSA, you can find it on these QR codes. You go in the mobile phone, which I’m sure you have handy somewhere, just point it there. And we will send the slides after. You will be able to download them. It’s just a hyperlink. When you click there, you will get to the proper SNMP version 3 configuration for the vCenter Server Appliance - someone at the door. Exactly the same for vSphere ESXi, right? So for vSphere ESXi, the first thing you will think when you have the ESXi, maybe brand new, maybe because you just joined a team and they don’t have any monitoring, it’s like, “Okay. Well, let’s enable SNMP.” And those are the commands that you should not run really. I just put them here just to make you familiar with them really. And please don’t run them. And again, QR code to the proper official link of VMware to configure this using SNMP version 3.

Jorge de la Cruz: 00:08:39.662 So okay. Moving forward, let’s talk about why SNMP is really, really, really complex. For example, SNMP, it has - the vendors can have what is called a MIB. And really a MIB is just - at the end, it’s a management information base. It’s kind of like a path where to find a specific metric. So for example, CPU, for example, RAM. So what’s the CPU of this VM, etc.? Just an example but exactly the same for anything in the world like servers, switches, everything. So you can see there everything that starts in root. Then it moves on there, ISO, which is one. Then it moves on there, identified organization, which is three, DoD, internet, management system, and then the system description. That’s how the MIBs usually kind of look like. So the MIBs are these - somebody on the vendor itself, they really produce these routes for you. So for example, the ones from VMware. So VMware, they have already the 1.3.6.1.4.1 and then the vendor name, which is sort of the vendor ID, which is 6876 from VMware, and then different things. For example, for the notifications, it’s the dot zero. For the system, it’s the dot one. For the virtual machines, it’s the dot two. And then inside that dot two, there are a lot of other numbers over there. That’s complex. If you ask me, I will say this is really complex, and this takes a lot of time and a lot of knowledge and a lot of digging, wasting time on the internet to kind of find all of this. At least VMware put the MIB, so a couple of that stuff, they will be already there, which is good news.
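
As a rough illustration of the OID layout described above, the following Python sketch splits a dotted OID and labels which VMware branch it falls under. The enterprise prefix 1.3.6.1.4.1.6876 and the branch names (dot zero, dot one, dot two) are taken from the talk; the function names and the sample OIDs are made up for this example, and this is not a real MIB parser.

```python
# Prefix named in the talk: iso(1).org(3).dod(6).internet(1).private(4).enterprises(1).vmware(6876)
VMWARE_ENTERPRISE = (1, 3, 6, 1, 4, 1, 6876)

# Branch labels exactly as the speaker describes them; anything else is left unknown.
VMWARE_BRANCHES = {0: "notifications", 1: "system", 2: "virtual machines"}


def parse_oid(oid):
    """Turn a dotted OID string such as '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def describe(oid):
    """Say which VMware branch an OID belongs to (illustrative helper, not a MIB lookup)."""
    arcs = parse_oid(oid)
    if arcs[:len(VMWARE_ENTERPRISE)] != VMWARE_ENTERPRISE:
        return "not under the VMware enterprise OID"
    rest = arcs[len(VMWARE_ENTERPRISE):]
    if not rest:
        return "VMware enterprise root"
    branch = VMWARE_BRANCHES.get(rest[0], "unknown")
    return f"VMware {branch} branch, remaining arcs {list(rest[1:])}"


if __name__ == "__main__":
    # Hypothetical OIDs, chosen only to exercise the helper.
    print(describe("1.3.6.1.4.1.6876.2.1.1.2"))
    print(describe("1.3.6.1.2.1.1.1.0"))
```

Even this toy version shows the point being made: the numbers carry no meaning on their own until someone supplies the mapping, which is what the vendor MIB files do.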

Jorge de la Cruz: 00:10:33.065 To download, same as I said before. We’re talking VMware today here, so VMware vSphere. VMware vSphere, they’re doing a pretty good job, decent job keeping the SNMP MIBs up to date. So that’s for 6.7. But I checked the other day, and I found for 7.0 already because that’s the latest version. But still, if you find for the vSphere version that you have, you will be able to download a different SNMP MIBs. So again, if you keep using this legacy protocol - I like to call SNMP - please, please, please follow these best practices and please download the latest MIBs.

Jorge de la Cruz: 00:11:14.535 Okay. So how does it look an snmpwalk example? Look at this. I mean, this is not human readable. It’s so complex. Some of the stuff you can understand. For example, the sys description, okay. It’s the VMware vCenter Server Appliance, blah, blah, blah. Okay. But others, they do not make much sense at the end. So I will say that this is really complex. If you want to start monitoring something, this is not going to be the most friendly way to start monitoring anything with the SNMP. So some things, you are going to understand them. Some others, you are not going to understand them. The uptime, okay, what does it mean? It’s one second all the time. What does it mean really? Much better to really move forward, keep watching the session because we are going to see a super simple way to start monitoring your VMware vSphere. When I say simple, it’s really simple, just three steps pretty much.

Jorge de la Cruz: 00:12:11.603 So that’s our next topic. So that’s ourselves on the rocket going to the moon really because we are going to see now how we can do a better vSphere monitoring, simple one and really, really fast. So what’s the most critical things that you would like to monitor? Probably it’s the CPU usage, either by the cluster, the host, the VMs as well, of course. And I’m pretty sure you will like to see them in percentage and probably megahertz from time to time. It depends. On my particular case, I’ll show you today percentage because at the beginning when I did all of this, I was putting it on megahertz, but people told me, “I don’t really care if it’s consuming like 200 megahertz or something. Much better if you tell me, ‘Okay, that’s the 3% or that’s 15%.’” But the benefit of this is that you can change it with just one click, so it’s super simple.

Jorge de la Cruz: 00:13:07.378 RAM usage, exactly the same. Probably you will like to get the consumption of RAM and cluster of VMware, probably per host as well and per VM in gigabytes. That’s the most likely that you will like to get; and percentage because once again I get the feedback with time that - people were telling me that, “It’s much better if you just can show the percentage instead of gigabytes.” Again, it’s super simple to retrieve both. Datastores usage and percentage, capacity, some VMFS reads, some writes, some IOPS, latency, that’s quite interesting to get from VMware and the per VM stats as well just to monitor absolutely everything that we did mention but per every single granular VM that you have inside your vCenter.

Jorge de la Cruz: 00:14:02.725 So now that we have all of that clear and that we discussed about the complexity of SNMP and how really legacy, there’s a better way to do all of this, which is using InfluxDB, Telegraf, and Grafana. So InfluxDB is the open source time series database. It will save the different metrics that the Telegraf agent writes on it. Telegraf is the agent, with the plugin that is going to download everything from the vSphere SDK. And then later on Grafana is the dashboard platform that I’m going to use for this specific purpose. I know that you can use Chronograf as well, no problem at all. If you like better Chronograf, you can just replicate what you are going to see today. So the InfluxDB, Telegraf, that’s perfectly fine. And then you put on top of that - you put on top of that Chronograf.

Jorge de la Cruz: 00:14:58.958 So let me again show you kind of the diagram, super, super simple. So the different components, remember the Telegraf there in purple - the agent retrieving all the information from the vCenter. So that’s all the metrics. It does it using the SDK, so kind of API-driven. It doesn’t need any agent on the vCenter. It doesn’t need anything. So the Telegraf itself that you install for - in my case, I have all in one. So I have InfluxDB, Telegraf, and Grafana all in one. That’s the one that is retrieving everything from vCenter all the time. So that’s quite simple. Telegraf, it has so many other plugins, so many ready to start pulling data out from different things. I don’t know, off the top of my head: MariaDB, Apache, NGINX, LDAP, Postfix. They’re so many. They’re so many. Just go there and put on Google Telegraf inputs, and it’s a massive list with more than 100 - I’m pretty sure they’re more than 100 at the moment and more that they keep adding version after version of the Telegraf, really powerful.

Jorge de la Cruz: 00:16:15.058 Then we take all that information, and then we save it. Where do we save it? In MySQL or MariaDB? No. Of course not. Those systems are not built to save - I don’t know - thousands and millions of metrics, not really. So we save it into InfluxDB, which is a really robust database, real-time time series database. So we can put there up, yeah, a lot of this information without any problem, millions, millions of metrics. One point to discuss here is that, for testing purposes, it’s good if you put it in one VM altogether, okay, to start with. But if you want to scale this to the much better professional level, I will recommend it to you to size the InfluxDB properly, the server, the virtual machine or wherever you put it, just size that InfluxDB properly, maybe leverage the InfluxDB enterprise version so you can do HA or even using the Influx Cloud; that InfluxDB part, it’s already you consume it as a service from Influx. You can start with it from the basic on the small VM, and then from there, you can always move to the next level.

Jorge de la Cruz: 00:17:27.052 On top of all of that, we have the Grafana, which is the dashboard technology that I’m using to just show all the metrics that I’ve been consuming and saving in InfluxDB, just to show them in a really beautiful way and, of course, meaningful way because it can be beautiful, but if it doesn’t tell you anything which is important for you, then it’s no point, right? So that’s a simple diagram. How can you start - how can you start - sorry, I got there - start monitoring this today? It’s really, really, really simple. You can start monitoring this by enabling the telegraf.conf with the next - on the next slide I’m going to show you. You just enable the input from vCenter, and then suddenly you start downloading absolutely everything. You have as well VMware. They build by themselves something which is called vSAN Performance Monitor and is an OVA, which is a virtual image. And that includes already InfluxDB, Telegraf, and Grafana with extra dashboards for vSAN but under VMware itself - everything is pre-configured. So you just download it, change the username and the IP of your vCenter, and then you have everything there.

Jorge de la Cruz: 00:18:43.525 What’s the next step? It’s downloading the ready to consume dashboards that I built over these years. So this is the configuration in Influx - sorry, in the telegraf.conf. Super simple, you just search by vSphere on your configuration, or you create another configuration file called vmware.conf. I do personally prefer to always create a separate file and save it in the telegraf.d. Yeah. Do not edit the telegraf.conf directly because that’s kind of the master configuration file, and I don’t really like to edit it. It’s much better you just take these two pieces and put it in a file called vmware.conf or vsphere.conf, super simple. As you see, the inputs.vsphere with the vCenters - so you just put the vCenter there or vCenters because then you have multiple, the username and password. And then it’s kind of two ways to retrieve information, and I’m going to touch that in the next slide. It’s the real-time instance which - well, what VMware calls real-time and the historical instance. So for the real-time instance, you just need to put it like this way with these intervals and so on. And for the historical instance, the interval is a bit higher because vCenter, they’re not giving us the information much faster or sooner than 300 seconds. So really this is a proper, good configuration, and this is a configuration that the guys that develop Telegraf itself, they’ve been working with VMware to obtain these metrics itself.
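
The slide itself is not reproduced in this transcript, so here is a minimal sketch of the two-instance layout described above, written as a small Python helper that renders the text of a drop-in file for telegraf.d. The option names (`vcenters`, `insecure_skip_verify`, the `*_metric_exclude` lists) follow the Telegraf vSphere input documentation as best recalled and should be checked against the installed version; the URL, credentials, function names and the 60-second real-time interval are placeholder assumptions, not values from the talk.

```python
def _vsphere_instance(interval, vcenter_url, username, password, excluded):
    """Build one [[inputs.vsphere]] table as TOML text (values are not escaped)."""
    lines = [
        "[[inputs.vsphere]]",
        f'  interval = "{interval}"',
        f'  vcenters = ["{vcenter_url}"]',
        f'  username = "{username}"',
        f'  password = "{password}"',
        # Assumption: a lab vCenter with a self-signed certificate.
        "  insecure_skip_verify = true",
    ]
    # Each instance switches off the resource types handled by the other one.
    lines += [f'  {kind}_metric_exclude = ["*"]' for kind in excluded]
    return "\n".join(lines)


def render_vsphere_conf(vcenter_url, username, password,
                        realtime_interval="60s", historical_interval="300s"):
    """Render a real-time instance (hosts, VMs) and a historical one (the rest)."""
    realtime = _vsphere_instance(
        realtime_interval, vcenter_url, username, password,
        excluded=("datastore", "cluster", "datacenter"))
    historical = _vsphere_instance(
        historical_interval, vcenter_url, username, password,
        excluded=("host", "vm"))
    return ("# Real-time instance: hosts and VMs\n" + realtime + "\n\n"
            "# Historical instance: datastores, clusters, data centers\n"
            + historical + "\n")


if __name__ == "__main__":
    # Placeholder vCenter address and account, for illustration only.
    print(render_vsphere_conf("https://vcenter.example.local/sdk",
                              "monitor@vsphere.local", "changeme"))
```

The printed text could be saved as vsphere.conf (or vmware.conf) under the telegraf.d directory, which keeps the main telegraf.conf untouched, as recommended above.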

Jorge de la Cruz: 00:20:30.644 So let’s talk about this, how the vCenter stores the metrics, right? So there are two ways. One is the real-time, which is not real-time. You can see that these metrics, every 20 seconds, right? Because it’s in every 20 seconds, you can go down - on the real-time configuration before on that file, you can go down to the 20 seconds to match what VMware is giving to you. But you cannot go down to 5 seconds because VMware is not going to give you that. So what VMware calls that real-time is including things from the ESXi and VM, so the metrics such as CPU, RAM, and the networking. And for the other items, which they are the datastore cluster and data center metrics, they are every 5 minutes. So again, you do not need to - even if you query that sooner like, okay, just give me the metrics every 30 seconds, you are kind of doing queries that they cannot give you anything. So you’re wasting a lot of cycles doing all of that to not retrieve anything. So as I mentioned before, this is how the configuration should look like. Just copy-paste this on a new file, and this works out of the box. So that’s super simple.
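
To put a number on the wasted cycles mentioned above, here is a back-of-the-envelope Python sketch. It assumes a simplified model in which vCenter publishes one new sample per source interval (20 seconds for host and VM metrics, 300 seconds for datastore, cluster and data center metrics) and any extra poll in between returns nothing new; the function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def wasted_polls_per_day(poll_seconds, source_seconds):
    """Count daily polls that cannot return a new sample under the simplified model."""
    polls = SECONDS_PER_DAY // poll_seconds
    # New data appears no more often than the slower of the two intervals.
    fresh = SECONDS_PER_DAY // max(poll_seconds, source_seconds)
    return polls - fresh


if __name__ == "__main__":
    # Polling 5-minute metrics every 30 seconds: 2880 polls, only 288 can be fresh.
    print(wasted_polls_per_day(30, 300))
    # Matching the source interval wastes nothing.
    print(wasted_polls_per_day(300, 300))
```

Under this model, asking for the 5-minute metrics every 30 seconds means nine out of every ten queries come back with nothing new, which is why the historical instance is configured with the longer interval.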

Jorge de la Cruz: 00:21:53.594 And then let me move into this slide. Wow, one of my favorite. So this is demo time now. So Caitlin told me the other day that I should kind of maybe have slides in backup just in case, but I just like to live dangerously - like the guy with the fire. So I’m going to do it in real time and see if I have my Grafana over here and if everything works. So these are one of the dashboards that you can - I guess that you can see the screen and everything. These are one of the dashboards that you can download for free and import it on Grafana. And if you have configured the Telegraf part, this just works out of the box. Here on the vCenter server, you will see your vCenter servers. On the clusters, you will see your clusters. On the ESXis, you will see your ESXis, datastores, VMs, and so on. All of this works out of the box, which is really, really, really great. This took me a lot of time when I was building this at the beginning. And thanks to the community feedback, I’ve been adding some more stuff, removing some things and so on and so forth.

Jorge de la Cruz: 00:23:03.805 So if I go over here, this is over the last - this is over the last 6 hours. Then under More Dashboards, I have more dashboards over here. So I have the host. I have the IPMI, the VMs, the datastore. Let’s see if we can open a few more over here. Probably I’m going to show that, the VMs and the datastore. And all of these dashboards, they are already there. So that’s the benefit. When you download all of them together, they will be put all together in kind of a folder, which is going to be called VMware, and you can see all of them over there. So that’s neat. That’s really powerful if you ask me. Let me show you what we have over here. So this is the last 6 hours, but you can probably go under the last 24 hours. And you can scroll a bit. You can see that these datastores’ usage capacity and this one is the same, but I know people like to see them differently from time to time. So I just added the speedometer, and I added as well this new fancy LCD from Grafana that they added in version 6.0, I think. So I really like this new kind of LCD way to show it. It reminds me kind of the Back to the Future kind of - yeah. Or the DeLorean, some of that. You will see the hypervisor status over here with your two hypervisors or the hypervisors that you select at the top. That’s the benefit of having things on the top that if you change things, let me prove you - so if you change things over here, you can see the panels kind of refreshing themselves. If I go and pick now just one, things change in real time because, well, that’s how everything is built in here, right? Let me put all. You can see that the diagrams change over here, the hypervisor. Same for the VMs.

Jorge de la Cruz: 00:24:59.625 This is kind of a quick, quick, quick, quick overview about the whole VMware vCenter itself and hypervisors and VMs, kind of all together. So this is a nice, good view. And then if we move into the next one, you will see kind of a dashboard per host. So once again, if you just go there and then - okay. I just want to see this ESXi because it’s giving me a lot of problems, a lot of troubles or is having a big use of CPU, so over the last - I don’t know - 24 hours. And this is the magic. Yeah. You can see when I change between the dates, it’s not really Grafana. It’s really querying my InfluxDB in real time and giving me this power - yeah, giving me this query within seconds. I mean, it’s insane the speed of this. It’s just so, so fast that you can move between days, and this thing just takes nothing. It’s just a lot of metrics here. It’s probably metrics - every 20 seconds or something. So it is really - well, this also is every five minutes. So yeah. It’s really powerful.

Jorge de la Cruz: 00:26:06.076 This is IPMI. It’s related to all of this, right, because you want to keep the control of the hardware of those ESXis. This is not using the vSphere input. This is using another input that Telegraf has by default which is querying the IPMI or the iLO and all of that stuff from the server. So it works out of the box. So you just need to install the IPMI tool, which probably you have it under Linux anyways. And then again you download this dashboard. And you will be able to see your temperatures over here over time and so on. This did really, really help me, especially over summer because I have all the servers over here, just on my left here. And I do not want to get them overheat. They’re quite expensive, so.
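
The IPMI input mentioned above is not shown in the talk, so the following Python sketch only illustrates what a drop-in file for it might contain. It assumes the Telegraf `ipmi_sensor` input and its `user:password@lan(address)` server format, which should be verified against the Telegraf documentation for the installed version; the address, credentials, intervals and function names are placeholders.

```python
def ipmi_server(user, password, address):
    """Format one server entry as user:password@lan(address)."""
    return f"{user}:{password}@lan({address})"


def render_ipmi_conf(servers, interval="30s", timeout="20s"):
    """Render an [[inputs.ipmi_sensor]] table as TOML text (values are not escaped)."""
    quoted = ", ".join(f'"{server}"' for server in servers)
    lines = [
        "[[inputs.ipmi_sensor]]",
        f"  servers = [{quoted}]",
        f'  interval = "{interval}"',
        f'  timeout = "{timeout}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address; replace it with the server's BMC/iLO address.
    print(render_ipmi_conf([ipmi_server("monitor", "changeme", "192.0.2.10")]))
```

As with the vSphere input, the rendered text would go into its own file under telegraf.d, and the host running Telegraf needs the ipmitool package installed.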

Jorge de la Cruz: 00:27:03.144 And this is the per VMs one. Probably this is much useful if you just select the VMs that you want to look instead of all of them because I can see that you’re not going to scroll and then see - I don’t know - 100 VMs over here. It doesn’t make any sense. So just for the one that you want to see - okay. How is the Active Directory doing? Okay. In megahertz, in percentage. This is what I mentioned to you. So this is quite powerful. What else do I have over here? Sorry. I just clicked there, and then I cannot go back into this, right? Okay. And this is the datastore, which is just at the moment the total capacity, free capacity. I need to rework a bit on this dashboard. It kind of looks great, but it looks a bit empty. It’s much more that we can do with the datastores. But at the moment, this is what you will obtain. It’s still really quick and really visual to see - you see the donut over here and how full is the donut itself. So it’s quite visual to understand anyways how is your capacity.

Jorge de la Cruz: 00:28:03.330 So let me quickly touch because I have a couple of slides here. So after you download and install Telegraf either from the repos or from here, you download the Telegraf. You install it. Super simple, just directly from the repositories is going to be my [inaudible]. You just download it. After, you just go to the grafana.com, and you search by VMware and you will see it. But anyways, here is the number. And you could see that this has already 5,000 - almost 5,000 downloads, almost there. So it’s a lot of downloads. A lot, a lot of downloads. And I like to think from time to time that these are in data centers. Imagine if this dashboard is in 5,000 data centers. But I guess that there are a lot of people like me that they have like home labs, and they’re trying this on their own testing environment. Still, this is really great number, and I’m quite pleased with this number of downloads to be honest. So you just download it from here, copy to clipboard or take this and then import it in VMware and - sorry, import it in Grafana, and that’s it. Super, super simple. If you get some comments, just go into the reviews and put it over there or ping me, yeah, by any channel itself.

Jorge de la Cruz: 00:29:23.445 So I just wanted to show you the vSAN Performance Monitor once again. It’s an appliance, an image that you can download, and it has everything together. So it has already the Telegraf. It has the InfluxDB. It has Grafana. And because this is VMware, it has three extra dashboards about vSAN itself or VMware vSAN in the case that you’re using it. The good thing with this is that you download this and configure the IP of your vCenter with your credentials. And then later you just import my dashboards, and you will have the full pack. You will have all these free dashboards that you can see here, and you will have as well the vSAN, which I do not have any vSAN just yet because it’s kind of in another Telegraf branch, blah, blah, blah. It’s too complex to enter into that. But it’s simple to download and deploy this within seconds.

Jorge de la Cruz: 00:30:26.971 If I go over here, what’s the good thing about this is that you can edit everything that - if you don’t like something here, if you don’t like the RAM usage like this and you do prefer like that, you can do that or put it like that. This is the good thing with the open source, right? And this dashboard needs to be meaningful to you. It needs to be useful for you. Do you want to add your logo? That’s great. Do you want to change the RAM usage because you don’t like the orange and you prefer another color? That’s great. Just go here and then put, okay, I want - like the warning needs to be blue. Okay. Just do it. It’s going to be your monitoring tool. It’s going to be - all of this is going to be kind of your baby. So just do whatever is more useful for you, remove what you do not want, add anything that you want to add. So the queries are super simple. If you see into the cluster CPU usage and you go edit, everything is super visual. So give me from the vSphere host CPU where cluster name is, of course, the variable cluster name, so the one that you have selected here on top. Give me the usage average, the last value that you’ll have. And of course, because it’s a time series, it just shows over here the time series. Super simple, it doesn’t have anything complex. So I really encourage you to start looking into this because it’s super, super, super simple and powerful into all of this. So yeah, a bit of usage average. Yeah. I don’t know what to show you anymore. I think this is a really good overview anyways. And if you have any questions, please put it on the Q&A, and we will answer them after.
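
The panel query read out above can be sketched roughly as follows. The measurement `vsphere_host_cpu`, the field `usage_average` and the tag `clustername` match what the speaker describes; the exact text of the dashboard query is not in the transcript, so the template below is an approximation, and the small Python helper only imitates (in simplified form) how Grafana fills in `$clustername`, `$timeFilter` and `$__interval` before sending the query to InfluxDB.

```python
# Approximate InfluxQL template for the "cluster CPU usage" panel (an assumption).
PANEL_TEMPLATE = (
    'SELECT last("usage_average") FROM "vsphere_host_cpu" '
    'WHERE "clustername" =~ /$clustername/ AND $timeFilter '
    'GROUP BY time($__interval)'
)


def expand(template, cluster, time_range, interval):
    """Naively substitute the Grafana variables; cluster must not contain regex characters."""
    return (template
            .replace("$clustername", cluster)
            .replace("$timeFilter", f"time >= now() - {time_range}")
            .replace("$__interval", interval))


if __name__ == "__main__":
    # "Lab" is a hypothetical cluster name; 6h matches the time range used in the demo.
    print(expand(PANEL_TEMPLATE, cluster="Lab", time_range="6h", interval="20s"))
```

Changing the cluster in the drop-down at the top of the dashboard, or the time range, only changes these substituted values, which is why the panels refresh so quickly.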

Jorge de la Cruz: 00:32:08.138 So okay. Let’s go back here into the slide deck, okay? So learn from the community and build for the community. This is really important. This is kind of the last message that I wanted to give you before we jump into the Q&A. So let’s kind of talk a bit back. So before we were into this rocket going over to the moon or going at least to have a better monitoring tool, a quick, free, open source one - before all of this, I was using all those other products, other products based in SNMP, other products based - I don’t know - Windows, for example. And they were complex. They were difficult, especially about SNMP in the past using - I don’t know - other open source tools like Nagios, for example. That’s a good example. Nagios is great. It works out of the box, but it’s complex because you need to download - especially if you’re using the open source, you need to download the scripts that when you’re going to the Nagios kind of gallery, it’s - I don’t know - six years old. So it’s not getting much, much updates, so. The bottom line here is it was painful before kind of jumping to the rocket and so on.

Jorge de la Cruz: 00:33:26.779 And then this is the kind of community inception timeline, right? So in June 2016 - so it’s 4 years already - I just wrote my first Grafana blog. Same as maybe you today, I just went into the website and then - I’m just tired of all of these products that they are really complex, and they don’t give me any beautiful and at the same time useful examples, dashboards, quick overview that I can put on my monitors. So I just started downloading Telegraf. I downloaded InfluxDB. And then I downloaded Grafana and built - I don’t know. I think I built like a Linux dashboard. It was super simple, but it was good. It was the first step, right? So then a year after looking for getting a better monitoring for VMware, I did use one PowerShell script from a fellow vExpert, which is - his name is Mike Nisk. That script was called - it’s still called vFlux Stats Kit, and that was a complex PowerShell that it could download all the data from VMware back to InfluxDB. It was good. It was version 1.0, so it was quite good, quite powerful but still quite complex because you need to do the power - you need to understand the PowerShell. You need to have this Windows with PowerShell to run all of this, back in the day in 2017 at least. Now, I know that in Linux you can run PowerShell commands and so on. But before, it was kind of tricky. Yeah. It was great. The rocket was there kind of outside the stratosphere but nothing beyond.

Jorge de la Cruz: 00:35:05.425 Then in September 2018, Telegraf announced the native vSphere input. So you could find it already - that configuration that I showed you is exactly the same since then. It changed - maybe they added a couple of extra things but really, really small. So when I saw the release notes - I remember that night. Probably I didn’t even sleep or something. It was - I don’t know - so powerful. I don’t know. It was like having a Christmas present on that day. Then September 2018, it was the last days. So then a couple of days after, I guess, I created all of these dashboards that I showed already, and I just put them there in grafana.com. So yeah. The three new dashboards, they were created and serving the community at that point. And then since 2018, so much feedback from Slack, the VMware - the Slack from VMware, the Influx forums, emails, Twitter, yeah, a lot of feedback like, “Hey, can you please add this? Can you please change that?” It’s really great feedback, all the feedback that I’ve been receiving about all of this.

Jorge de la Cruz: 00:36:16.883 And then at the end kind of - this is May 2020. Well, that will be a couple of days, right? But hopefully May 2020, there will be - probably it’s a bit more downloads now. So just it has so many downloads already. It has been downloaded a lot of times. The feedback has been absolutely powerful and really encouraging. And this is really built for the community. And I let the picture there of the InfluxDays because the InfluxDays in - when we were able to attend in person, that year, I attended the InfluxDays London, and it was really, really, really good event full of people, full of people sharing their experiences, the presenter sharing experiences same as I am doing today about VMware. But it was about sharing other experiences with other technology. It was so, so good that event, and it’s really this. It’s just really start from the beginning, install all these three components, and then from there, look into the forums, look into the GitHub, look into what you can or want to build. And if it’s not built, please build it, and then please share it with everyone because that’s the most - yeah, that’s the most and - I don’t know - a great thing that you can do. And here I am today just sharing my experience of how I started from the beginning just tired of legacy products, built it, built all of this and sharing it with you today in the case that is useful, and I really hope that is useful for you guys. And I think that’s everything from me. Caitlin?

Caitlin Croft: 00:37:59.851 Thank you, Jorge. I think that was fantastic. I loved your timeline part showing where you come from and where you are today. So as people start adding more questions, I just want to remind everyone of InfluxDays. Given all the current healthcare concerns, InfluxDays London 2020 has been turned into a virtual event. So it’s going to be a really fantastic virtual experience for all of our InfluxDB users. So we also have our Hands-on Flux training on June 8th and 9th, and it will be obviously all virtual. It will be really great. We have some fantastic trainers who will lead it. And then we also have InfluxDays which will happen on June 23rd and 24th 2020, so just a couple of months away. So we’re super excited to have everyone join us in a slightly different, unique format, but we’re super excited to be able to share the information with you guys and see you guys all online. So Jorge, there are a few questions. The first one being, can you give us a URL so we can customize the configuration?

[silence]

Caitlin Croft: 00:39:36.555 Oh, Jorge, I think you’re on mute.

Jorge de la Cruz: 00:39:40.126 Oh, sorry. Yeah. I was talking there - you see the physical - okay. Can you hear me now [laughter]?

Caitlin Croft: 00:39:45.336 Yes. Thank you [laughter].

Jorge de la Cruz: 00:39:46.443 Okay. Yeah. Yeah. So you can download all the dashboards. And yeah. You can edit all of those. The configuration for the telegraf.conf, it’s inside the telegraf.conf. If you want to get that configuration itself kind of in text, because that’s probably what you are asking, yeah. It’s here. Probably it’s on the GitHub, but I’ll show you here because it’s exactly the same. And I will put here vSphere, Telegraf. And then you can go here, and you will be able to pick it from here. Yeah. Yeah. It’s here. And then just go over here. Just go to the official website. Or if not, just come here. I think if you put on Google VMware, Telegraf, Grafana, probably you’re going to end up into this, so that’s the configuration. Okay. Let me see into more of the questions.

Jorge de la Cruz: 00:40:49.123 Is it possible to get ESXi metrics such as esxtop via Telegraf? I’m pretty sure you can do it. Yeah. Yes. You will need to build some custom script. But yeah. Using the PowerCLI that you can use today already from Linux, I’m pretty sure you will be able to, yeah, download esxtop results and parse them and then later on save them into InfluxDB. I will say yes. It’s completely possible. Can you review, after creating the two conf files, where do you put those? What directory in Telegraf? Oh, yeah. This is the directory where you put them. Okay. I mean, it was here. So I usually tend to recommend to put them in /etc/telegraf and then telegraf.d. And then here you put all that conf - those conf files. So it can be, for example, vCenter one, vCenter customer two, vCenter customer three .conf and so on and so forth, always inside telegraf.d. So you do not edit the telegraf.conf, which after updates and other stuff always asks you, “Do you want to overwrite the telegraf.conf?” blah, blah, blah. It’s much better you just have your different vCenters, or anything that you want to monitor with Telegraf, inside the folder with the specific configuration.
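
The "parse them and then save them into InfluxDB" step for esxtop can be sketched roughly as follows. This is a minimal sketch, not something shown in the webinar: it assumes the esxtop results have already been exported as CSV with a timestamp in the first column and one counter per remaining column, and it turns each row into InfluxDB line protocol. The function names, the `esxtop` measurement name, the `esxi01` host name and the sample columns are all hypothetical.

```python
import csv
import io
import re


def sanitize(name):
    # Collapse anything outside [A-Za-z0-9_] into "_" so the result needs
    # no line-protocol escaping (a simplification chosen for this sketch).
    return re.sub(r"[^A-Za-z0-9_]+", "_", name).strip("_")


def csv_to_line_protocol(csv_text, host, measurement="esxtop"):
    """Turn CSV rows (timestamp column first) into line-protocol strings.

    No timestamp is written, so InfluxDB would assign its own receive time.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    lines = []
    for sample in samples:
        fields = []
        for column, value in zip(header[1:], sample[1:]):
            try:
                number = float(value)
            except ValueError:
                continue  # skip cells that are not numeric
            fields.append(f"{sanitize(column)}={number}")
        if fields:
            lines.append(f"{measurement},host={sanitize(host)} " + ",".join(fields))
    return lines


# Hypothetical sample data, only to show the shape of the output.
sample = "timestamp,cpu used pct,mem free MB\n2020-04-28 08:00:00,12.5,2048\n"
print("\n".join(csv_to_line_protocol(sample, "esxi01")))
```

One way to wire a script like this in, again as an assumption rather than the speaker's setup, is Telegraf's exec input with `data_format = "influx"`, so that Telegraf runs the script on each interval and forwards whatever lines it prints.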

Jorge de la Cruz: 00:42:05.369 What’s your opinion on using a VM as a Docker container to run Telegraf? [inaudible] VM container improvement? You can do it, but remember, if you’re starting to leverage Telegraf massively, heavily with thousands, millions of metrics, that container is going to be a massive beast. It’s much better to just think about how much metrics do you want to store on that and, yes, plan ahead into that. Anyways, if you start with these and then you need it in the future, you can always export the InfluxDB database and import it into a new VM, to a new system, more powerful. You can do that. URL for the documentation? Yeah. It’s here already. Is there a reference Telegraf config available somewhere? Yes. It’s here. So you can just download this. Just change only this, and it’s going to work out of the box without any problem. Yeah.

Jorge de la Cruz: 00:43:03.394 Okay. Is there also a Docker, Kubernetes official source for Telegraf? I think it is. Yes, for the whole Influx stack itself. I’m pretty sure it is, but for this together - I mean, this is just really - you just enable something inside Telegraf, so I will say yes. You just download this and - do you recommend monitoring [inaudible] through vCenter or deploying Telegraf directly on VMs and use inputs such as performance counters? You know what, Paul? I will recommend to use both because it’s free and you can do both ways. As much monitoring as you have, probably the better. I think the VMware is going to give you some metrics, which they are fairly accurate, but the guest operating system is probably - it’s going to probably give you much better metrics of the consumption itself, of the operating system. So I will probably combine both.

Jorge de la Cruz: 00:43:56.238 So there is no URL for the API documentation. I’m looking for your config only. For the API documentation - yeah. It’s on the telegraf.conf. If you go into the vSphere part, it’s a massive documentation on how everything is done for all this Telegraf plugin for VMware. Yes. It’s really detailed. Yes. Search that on the Telegraf GitHub. [inaudible] containers have better later version - okay. It’s good to mention that Telegraf will read all files with a .conf extension. Yes. This is true. So everything that you put here with a .conf - it’s going to read absolutely all the files. So for that reason, I tend to recommend to really put everything inside this folder because it’s going to be [read it?] and configuration. Check my blog to, okay, find references. Okay. Thank you for that. Let me see if everything else is Q&A. How about creating actionable -? Yeah. Tell me, Caitlin.
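
The one-file-per-vCenter layout inside telegraf.d can be sketched as a small generator script. This is a minimal sketch under stated assumptions: the vCenter names, URLs and credentials are placeholders, the helper name is hypothetical, and the demo writes to a temporary directory rather than /etc/telegraf/telegraf.d. The `[[inputs.vsphere]]` section with `vcenters`, `username`, `password` and `insecure_skip_verify` follows the Telegraf vSphere input's documented options; anything beyond those four settings is left out.

```python
from pathlib import Path
import tempfile

# Minimal vSphere input section; every value filled in below is a placeholder.
TEMPLATE = """[[inputs.vsphere]]
  vcenters = ["{url}"]
  username = "{username}"
  password = "{password}"
  # Assumption: self-signed lab certificates. Remove for trusted certificates.
  insecure_skip_verify = true
"""


def write_vcenter_configs(directory, vcenters):
    """Write one <name>.conf per vCenter and return the paths that were written."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for name, settings in vcenters.items():
        path = directory / f"{name}.conf"
        path.write_text(TEMPLATE.format(**settings))
        written.append(path)
    return written


# Hypothetical vCenters, written to a throwaway directory for the demo.
demo = {
    "vcenter-customer1": {
        "url": "https://vcenter1.example.local/sdk",
        "username": "monitor@vsphere.local",
        "password": "changeme",
    },
    "vcenter-customer2": {
        "url": "https://vcenter2.example.local/sdk",
        "username": "monitor@vsphere.local",
        "password": "changeme",
    },
}
with tempfile.TemporaryDirectory() as tmp:
    for path in write_vcenter_configs(tmp, demo):
        print(path.name)
```

Keeping each vCenter in its own file means the main telegraf.conf stays untouched across package updates, which is the reason given in the answer above. The password sits in plain text in this sketch, so file permissions on the directory matter.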

Caitlin Croft: 00:44:59.539 Oh, yeah. There is another question here. Why not start using Telegraf to consume native vCenter API instead of SNMP? Do you think it’s fair to Telegraf to talk to native vSphere and native API?

Jorge de la Cruz: 00:45:20.491 What is it? Is it here on the chat itself? Okay. Oh, yeah. Okay.

Caitlin Croft: 00:45:25.101 It’s in the chat.

Jorge de la Cruz: 00:45:27.077 Why not start using Telegraf to consume native vCenter API instead of SNMP? Do you think it -? Oh, yeah. This is what we have done over here. Yeah. Telegraf talks to the native vSphere API. Yes. That’s okay. So yeah. The first slides about SNMP, it was to prove to you that it was so, so complex to do everything through SNMP. And then at the end of the session, it was really just enable this because Telegraf talks directly with the API. Yes. Can I use Telegraf, Elastic, Grafana, your dashboard? Yes. You can do it. Yeah. Oh, sorry. Elastic, no. It needs to be Telegraf, InfluxDB, and Grafana. I do not think that this Telegraf can send into Elastic. I need to double check. I need to double check. I will probably get back to you, John, later on. Carlos Gomez, “Thanks, Jorge.” Thanks, Carlos.

Jorge de la Cruz: 00:46:22.336 What else? [inaudible] creating consumer alerts? Can you show an example? Alerts. Yes. You can create a lot of alerts using Chronograf or through Grafana. As you know, you can create alerts as well over here, but I will probably use Kapacitor for this instead of Grafana. Grafana is a bit limited on the alerts that you need to create and so on. So I will probably use much better Kapacitor. Yeah. So just take a look into Kapacitor to create alerts. What else? Is this Influx Database 1.x or 2.x version used? At the moment, I’ve been using 1.x overall this time. I need to jump into the 2.x and see if this is supported. Probably it’s not much changes, but I’ve been using 1.x at the moment. Sorry. Do you have an option to using Prometheus versus Telegraf, or are they too different to compare? The plugin itself that we’re discussing today is in Telegraf, so I’m not sure if it is in Prometheus by any means. They are different plugins themselves, right? So I cannot answer that at the moment. I’ve been using the Telegraf one that the guys of Telegraf - they added, so.

Jorge de la Cruz: 00:47:36.510 How many vSphere hosts InfluxDB size for? At what point should we leverage the enterprise version? There’s kind of a really great blog post that I’ve seen into the Influx blog, official blog post itself. If you look by size in InfluxDB, they can tell you - it’s really great blog post. It goes really detailed about the size in the InfluxDB. And it’s really per metrics itself and measurement. So just take a look into that. I cannot really answer that quickly. I wish I could in one minute, but it’s much better to take a look into that and make some numbers on the background, for your background environment. Can you create buttons in Grafana to control devices? Probably. If you have like attached screen or something, but that’s outside of the scope of the webinar today.
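
"Making some numbers" for your own environment can start from a back-of-the-envelope ingest estimate like the one below. This is only a rough sketch: every input (host count, VM count, metrics per object, collection interval) is a placeholder to replace with your own figures, and the result covers field values written per second and per day, not disk usage, which depends on factors this sketch does not model.

```python
def estimate_ingest(hosts, vms, metrics_per_host, metrics_per_vm, interval_seconds):
    """Estimate how many metric values are written per second and per day."""
    values_per_interval = hosts * metrics_per_host + vms * metrics_per_vm
    return {
        "values_per_interval": values_per_interval,
        "values_per_second": values_per_interval / interval_seconds,
        "values_per_day": values_per_interval * 86400 / interval_seconds,
    }


# Placeholder environment: 10 hosts, 200 VMs, collected every 20 seconds.
estimate = estimate_ingest(
    hosts=10,
    vms=200,
    metrics_per_host=50,   # assumption
    metrics_per_vm=30,     # assumption
    interval_seconds=20,   # assumption
)
for key, value in estimate.items():
    print(f"{key}: {value:,.0f}")
```

Comparing the per-second figure against the official sizing guidance mentioned in the answer is then a matter of looking up the tier that matches your number.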

Caitlin Croft: 00:48:38.252 Thank you so much, Jorge. It sounds like a lot of people have lots of questions for you. If anyone has any further questions, please feel free to put them in the Q&A box. Can you create buttons in Grafana to control devices?

Jorge de la Cruz: 00:49:03.348 I guess you can, yes. But it’s probably out of the scope of the webinar. I don’t know. Maybe.

Caitlin Croft: 00:49:11.553 Okay [laughter]. Maybe that’s something else we can look into, Jorge [laughter].

Jorge de la Cruz: 00:49:19.779 Yeah [laughter]. Maybe. Yes.

Caitlin Croft: 00:49:27.533 Well, perfect. If anyone has any further questions, you can always email me, and I can always connect you directly with Jorge. I know what it’s like on these webinars. You think you’ve asked all your questions, and then afterwards you really hope that you could have asked another question. So please feel free to email me. We have another question here. Is there any work going on for AHV monitoring?

Jorge de la Cruz: 00:49:57.168 It may be. I mean, not that I’ve heard anything officially from the Telegraf guys, but AHV, they do have RESTful APIs. So using the RESTful APIs, you could probably build it really and fairly quick. And then be the next Jorge. And then you can present it as a session how you achieved the AHV monitoring. That would be really great to see actually.

Caitlin Croft: 00:50:26.158 Fantastic. Great. So this webinar has been recorded. It will be published on our website by the end of the day today. And just wanting to let everyone know, in case you are super excited about InfluxDB after joining this webinar, we do have a virtual time series meetup taking place in one hour. So if you want to learn even more about InfluxDB, we actually have another InfluxAce speaking on that in just over an hour. And I will send everyone the message in the chat here. So if you’re interested, please feel free to join us. You will see some familiar faces that are on this webinar. Thank you, everyone. I hope you have a great day and please let me know if you have any further questions for Jorge.

[/et_pb_toggle]

Jorge de la Cruz

Systems Engineer, Veeam Software

Jorge de la Cruz is a Systems Engineer, husband, and father living in the UK. He has been an active blogger since 2011 and his main expertise, when not working as a Systems Engineer, is to consume and dissect RESTful APIs and send data to InfluxDB to later show it within Grafana. When not working on the computer, Jorge enjoys preparing home-made food like bread, pasta, and roast "everything" with special expertise in Spanish dishes. Jorge de la Cruz is an InfluxAce.
