How to Gain Real-Time Visibility into Your IaaS with vBridge, InfluxDB, Grafana
Webinar Date: 2021-06-01 15:00:00 (Pacific Time)
2021-06-02 10:00:00 (New Zealand Standard Time)
vBridge are the creators of a multi-site IaaS platform, which provides clients with fast and reliable data storage and cost-effective computing services. Their cloud infrastructure monitoring solution aims to provide the simplicity, flexibility and control required by their clients. vBridge’s solution lets customers generate ad hoc performance graphs of their virtual workloads. Their API stores metrics on every request (HTTP status code, response time, endpoint, etc.). Discover how vBridge uses InfluxDB and Telegraf to collect and store backend metrics from Pure Storage and 3PAR storage arrays.
In this webinar, Ben Young will dive into:
- vBridge’s methodology for hosting infrastructure as a service
- Their approach to delivering superior processing power, meeting uptime SLAs and providing disaster recovery
- How vBridge uses a time series database to empower their clients with real-time monitoring of clients’ backend systems
Here is an unedited transcript of the webinar “How to Gain Real-Time Visibility into Your IaaS with vBridge, InfluxDB, Grafana”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
- Caitlin Croft: Customer Marketing Manager, InfluxData
- Ben Young: Head of Cloud Products, vBridge
Caitlin Croft: 00:00 Once again, hello, everyone, and welcome to today’s webinar. My name is Caitlin Croft, and I’m joined by Ben Young, who works at vBridge, and he’ll be discussing how to gain visibility into your Infrastructure as a Service using vBridge, InfluxDB, and Grafana. This session is being recorded and will be made available for replay sometime tomorrow. The slides will be made available. And please feel free to post any questions you may have for Ben in the Q&A box that’s down at the bottom of your Zoom screen. You can also throw it in chat. We’ll be monitoring both. Without further ado, I’m going to hand things off to Ben.
Ben Young: 00:43 Thank you very much, Caitlin. Glad to be here. When I kicked off with InfluxData in 2015, I never really thought I’d be doing a webinar on it, but here we are. So, as mentioned, I’m Ben Young. I’m head of cloud products for a company called vBridge down in New Zealand. Not to be confused with Australia or an island of Australia or anything like that. You can find me personally on Twitter there, I blog about a few things at that URL, and our corporate address is vBridge.co.nz.
Ben Young: 01:17 So a little bit about vBridge. Won’t bore you to death here. But like I said, we’re from New Zealand. We’re primarily a private Infrastructure as a Service company based in New Zealand. Two sites, Auckland and Christchurch, highlighted there on the map with the orange circles. We’re about 11 years old. We primarily deliver IaaS, but we also have a number of other services that sit around the side of that, remote backup solutions. And we’ve built a self-service portal that sits on top of all of this and automates both sites so that customers can consume our services wherever they need to and whenever they want.
Ben Young: 02:00 So quick look at the agenda. A lot of it would be relatively technical-based or at least sort of solutions. How we put everything together, obviously, focused on time series data. So a little bit about how we deliver cloud services. And this is important in leading into why we use and leverage time series data because it’ll become apparent when we get to that. Then, how we deliver real-time metrics to our customers. So we obviously have a number of services to them. They need to be able to get access to those key metrics. “Is the machine performing poorly? What’s going on?” So how we deliver those metrics. And we’ll do a bit of a deep dive into how we’re using Grafana on top of that and actually leveraging all the APIs to generate those in real-time. And then a little bit of a look at how we maintain our platform, not only from a capacity planning perspective. We monitor a number of our storage arrays, various different cloud services. So making sure that we are performing the way we should, but also we can get a good view into the future around how much capacity we need to maintain to keep our platform where we want it. And then we’ll just open it up for some questions. There may or may not be any of those. But if there are, I’m happy to answer those. So let’s get into it.
Ben Young: 03:30 So the vBridge way. So being down in New Zealand, we’re in a slightly unique position. We’re sort of cut off. We don’t have hyperscalers right at our door yet. Although Azure are coming to Auckland soon. But we can compete in different ways. We can’t compete on price. We don’t have the scale that they do, but we can beat them at a number of these things. So security is a key tenet of our organization, the way we design services and deliver them. We’ve recently gained an ISO 27001 Accreditation. We perform regular penetration testing on all of our platforms, including our corporate systems. So that really gives customers confidence in using our services, particularly government customers and large enterprise.
Ben Young: 04:24 Reliability. So we’ve got a high level of reinvestment in our platform. We use the best-of-breed products. And Influx is a key part of that too. Like I say, to make things reliable, we need to be able to monitor it. We have a strong customer focus. So I mean, lots of organizations say this, but we do live and breathe it. We love working with our customers. We love getting feedback from our customers and actually building that into all the solutions that we make, including the self-service portal, which then empowers them to use it. And then performance. Again, massive crossover with how we use Influx and monitor it. But we’re really a market leader in this space, we believe. We were the first to market here with an all-solid-state solution and then later on with NVMe with the Pure Storage arrays. And like I say, we measure absolutely everything.
Ben Young: 05:22 So enough of the kind of corporate stuff. The stack. So we obviously use a lot more systems than these, but these are kind of the primary products that we use to kind of deliver all of our cloud services. At the storage layer, we’ve got Pure Storage as our primary storage platform. We’ve got some Hewlett Packard Enterprise 3PAR arrays. We’ve got Cloudian for object store. So Pure and 3PAR will deliver all the block storage for our IaaS, we’ve got a Cloudian array that’ll deliver some object store, and then we’ve got some Isilon for what we call Enterprise NAS. At a compute layer, we’re using HPE Blade and Synergy systems and Dell MX compute as well.
Ben Young: 06:05 Networking layer, primarily Juniper at all of our core, and then out at the edge, where we also have a service wrapped around it, we use Fortinet. And then software: the hypervisor’s VMware. We use Veeam for not only backing up our Infrastructure as a Service, but we also have a number of their other products, such as Cloud Connect remote backup and backup for Office 365. And we also monitor these with Influx. Microsoft, obviously, Influx, Grafana. I’m a programmer by trade. I’m a .NET developer. So all of the stuff that we write here is in .NET. We use Metabase and Mailgun for some emailing out. So quite a few pieces in here. Like I said, it’s just a few of them, but it’s sort of what I use and touch day-to-day.
Ben Young: 06:57 So then we have this secret sauce, we call it. So this is the portal I alluded to earlier. We call it MyCloudSpace. It’s multisite. It’s not really built on anything. It’s all handcrafted by us. It’s not built on vCloud Director or anything to drive all of the hypervisor. We actually integrate directly with all the APIs. And then it also reaches out to things like Fortinet, all of the Veeam products, Cloudian, Isilon, to allow customers to spin up these services automatically, much like the experience they’d get in AWS or Azure or any of the other cloud providers. And we’ve won a few awards for it. So we’re pretty proud of it. And yeah, customers use it every day.
Ben Young: 07:39 So why time series? When I started looking at this in 2015, we needed something that was going to grow into the future that we weren’t going to have to manage. We wouldn’t have to have the pain of dealing with large SQL databases and query times and indexes and all that sort of stuff. So really, the data that we were looking at sticking in and wanting to collect was perfect for time series. So that’s why we went ahead and did this. It obviously scales predictably, not only in performance, but in terms of growth. And it plays nice with the other tooling. At the time, Grafana was still pretty new, but there were a lot of things that we wanted to do in that space. So it works really well with other products. And it’s really easy to integrate with. So you’ll see in a minute how we’re getting the metrics into Influx, but we’ve also got a number of other systems pushing data in. So with that HTTP API, we’re able to just push data in from almost anything that can invoke a web request, which is really neat.
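To illustrate the "anything that can invoke a web request" point, here is a minimal sketch of how a client might build an InfluxDB 1.x write. It formats one point as line protocol and builds the `/write` URL; sending the request is left out. The measurement name, tag names and field names are hypothetical and are not taken from vBridge's schema.

```python
from urllib.parse import urlencode


def _escape(value: str, chars: str) -> str:
    """Backslash-escape each delimiter character that line protocol reserves."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render a field value: booleans, integers (suffix i), floats, or quoted strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record: measurement,tags fields timestamp."""
    head = _escape(measurement, ", ")
    tag_part = "".join(
        f",{_escape(k, ',= ')}={_escape(v, ',= ')}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape(k, ',= ')}={format_field(v)}" for k, v in fields.items()
    )
    return f"{head}{tag_part} {field_part} {timestamp_ns}"


def write_url(base_url: str, database: str, precision: str = "ns") -> str:
    """URL of the InfluxDB 1.x write endpoint for a given database."""
    query = urlencode({"db": database, "precision": precision})
    return f"{base_url.rstrip('/')}/write?{query}"


# Hypothetical example point; the body would be POSTed to write_url(...).
line = to_line(
    "api_request",
    {"endpoint": "/vm/stats", "site": "chc"},
    {"duration_ms": 42.5, "status": 200},
    1622584800000000000,
)
```

The resulting string is the request body for an HTTP POST to the write URL, which is why any system that can make a web request can act as a producer.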
Ben Young: 08:42 And why did we choose Influx? So I started looking for a solution back at the end of 2015. And I’d like to say that I went through a massive process of selecting. But to be honest, it was pretty easy. At the time, Influx had the most mature market and product. The community was great. Everyone was really involved. So it was a pretty easy choice from our perspective. And we haven’t looked anywhere else since. It’s never let us down. It still continues to grow in maturity. The community’s still brilliant. So, yeah, we’re not planning on going anywhere. In fact, we’re planning on sort of doubling down on how we’re utilizing it.
Ben Young: 09:28 So let’s start taking a look at some of the solutions. From a customer-facing perspective, like I mentioned, we deliver everything through our MyCloudSpace portal. So what can customers do in here? Really, it’s about getting real-time metrics, or even historic metrics. Now, the way that vSphere collects all of these metrics out of the hypervisor is they obviously collate all the data, but it starts to smooth out the values. And we wanted to make sure that in our solution, a customer could go back to a month ago, a year ago, or see very big periods of time with very granular data. So we decided early on that we wanted to pull that data out, stick it into Influx, and then just basically keep it forever, or until we decide that we’ve consumed enough space and start archiving things out.
Ben Young: 10:18 So we collect all of the hypervisor metrics, not just the stuff of virtual machines, and that lets us capacity plan this stuff, which we’ll get into later. But from a customer perspective, they can sort of see their CPU and memory utilization, anything around their disk throughput, as well as their IOs, anything on their network stack. I’ll show you in a second, but basically, the screenshot down the bottom here is kind of what they see in our portal on a per VM layer. It defaults to the last hour, but they can very quickly generate 6 hours, 12 hours, and so on. And then we can actually give them on-demand graphs. If they want something from, say, December last year, we could do that and just deliver them a custom URL.
Ben Young: 11:05 So what problems does it solve? I mean, it gives immediate access to our customers to all of their vitals. If they’re trying to troubleshoot something in their environment, they can just log on, view these metrics, and generate bigger periods. Obviously, we touched on this, but the ability to show granular, non-smoothed metrics. Now, that’s really big, particularly if you’re trying to diagnose performance issues and you’re looking at a graph that’s been smoothed to five minutes. Obviously, a lot can happen in that time. The really nice thing, which customers can’t self-service out of here yet, is the ability to stack multiple servers on a single graph. So let’s imagine for a minute that a customer had a SQL box with a CRM server and whatever other servers that made up that particular application that was having an issue. We can actually stack them up so they can see, “Okay. Hang on. This server here or this component looks like it might be the troublemaker.” And then use that as a bit of a triaging and diagnosis tool. So our customers use that all the time. And it was super easy to integrate with InfluxDB. We obviously send metrics from MyCloudSpace, but also getting them from vSphere was easy, and Grafana sits on top really nicely. So it’s solved a number of issues for us.
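A small worked example of why smoothing matters for diagnosis. This is a sketch under assumptions: the numbers are made up, and the 20-second sample interval is assumed for illustration, so that 15 samples make one five-minute window.

```python
from statistics import mean


def rollup(samples, bucket_size):
    """Average consecutive samples in fixed-size buckets, like a coarse rollup."""
    return [
        mean(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]


# Made-up CPU readings: 15 samples at an assumed 20-second interval (5 minutes).
cpu_percent = [10] * 15
cpu_percent[7] = 100  # one short spike to 100%

# The same window after smoothing to a single five-minute average.
smoothed = rollup(cpu_percent, 15)
```

In the raw series the spike to 100% is obvious, while the five-minute average for the same window is 16%, which would look like an idle machine on a graph.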
Ben Young: 12:30 So architecturally, this is how it sits together. We’ve got a couple of data centers. Obviously, this would just scale horizontally as required. We’ve got InfluxDB 1.X sitting in here with Grafana on top. We have a little Java process still running that’s collecting these metrics. Basically, polling them and pushing them into Influx. And we also have the relay between the two. So in data center one, this collector’s just pulling the metrics from here, pushing it to here, but then it gets pushed sideways up to Auckland. And this is important if we look to failover MyCloudSpace from one site to the other: each site has a copy of each other’s metrics. We have a reverse proxy, which I’ll touch on. We’re using IIS’s reverse proxy for this particular solution to sit in front of Grafana. And then we’ve obviously got MyCloudSpace at the top and people consume it. Now, we run a sort of active-passive design where we can just failover from one to the other. And that’s why we’re relaying those metrics up and down the country, and that works really well. So like I said, that’s the sort of current architecture. Works brilliantly. It hasn’t changed a lot since sort of 2015, 2016. We’ve upgraded it a few times, but for the most part, it hasn’t skipped a beat.
Ben Young: 13:55 So if we look into the future, we’re already sort of running this architecture. So in parallel to this Influx environment that we’re running, we’ve actually started collecting metrics using this kind of design. So, again, a couple of data centers. We’ve now got effectively a dockerized version of Influx. This is really handy because now we can easily spin up dev/test environments. We can test different things. So this has been really good. We also inject all of our base templates into Grafana using the stack. So that’s been really good in terms of how we deploy and manage and test things here. And then you will notice the introduction of Telegraf. So we’re using that now to collect the metrics just with the native plugin there out of vSphere, instead of using this Java process, which admittedly, was a bit old and a bit long in the tooth. And it has its issues from time to time, but, hey. So that’s been great. That’s been a really nice upgrade as well. And then the rest of the stack is the same. Still running Grafana. You’ll notice now that the Influx database and Telegraf is actually sitting out in a different zone. We’ve actually got now a shared data zone, which sits here. Because instead of us duplicating efforts here for the internal monitoring side of things, we’ve now just got a single shared data instance. And then MyCloudSpace can just consume and push data as required into this zone.
Ben Young: 15:30 So snapshots. I looked yesterday and my GitHub ticket’s still open in Grafana’s project. Because I don’t know if you’ve ever created snapshots in Grafana, but when you do, it’s all client-side driven. So what I mean by that is, in Grafana, effectively your browser queries InfluxDB, and depending on the way you’ve set it up, it either proxies it through Grafana or it can go direct. Which is all well and good until you want to start creating snapshots programmatically. So our API on demand creates these snapshots every time those VMs load. The issue is we can create snapshots programmatically, but they’re just empty. If you go to view them, there’s no data in them, even though the queries exist within the panels. So we’ve had to effectively mimic what the front end’s doing and pull out that data manually.
Ben Young: 16:29 So I’ll run through this. So effectively, a user will request some statistics for a VM for X many hours. This is our API here. We’ll then reach out to Grafana’s API. Create the snapshot. It obviously returns us this empty snapshot. We then take a look at the snapshot it’s created and look through the panels and each of the queries that it was wanting to run. We then asynchronously run all of our queries off to Influx directly ourselves from our API. And this is what I was talking about here, about it being easy to integrate with. We can just query that data whenever we want to and however we want it. We get that data back from Influx and then we have to sort of apply some magic to it because it’s not quite in the right format that the Influx, sorry, the Grafana front end sort of converts it to and injects it into. So we sort of mimic that process. And we modify the JSON in our API, and then we save that back to the Grafana API and say, “Here’s your snapshot,” which is basically what the front end in Grafana’s doing. And then we basically return the URL. So then at that point, we’ve got grafan.mycloudspace.co.nz/, your really long query string for that URL, and then we inject that and embed it in the panel.
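The flow described above (read the panel queries out of the empty snapshot, run them concurrently, inject the results, save it back) can be sketched roughly as follows. This is not vBridge's .NET implementation. The `query` key on each target, the `snapshotData` key on each panel, and the shape of each result are assumptions about the dashboard JSON, and `fake_run_query` is a hypothetical stand-in for a real call to InfluxDB plus the reshaping step.

```python
import asyncio
import copy


def collect_queries(dashboard):
    """Return (panel_id, query_text) pairs for every target of every panel."""
    pairs = []
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            query = target.get("query")  # assumed location of the raw query text
            if query:
                pairs.append((panel["id"], query))
    return pairs


async def fill_snapshot(dashboard, run_query):
    """Run all panel queries concurrently and attach results to a copy."""
    filled = copy.deepcopy(dashboard)
    pairs = collect_queries(filled)
    results = await asyncio.gather(*(run_query(q) for _, q in pairs))
    by_panel = {}
    for (panel_id, _), result in zip(pairs, results):
        by_panel.setdefault(panel_id, []).append(result)
    for panel in filled.get("panels", []):
        # Assumed key under which the front end stores pre-fetched data.
        panel["snapshotData"] = by_panel.get(panel["id"], [])
    return filled


async def fake_run_query(query):
    """Hypothetical stand-in for querying InfluxDB and reshaping the response."""
    await asyncio.sleep(0)
    return {"target": query, "datapoints": [[1.0, 1622584800000]]}


# Hypothetical two-panel dashboard with made-up queries.
demo_dashboard = {
    "panels": [
        {"id": 1, "targets": [{"query": "SELECT mean(cpu) FROM vm"}]},
        {"id": 2, "targets": [{"query": "SELECT mean(rx) FROM net"},
                              {"query": "SELECT mean(tx) FROM net"}]},
    ]
}
```

Calling `asyncio.run(fill_snapshot(demo_dashboard, fake_run_query))` returns a filled copy and leaves the original untouched. In a real integration the filled dashboard would then be sent to Grafana's snapshot API, and the returned URL embedded in the portal.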
Ben Young: 19:24 So let’s hope the demo gods are kind, but I’ve got one here. So this is a little web server I have. You can see that we’ve got a performance dashboard. A customer can come in here and click and zoom. They can also zoom back out as required. You’ll notice that if I zoom all the way out that this is just the last 60 minutes’ worth of data. We’ve got the compute services around how hard the processors are working. The memory. The yellow line at the top is sort of how much is allocated to the machine, and the green line is how many active pages are happening at a memory layer. We’ve got how much throughput the network interfaces are doing, the disk IO, both read and write, and the actual disk throughput as well.
Ben Young: 20:14 Now, if a customer wanted more data than the last 60 minutes, then what they can do is they can just come in and click one of these buttons. So if they want the last six hours, you’ll notice that’s now gone off and done that process around creating a snapshot in the API, running all those queries, manipulating all the data, injecting it all, sending it back to Grafana, and then returning that URL, and then we’re consuming this through that reverse proxy. So now you can see that I’ve got more data that I can zoom in on. So that’s kind of how we deliver these metrics to the customers. The nice thing is because it’s all programmatically driven, if we decide we want to add another panel down here or change the way this is presented, all we need to do is log in to Grafana, modify that base template that we’re snapshotting programmatically, and then the next time it generates one, our API will just make those additional queries just the way we’ve put it all together. So that works really well.
Ben Young: 21:10 So back here. Here’s what I was talking about before around multiple VM snapshots. I’ve removed the names because this was some client data. But this was a CCTV environment. So you can see here very quickly how powerful this could be around stacking up multiple CCTV servers and seeing this green one here is higher or whatever if they were trying to debug. We can do the same for memory. We can do the same for disk IO and networking. You get the idea. So at the moment, that’s all we’re doing for customers. You’ll see maybe at the end of these slides around where I would like to go in the future for customers around alerting and things. So I’ll leave that to the end.
Ben Young: 21:59 In terms of the internals, so this is really where we start looking at how vBridge use InfluxDB internally to deal with a few things. So I’ve picked out some examples here. There are a few others, but this will give you a good flavor for how we’re using it. So like I mentioned, we’ve got a 3PAR array. We monitor it using a shell script, which I’ll show you, and this allows us to do performance diagnosis and it also allows us to capacity plan around data utilization. We also have, obviously, two compute clusters. We want to make sure that we’re planning appropriately. Because the portal allows customers to spin up workloads whenever and wherever they like, we need to make sure that we’re monitoring kind of what’s been provisioned versus our overhead. So we use that in our capacity planning meetings on a regular basis.
Ben Young: 23:06 We’ve got multi-site Veeam Cloud Connect. So the tools out of the box that Veeam give you for monitoring Cloud Connect infrastructure don’t really exist. I mean, yes, we can monitor the Windows machines that everything’s installed on, but if you really want to dive deep into, “Okay. Well, we know that 33 people are backing up to this 1 proxy or through this piece of infrastructure,” it’s very difficult to do with the tools. So we’re able to monitor those environments as well. And that lets us appropriately patch and also monitor workloads if we need to spin out new proxies or additional infrastructure to support those workloads that are coming through. And then we use it for MyCloudSpace. So that portal that I talked about. That has some custom code in it that basically logs every single request back to Influx and we can monitor the health of that. MyCloudSpace integrates with a number of systems. Each one of those downstream systems has a number of different response times. So this helps us kind of make sure that everything’s in check. We don’t then need to worry about kind of what’s happening downstream. We can monitor everything up, which will obviously lead on to what customer experience they are having at that time for any of those requests.
Ben Young: 24:29 So 3PAR monitoring. Basically, in a nutshell, we have a script externally to 3PAR that logs in via SSH. That runs the statpd and showsys commands. That obviously returns some data. We kind of scrape that data and then that gets injected into InfluxDB and then we visualize it. So like I mentioned, external server on a timer. Nothing fancy. Logs on to the 3PAR with SSH. We get some data back. We chuck it down into Influx, and then we’re able to visualize it. So it’s nice and simple. Works great. And this is kind of some examples of what we get out the other side. So we can see disk latency. How many IOPS are going through. Virtual volume IOPS. The sort of world’s your oyster, really. And the nice thing about capturing all of this data is we’re able to retrospectively go back and analyze, “Okay. Well, what was it doing then?” Or we might want a different graph and we can just create those panels and start looking at it, depending on what kind of thing we’re trying to diagnose. So, yeah, these look really good and we use it all the time.
Ben Young: 25:44 So capacity planning. Well, this is now moving into sort of our hypervisor. And this is using that new stack that I mentioned before. We’ve got Telegraf sitting out there collecting metrics from the local hosted environment, sticking them down into InfluxDB, and then we’re using Grafana to monitor this. So this really allows us to not only eventually deliver when we come over to MyCloudSpace to use these same metrics, but allows us to report long-term on growth and performance. An interesting sort of side note is, when all of these CPU-based attacks were happening, obviously, we’re a multitenant platform. We need to make sure that we’re patching the environments. When all of these sort of CPU patches were coming out to stop those side-channel attacks from happening, that actually added quite an overhead to our CPU. And we were able to almost see overnight, when the environments were patched, the additional overhead at a computational layer that those patches were making. And that’s allowed us to buy slightly different sized machines because we went from a sort of CPU-constrained, sorry, sort of a memory-constrained environment to a CPU-constrained environment. So when we bought the new blades, they had a slightly different build. The CPU to memory ratio is slightly different. So having the data really lets us make decisions like that as we look to the future for whatever compute we’re putting in. So that’s been really good.
Ben Young: 27:25 So, yeah, this is kind of what it looks like. Nothing fancy. But if we saw a big spike, a customer might have ingressed a whole heap of data. We can see – this is across a few days – the normal workday things happening. You can make some pretty cool-looking graphs with it. So, yeah, that’s kind of it. So Cloud Connect’s a slightly different beast. As I mentioned, we can manage, or we do monitor, all of our Windows environments with our internal system monitoring. But that doesn’t really give us an insight into what Veeam is doing. So thankfully, they actually have a number of WMI counters that we can query. So that’s exactly what we’re doing. Very similar to the 3PAR, we have a server here querying with WMI to all of our Veeam servers. It’s represented by one logo here, but there’s obviously multiple environments and multiple components. We are then pulling that data back, saving it off to Influx, and we’re able to visualize it with Grafana. And in a similar way that we’re able to capacity and performance plan and monitor the hypervisor environment, we’re also able to do the same with the Cloud Connect environment. And we sort of get graphs like this. And given the nature of this product, someone could be backing up at any time of day. We just don’t know. We could have a real rush on it at 5:00 PM because it’s the end of the business day. So this really helps us monitor it. It also means if we want to do some maintenance on a particular node, we can come in here and see exactly what’s happening on this one particular node. So it would be really nice if Veeam would start building out some of these types of reports for service providers, in particular. But this works brilliantly. And thankfully, they’ve got those WMI counters that are really easily accessible and we’re able to utilize those. So, yeah, some more graphs running concurrent tasks and other bits and pieces.
Ben Young: 29:31 And then we sort of dive into MyCloudSpace health. So this is really important because if this is performing poorly, then our customers are having a really bad time, and we don’t like that, especially when we’re trying to empower them to use this portal. We want to make sure that we have enough resource provisioned out at the web layer, but also a lot of the downstream systems that this integrates with, which is growing day by day, we want to make sure that they are performing as expected. So this is really not the single source of truth, but it’s a really good place to start diagnosing issues. So we’ve got a wee message handler that sits in our API and it logs every single request. So every single endpoint. We strip out, obviously, any personal session data, and we just record what endpoint is being hit and what response time was generated end to end, and then we’re able to visualize the performance. But we also send other metrics. So every login event into MyCloudSpace, successful or otherwise, we log. And this gives us good visibility into who’s logging in. Well, not necessarily who’s logging in, but how many people are logging in. And if we saw a large spike in failures, someone could be trying to knock on the door with a script and trying to break in. So it’s a really good way of visualizing that. And obviously, we’ve got Grafana sitting on the top. So nothing really fancy here. Up in mycloudspace.net, we’ve got a message handler. Every request just gets pushed down onto InfluxDB, and then we’re able to visualize it.
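The message handler idea can be sketched as a thin wrapper that times each request and emits one record per call. vBridge's handler is written in .NET and its details are not given in the talk, so everything here is illustrative: the class name, the record keys, and dropping the query string as a stand-in for "strip out any personal session data" are all assumptions.

```python
import time
from typing import Callable


class RequestTimer:
    """Wraps request handlers and emits endpoint, status and duration per call."""

    def __init__(self, emit: Callable[[dict], None],
                 clock: Callable[[], float] = time.perf_counter):
        self._emit = emit    # e.g. a function that writes a point to InfluxDB
        self._clock = clock  # injectable so the timing can be tested

    def wrap(self, endpoint: str, handler: Callable[[], int]) -> Callable[[], int]:
        # Keep only the path so session tokens in the query string are not stored.
        clean_endpoint = endpoint.split("?", 1)[0]

        def timed() -> int:
            start = self._clock()
            status = 500  # reported if the handler raises before returning
            try:
                status = handler()
                return status
            finally:
                elapsed_ms = (self._clock() - start) * 1000.0
                self._emit({
                    "endpoint": clean_endpoint,
                    "status": status,
                    "duration_ms": elapsed_ms,
                })

        return timed
```

Because the record is emitted in a `finally` block, failed requests are measured too, which matters when the goal is to spot a slow or failing downstream system.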
Ben Young: 31:18 And this is kind of what it looks like. So as you can see, there’s lots of dots here. This is over a relatively short period of time. So these are all the requests back into — a lot of them are into internal systems. A lot of the lower response times ones are just stuff that are coming from MyCloudSpace, but you can see there’s a big variance. These orange dots here, you can see, were sort of up near sort of 30-odd seconds and then they drop down here. So this could indicate that maybe that downstream system was having a bit of an issue. But this is a really good place for us to come in and start diagnosing. Even just monitoring we’ve got this up. I often hit this up, just monitoring it to make sure that everything looks the way that it should. We’ve also got other things like endpoint hit rate. So we can see every – in this particular segment – 15 times we’ve got a lot of other partners that use our APIs to pull out billing data or do things with bots. We want to make sure that they are behaving and not smashing our APIs. So this is a really good way of seeing that and visualizing that. And actually, sending them the graph just saying, “Hey, look, this is what you’re doing.”
Ben Young: 32:27 We also log status codes. So if we saw a large number of 401s or 403s, that would indicate unauthorized. That could be someone trying to, say, hack us. Or, HTTP 500 errors, we’re obviously seeing a large number of server-side errors. So that could mean that a downstream system is down. Something’s not happy. And then, obviously, we want to see just a big, long list of [inaudible] because that means that everything’s going well. This is kind of the successful log-in graph. So we’ve got a patented login-o-meter. Not really. But basically, you can see here we can see how many people are logging in and when. If there wasn’t any in this graph, there’s another red line here, which will show failed logons. And then we’ve got a slightly different panel. This will default to the last hour. So that works really well for knowing who and how many people are logging in.
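As a rough illustration of how logged status codes could be turned into the signals described here, the sketch below groups codes into classes and flags a window where unauthorized responses dominate. The class names and the 20% threshold are arbitrary choices for the example, not values from the talk.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Group an HTTP status code into the categories discussed in the talk."""
    if code in (401, 403):
        return "auth_failure"
    if 200 <= code < 300:
        return "success"
    if 500 <= code < 600:
        return "server_error"
    return "other"


def summarize(codes) -> Counter:
    """Count how many responses fall into each class."""
    return Counter(status_class(code) for code in codes)


def auth_failure_spike(codes, threshold: float = 0.2) -> bool:
    """True when the share of 401/403 responses exceeds the (arbitrary) threshold."""
    counts = summarize(codes)
    total = sum(counts.values())
    return total > 0 and counts["auth_failure"] / total > threshold
```

For example, `summarize([200, 200, 401, 500])` counts two successes, one auth failure and one server error, and `auth_failure_spike([200, 401, 401])` returns `True`.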
Ben Young: 33:31 And so then, where to next? So we’re running Influx 1.8 in that new futures environment. The one with Telegraf. So I’d really like to get us to 2.0. That obviously opens up a number of new opportunities for us. Some of them being I would like to extend what the customers can do. I think we’re pretty well covered internally, but I would love to have the ability to consume these VM stats and actually start proactively monitoring, processing the data, and actually alerting the tenant based on some rules that perhaps we set, or even better, that we could have the customers set the rules and the thresholds that they want. So I’d be interested in doing some work in that space. Obviously, internal alerting and advanced monitoring. We can probably hook this up, which kind of leads into the next point, but some more data sources. But actually, in a similar way that we would do it for a customer, we could alert based on unusual activity, or, “The growth rate has changed from this period to this period, and we’re going to run out in X many days.” There’s lots of different use cases there. Once we sort of scratch the surface of it, I imagine, we could probably come up with 100 ideas.
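The "we're going to run out in X many days" alert could start from something as simple as a straight-line fit over recent usage. This is a sketch of one possible approach, not something the talk says vBridge has built. It uses ordinary least squares on (day, used) samples, and the units and sample values are made up.

```python
from typing import Optional, Sequence, Tuple


def days_until_full(samples: Sequence[Tuple[float, float]],
                    capacity: float) -> Optional[float]:
    """Estimate days from the latest sample until usage reaches capacity.

    samples are (day, used) pairs. Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    # Least-squares slope and intercept of used = intercept + slope * day.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    full_day = (capacity - intercept) / slope
    last_day = max(x for x, _ in samples)
    return max(0.0, full_day - last_day)
```

With samples `[(0, 100), (1, 110), (2, 120), (3, 130)]` and a capacity of 200, the fitted growth is 10 units per day, so the estimate is 7 days from the last sample. Comparing the slope between two periods would give the "growth rate has changed" signal mentioned above.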
Ben Young: 34:59 And then, obviously, the hot topic on everyone’s lips is machine learning. It would be pretty nice to have predictive alerting, monitoring, analysis, what-if scenarios, either from an internal perspective or a customer perspective. Now, I don’t know necessarily exactly where I would start with that, but I think that could be a game-changer for customers and ourselves in terms of monitoring. And particularly that what-if scenario. What if I increased the memory to 24 gig, or what if I reduced it, what would happen? And there’s lots of other scenarios that we could deal with there. Or even looking at ransomware. If we could work out and find a pattern for unusual disk activity and actually start looking through all of the metrics for that kind of signature, could we start alerting to say, “Hey, you want to go check out the server because we detected a ransomware signature on the disk writes or reads,” or throughput, or whatever it might be? So I think there’s lots of opportunity in the future for us to use machine learning, much like every kind of industry.
Ben Young: 36:19 That’s kind of all I had for you, really. So I don’t know if any questions have come through or even how they get delivered to me, but I’m sure someone will speak in my ear soon. So thanks for putting up with me. It’s been really good. I hope you found it somewhat enlightening. And like I say, if you’ve got any particular questions, ask them now, fire them through or you’ll find me on Twitter or on my blog. And yeah, more than happy to answer it or give you a demo or whatever you need. So thank you very much.
Caitlin Croft: 36:53 Thank you, Ben. That was great. Before we dive into the questions, I just want to remind everyone again of InfluxDays North America coming up on October 26th and 27th. So Call for Papers is open. So please submit your abstract. If you’re using InfluxDB, Telegraf, or anything in the platform that’s time series related, we’d love to see your submission. So please do that. I think it’s open until the end of the month. All right. So Ben, a couple of questions. How was it going from Java to Telegraf? I’d love to learn a little bit more about what that process was like. What started it? Any tips or tricks you have along the way?
Ben Young: 37:40 Yeah. So the Java thing, it was a little application called vSphere 2 metrics, with the number two in the middle, that we used. And that was all that really existed at the time. What drove the change was, well, I don’t write Java, for starters. But also, the number of metrics and the types of metrics that we were collecting were sort of stuck in time. So as vSphere has grown, there were a lot of other metrics that were able to be surfaced through vSphere that it wasn’t aware of and wasn’t grabbing, and that we particularly wanted to report on. So that was kind of the main driver for us looking at Telegraf, or actually just changing. So the process for us getting started was really easy. I mean, thankfully, there’s 1 million and 1 examples of how to get started. There were a few things, and I can’t recall off the top of my head exactly what we needed to tune, but because of the size of our environment, there were a couple of specific settings around how much data and how often, a few things that you can configure in the Telegraf module just around what metrics to pull at what rate and what granularity. So there was a wee bit of tweaking there because we found that there were just too many metrics to collect in such a short period of time. But once we kind of got that ironed out, we got everything the way that we wanted it.
Ben Young: 39:13 So in terms of transitioning, it was really easy. And like I mentioned, the two things it really solved was it got us the additional metrics that we needed that the vSphere 2 metrics thing couldn’t grab. And there was a lot of reliability issues with the Java process crashing. And then we’d come on and find that there was big hour-long gaps or whatever within our data, which wasn’t great. So we eventually wrote another shell script which ran, effectively, on a cron job to make sure the process was running. And if it wasn’t, start the process. That sort of fixed it, but a bit of a band-aid. But since moving to Telegraf, it’s been a delight. I mean, not only does it never miss a beat but also having that more modern environment that’s able to be spun up with Docker means, when we want to upgrade it, it’s easy. When we want to spin up another environment or test something, it’s easy, rather than having to spin up a new server, install all of the components on it. It’s been brilliant. So, yep, that’s been a really nice change. There’s been nothing really negative come out of that whole process.
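The band-aid Ben describes, a script run from cron that restarts the collector if it has died, could be sketched roughly as follows. The original was a shell script and is not shown in the talk, so this Python version is only an illustration of the pattern. The process pattern and start command in the usage comment are hypothetical, and it assumes a Unix host where `pgrep` is available.

```python
import subprocess
from typing import Callable, Sequence


def is_running(pattern: str) -> bool:
    """Return True if any process command line matches the pattern.

    Relies on `pgrep -f`, which exits with status 0 when it finds a match.
    Caveat: if the pattern also appears in this watchdog's own command
    line, pgrep may match the watchdog itself.
    """
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def ensure_running(
    pattern: str,
    start_cmd: Sequence[str],
    check: Callable[[str], bool] = is_running,
    start: Callable[[list], object] = subprocess.Popen,
) -> bool:
    """Start the process if it is not running.

    Returns True when a start was triggered and False when the process was
    already up. `check` and `start` are injectable so the logic can be
    exercised without touching real processes.
    """
    if check(pattern):
        return False
    start(list(start_cmd))
    return True


# Hypothetical usage, invoked every minute from cron:
#   ensure_running("collector.jar", ["java", "-jar", "/opt/collector/collector.jar"])
```

As Ben notes, this only narrows the gaps in the data rather than removing the cause of the crashes. A container restart policy or a service manager does the same job with less custom code, which fits his point about the Docker-based Telegraf stack being easier to run.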
Caitlin Croft: 40:23 Oh, I’ll be sure to share that with the Telegraf team. I’m sure they’ll enjoy that. Which plugin are you using for Telegraf? Are you using the client libraries? Just kind of curious about that.
Ben Young: 40:37 I don’t know, actually. We’re just using the standard vSphere plugin that sits inside it.
Caitlin Croft: 40:45 There’s just so many different Telegraf plugins and I can’t even keep track of them all. There’s close to 300 of them and most of them have been built by the community. And so I was just curious which one you were using.
Ben Young: 40:58 Yeah. It’s whatever the most — I can’t remember the exact URL to it, but it’s the one that is most commonly used when you want to consume the vSphere data. Yep.
Caitlin Croft: 41:11 Cool. And you mentioned that you’ve been using InfluxDB since 2015. So I would love to know, were there — because that’s a pretty long time. That’s six years. So have there been any interesting long-term trends that you’ve discovered? Any downsampling that you’ve had to do?
Ben Young: 41:31 No.
Caitlin Croft: 41:31 Anything that’s changed because you have this data?
Ben Young: 41:35 Not really. It’s been a relatively static environment for us, which has been nice. Like I mentioned, it’s been super stable, which has been brilliant. We kind of guesstimated how much space we would need and that worked out pretty good. And we’ve had to only really extend the capacity a couple of times on the drive. There’s been no loss in performance, which has been great. So as the data set’s grown, I mean, we’re probably still only tiny in terms of globally what people are using Influx for, but I know for certain that if we were sticking this in the SQL database, it would have fallen over a long, long, long, long, long time ago. And we almost certainly would have culled out a lot of the data. So, yeah, it’s been easy to use and consume.
Ben Young: 42:28 And admittedly, when we solved the client-facing graphs, that was kind of the first reason we put it in. So that we could deliver those graphs in MyCloudSpace. It sort of sat static like that. We were just collecting data and delivering graphs. We didn’t really change much. Patched a few things along the way. And it wasn’t until sort of the last couple of years that we’ve started really looking at how we do monitoring and solving a couple of monitoring issues. That 3PAR one was a biggie. The tools that are provided from the Hewlett Packard side of things don’t really cut it when we’re trying to do real-time diagnosis. So that’s when we started looking at monitoring our things internally and then also started doing capacity planning and other bits and pieces. And then sort of in the last 9, 10 months we’ve upgraded to that new Telegraf stack with Docker and other bits and pieces. So that’s kind of been the journey, really. But like I said, we really want to utilize it more heavily because it’s great. Especially with Influx 2.0, we want to start connecting up to different data sources and see what we could do there.
Caitlin Croft: 43:42 And have you started playing around with Flux at all as you look more at 2.0?
Ben Young: 43:46 Not yet. So that’s almost certainly the place I will start when we get there and I find some time to do it. So, yeah, I’ve seen and read a number of different blog posts and watched a few videos on it, and it’s pretty exciting.
Caitlin Croft: 44:05 Yeah. Well, we definitely have some good training courses to get people up and running with Flux. So it sounds like you guys hadn’t really ever used a time series database before finding InfluxDB. So were there any interesting lessons learned along the way during your initial implementation?
Ben Young: 44:27 I’m trying to remember back that far. [laughter] I don’t think so. There were the normal hurdles of when you have a new technology: how does this work, how to install it, and what does that mean? And we were on such an early version, and I can’t remember the exact terminology, but it was when the way that the data’s stored changed on disk. So it was a bit nerve-wracking running the commands just because that’s what the documentation said, to cut it over to the way that the data’s stored on the back end. But that worked fine. So no, other than the initial learning curve. And like I said at the start, what really led us to picking Influx was the fact it had that big community following and so many examples and people talking about it, and, “How do I do this?” And, “I’m experiencing this,” which is the exact same thing that we’re experiencing. So being able to sort of diagnose it that way. That community is so important from our perspective. We normally buy things that are all completely supported, like 6-hour support on all the hardware and yada, yada, yada. So we’ve got to make sure that whatever we buy is supportable. And this was slightly different, and it’s worked out.
Caitlin Croft: 45:48 Yeah. That’s awesome. I always love hearing from community members, especially ones like you who have been using it for so long, where you sort of stumble across it. It’s clearly easy for you to learn, pick up along the way, and just run with it. And it’s kind of amazing how far we’ve come with 2.0 and then also with InfluxDB IOx, which, of course, the team is working on. And the cool thing with 2.0 is there’s all the pre-made templates that people can get started with. So it’s even easier than it ever was before to get started with InfluxDB.
Ben Young: 46:22 Yeah. And I was just talking yesterday with someone about the developer experience, and the same could be said for, I guess, the operations person’s experience. Having that documentation, and your documentation is awesome, is something a lot of other projects lack, and that’s where it just makes it so hard to consume. And I’m actually integrating with a very large software company at the moment, who I won’t name, and their documentation’s awful and their developer experience is terrible. And just the hurdle to kind of get over those initial, “Okay. Now, I understand how you’ve put everything together.” It’s horrible. So yeah, it’s really refreshing to have good documentation and a good community and everything that goes alongside of it. It just makes it so much easier to consume and recommend to people, right? And that’s probably part of the reason you’ve grown so quickly is because it’s a great product. It’s got the great community, and it makes it pretty easy for word-of-mouth to spread and sing its praises.
Caitlin Croft: 47:26 Yeah. Our community is great. We had a time series meetup where we’ve been focusing on hobbyist projects, people using InfluxDB at home. And it was really funny. It was all about home brewing, about making beer and stuff. And someone thought it was about the package management system and was really confused. And everyone immediately, before I could even say, “Actually, it’s about making beer,” everyone’s like, “Oh, no, this is about beer.” It was really entertaining. And there’s been other cases where someone asked a question. And once again, before I can even answer it, our community’s like, “Oh, no, they’re going to get there. Don’t worry.” [laughter]
Ben Young: 48:11 Yeah. It’s funny, I just bought a Traeger smoker. And like everything in our world these days, it’s wirelessly connected to the cloud, and you can monitor it from your phone. And I was like, “I wonder if I could monitor it from here and stick the data into Influx?” And sure enough, you Google around and 101 different people have started doing it. So yeah, that’ll be one to trial. To stick it into my Influx thing here in my lab would be pretty fun.
Caitlin Croft: 48:38 Well, we actually did a time series meetup about barbecuing. Monitoring your barbecue. So two different Influxers did it two totally different ways. So it’s kind of cool. Once again, that’s the power of open-source, right? There’s multiple ways to do it. So one of our guys uses a FireBoard, and he actually created an InfluxDB template. And then Will did it a completely different way from Scott. Still was getting the same kind of data into Influx. So it’s a lot of fun. It’s a lot of fun to see what you can do with it at home and work. [laughter]
Ben Young: 49:12 Yeah. Totally. Yep. Definitely.
Caitlin Croft: 49:15 Awesome. Well, it doesn’t look like there’s any more questions. Thank you, Ben, for presenting today. This was really interesting. It’s always fun to hear about how our community is using our products. Once again, this session has been recorded and will be made available for replay. And the slides will be made available later as well. Thank you, everyone, for joining today’s webinar.
Head of Cloud Products, vBridge
Ben Young is Head of Cloud Products at vBridge, a cloud service provider in New Zealand and a Veeam Vanguard. He specializes in the automation and integration of a broad range of cloud & virtualization technologies. He surfaces these automations through a self-service portal that customers use on a daily basis to manage their IaaS and PaaS workloads.