Taking DevOps to the Next Level – The Five Steps
Companies are committed to delivering higher levels of customer satisfaction for their online services. Unfortunately, many organizations trying to support these initiatives take an interrupt-driven approach where they monitor everything with every tool available. The first step toward managing to these high SLAs is to review your current approach and toolset against the business needs, which will help you create a path to continuous service delivery optimization.
The first step in getting control and visibility into your DevOps environment is to collect and instrument everything. But how do you get started, and what are the next steps? In this webinar we distill the learning from hundreds of our customers into a simple five-step process.
Watch the Webinar
Watch the webinar “Taking DevOps Monitoring to the Next Level – The 5 Step Guide to Monitoring Nirvana” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Taking DevOps Monitoring to the Next Level – The 5 Step Guide to Monitoring Nirvana.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Tim Hall: VP of Products, InfluxData
• Charlene O’Hanlon: Editor and Writer
Charlene O’Hanlon 00:05.334 Good afternoon, good evening, or good morning, depending on where you are in the world, and welcome to today’s DevOps.com webinar, Taking DevOps Monitoring to the Next Level—The Five-Step Guide to Monitoring Nirvana. I’m Charlene O’Hanlon. I’m the moderator for today’s event. And I thank you all for joining me. Before we get started with the webinar, we do have a couple of housekeeping items. First of all, we are going to take audience questions for today’s webinar. So please, at any time during the presentation, if you have a question for our panelists, please feel free to use the control panel, submit your question, and we should take about 15 minutes or so near the end of the webinar to go over the audience questions. Also, today’s webinar is being recorded, and anybody who signed up for the webinar will be sent a link to the actual webinar and to the slide deck following the event. The webinar will also be available on the DevOps.com website on demand. So if you’ve signed up and you’re with us, you will be receiving that link, so feel free to share that.
Charlene O’Hanlon 01:14.593 Okay. Moving along now. Let’s get to today’s webinar. Taking DevOps Monitoring to the Next Level—The Five-Step Guide to Monitoring Nirvana. Our guests today, our panelists today, are Tim Hall, who’s Vice President of Products, and Chris Churilo, who’s Director of Product Marketing at InfluxData. Welcome to you both. Thanks for joining me.
Chris Churilo 01:37.083 Thank you so much, Charlene. And we appreciate your time today. And we appreciate everyone being on this webinar. And we just want to start by saying that we understand you’re under a lot of pressure. We’re personally also under a lot of pressure to make sure that we deliver on the high levels of customer satisfaction for our online services. And, unfortunately, a lot of us fall into the trap of trying to support these initiatives by just kind of running around monitoring everything and taking an interrupt-driven approach to making sure that our services are just running, let alone adhering to the high-level SLAs that we’ve committed to. And so, today, we want to talk a little bit about our own experiences, and the experiences of a number of our customers, on what approach you should take, or should consider taking, in order to support these high levels of SLAs—to make sure that you can do this in a more systematic way and not in a kind of hurried, interrupt-driven, ad hoc way. And also to help you create a path for continuous delivery, making sure that we take a look at your toolset and your business needs so that it’s a nice, streamlined process.
Chris Churilo 02:49.669 So today, in our agenda, we’re going to review what we’re calling The Five Steps, which we’re hoping will help you get control and visibility into your environment so that you can collect and instrument everything, but also do this in a nice, streamlined way. We’ll talk a little bit about what InfluxData is, and then I think the highlight is really sharing with you our experiences, as well as some of the experiences of our customers: where they maybe fell short on hitting their SLAs, what are some of the things that they did to take a step back, look at their business initiatives, try to understand how, as a company, they can take a look at what to collect, how to collect it, how it actually will have a good impact on these initiatives, and how they’re proceeding forward. And then, of course, we’ll end our webinar today with a Q&A session. So let’s move on to introductions. Tim Hall, I’ll let you introduce yourself real quick.
Tim Hall 03:50.840 Hey, folks. It’s Tim Hall. I’m the Head of Products here at InfluxData.
Chris Churilo 03:55.071 And Tim has, of course, the responsibility of managing our engineering roadmaps, making sure that we can deliver on our products. Another thing that he’s responsible for, and I’m partially responsible for, is that we have an online version of InfluxDB Enterprise which we call InfluxDB Cloud. And we have a pretty big SLA attached to our heads, making sure that we can keep that service up and running as smoothly as possible. And so a lot of the experiences that Tim’s going to be talking about today will actually be from monitoring our own online services. And my name is Chris Churilo and I’m the Director of Product Marketing. And for a number of years, I’ve been a line-of-business owner for a number of SaaS companies. And so in the early days, it was just about monitoring basic things. Is the network up? Are these databases working? But over time, I learned very quickly that in order to make sure that we actually make these human beings which we call our customers and our users happy, we had to monitor a lot more. And we actually had to make sure that we had a conversation with all the stakeholders involved in these online services to make sure we understood what was relevant, what is important to actually accomplish the goal. I’ll be sharing a number of my experiences, as well as the experiences of our customers.
Tim Hall 05:20.267 So some background on the company. Founded in 2013, and really, the focus is on building a modern and open source platform for metrics and events. And this primarily stemmed from attempting to solve problems related to gathering lots of monitoring and telemetry data and trying to put it in more traditional sources—be they relational databases—and then there were attempts made to use SQL solutions, and Hadoop infrastructure, and other things. And what our founder Paul Dix arrived at was that there really wasn’t a great way to both rapidly ingest and egress data that’s focused on telemetry, and in particular for DevOps monitoring. So the guiding principles that he set out were: we wanted a short “time to awesome.” The speed at which we could install, deploy, and get value out of the platform that we’ve built needs to be as fast as possible. We also wanted it to be easier for developers to consume and scale out, and again, get that value out of the platform and the infrastructure as quickly as possible.
Tim Hall 06:31.471 Just as a side note, yeah, coming from—I spent three years at a product company focused on Hadoop. And the time to awesome is a little bit slower there [laughter]. When I first got involved with InfluxData products, I was able to install and configure them on my laptop in five minutes, and that included the download time. And get up and running very, very rapidly. And that was super impressive to me and one of the reasons why I wanted to join Paul and the team here. So again, the software is open source, available for download, and you can sort of play with the Getting Started guides immediately and see what we’re talking about.
Tim Hall 07:12.467 And so, since 2013, we’ve got about 70,000 active servers reporting back to us in terms of folks that are using it. We’ve been able to achieve, through the open source community that’s been sort of built up around this, over 11,000 GitHub stars, which are votes for the tech, which we appreciate. So, for those of you that are familiar with InfluxData, thank you for participating with us in the community. And we’ve been able to get over 300 paying customers right across both our enterprise customers who are buying the enterprise edition of the product, which is based on the open
source core and allows you to do horizontal scaleout and high availability. And then that number also includes the folks that we’re supporting on the platform as a service offering that we call InfluxDB Cloud. Those same enterprise bids deployed and dedicated to individual customers and running on the Amazon EC2 infrastructure. We run that on their behalf. And so, one of the nice things is we get to drink our own champagne. We actually use the InfluxData platform to manage, monitor, and provide telemetry on how those instances are running in EC2. And we use that, including doing alerting, and notifying us in terms of what’s happening. And we’re going to talk a little bit more on how we got to where we are, as well as where our customers are on this sort of maturity model and nirvana of DevOps monitoring.
Chris Churilo 08:41.664 So we just want to give a little more background on why we think it’s really important to start taking a systematic approach now. And the key thing is that we’re actually seeing in the technology landscape that there are three huge changes happening that are going to impact your monitoring efforts quite severely. And so, the first trend that we see is a lot of our customers actually moving towards a microservices architecture, which is fantastic because it really supports your agile and your continuous delivery initiatives. But on the flip side, that means there are even more moving parts. Which means that there are more things that you have to monitor in order to make sure that you understand that things are working correctly, or to help determine what’s actually slowing something down. So instead of the old days, where it was just a couple of servers and databases, now we have tons of little things that you need to monitor to make sure things are working.
Tim Hall 09:39.128 Well [laughter], the funny story on that is even in the old days you’d get on a 40-person bridge call when we were trying to figure out why something was wrong with the website. So imagine now, with the delivery of microservices, the increased complexity in terms of what potentially can go wrong. Of course, in the old days, the problem was always someone had misconfigured a router, or a database had run out of disk space. So those folks got called out fairly early to go check those simple pieces of the infrastructure. However now, with the delivery and the proliferation of microservices, you can just imagine how complex those calls are going to become. So, getting that telemetry out, and sort of understanding how you’re going to instrument your microservices calls, the infrastructure they’re running on, and the interaction of them—that’s super important.
Chris Churilo 10:27.907 That’s exactly right. And in addition, people are finally living up to some of the dreams of continuous delivery—they’re able to actually roll out new code on a daily basis. We do know that some of our customers aren’t quite there yet, they’re still on a weekly basis, but it just adds to the complexity, and they need to monitor even more things. So, the second big trend that we see is the movement from—I don’t want to say mainframe, that was a long, long time ago—but definitely, it’s from VMs to containers. We see a lot of people that are looking at containers or have adopted containers because, once again, it helps in their continuous delivery efforts. But another thing that is important to know is you also need to be able to monitor how well the containers are doing, and how well all the little bits are doing within the containers. You want to make sure you can maximize the use of these containers. You also want to make sure that you can understand what’s going on within those containers, and whether you need to scale up, scale down, or take a really deep look at what’s going on to ensure the performance of your applications.
Tim Hall 11:33.922 I think the challenge, Chris, on this item in particular, is everybody’s now moving from sort of the VMware platform to Docker. Docker’s being used more and more prolifically with container orchestration platforms, be that Kubernetes—now we’re seeing people use Mesosphere, and so on and so forth. And the question is, how are you monitoring the ephemeral arrival and departure of all of those containers within your infrastructure? And the bulk of the data center operations tooling that exists today from the legacy vendors, if you will—if you go all the way back to IBM Tivoli or HP OpenView days, or even if you go to the newer platforms that are focused on application performance monitoring—they’re not really good at understanding the nature of the container or the container orchestration layers. And so, this provides an opportunity for you—again, as Chris was saying—to take a step back and look at how the adoption of these new pieces of infrastructure impacts the applications that you’re delivering, and what telemetry you need to understand their health and availability.
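At its core, tracking the ephemeral arrival and departure of containers boils down to diffing successive snapshots of what is running. The sketch below is a minimal, platform-agnostic illustration (the container IDs are made up; real orchestrators expose this through their own APIs):

```python
def diff_containers(previous, current):
    """Compare two snapshots of running container IDs.

    Returns which containers started and which stopped between snapshots.
    """
    return {
        "started": sorted(current - previous),
        "stopped": sorted(previous - current),
    }

# Hypothetical snapshots taken one polling interval apart.
snapshot_1 = {"web-abc123", "db-def456"}
snapshot_2 = {"web-abc123", "worker-789"}   # db stopped, worker started

print(diff_containers(snapshot_1, snapshot_2))
# → {'started': ['worker-789'], 'stopped': ['db-def456']}
```

A monitoring agent would run this diff on every poll and emit the arrivals and departures as events, rather than relying on a static host inventory.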
Chris Churilo 12:43.617 Right. And then the third trend that we see—and you may or may not see it in a DevOps world—is we’re seeing a lot of little IoT devices all over the world. So, there are a ton of sensors out there that are deployed so they can collect data from things like solar panels or from equipment in a factory, cars, wearables, etc., etc.
Tim Hall 13:07.480 In agriculture, we’re getting a bunch of folks in the farming industry, and industrial installations and plants and things. Now it’s all starting to be instrumented with sensors.
Chris Churilo 13:16.350 That’s right. And so, we’re actually seeing some operations teams starting to get pulled into the monitoring of those devices. And so, if you think microservices and containers are a lot to monitor, just imagine all these sensors that are out there. Because we’re talking about fields of corn [laughter]. We’re talking about lots and lots of installations of solar panels. We’re not talking about just three or four. And they need to make sure that these things are running efficiently, and that they’re collecting the right kind of data. In addition to just collecting the data, how you slice and dice that data in real time is also becoming a really big nightmare for some of these vendors out there. So, for example with the solar panels, you may have these solar panels across a wide geographic range and you’re going to want to be able to understand how much energy is being produced—not only across all these panels but maybe at the city level, the state level, or the county level. And there are various reasons why you’re going to need to be able to see that data. And being able to pull that in real time is really important.
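The slicing and dicing Chris describes is essentially a roll-up of sensor readings by geography. A toy sketch, assuming readings arrive as (panel_id, city, state, watts) tuples (the names and numbers are invented for illustration):

```python
from collections import defaultdict

# Hypothetical solar-panel readings: (panel_id, city, state, watts generated).
readings = [
    ("p1", "Austin", "TX", 310.0),
    ("p2", "Austin", "TX", 295.5),
    ("p3", "Dallas", "TX", 280.0),
    ("p4", "Denver", "CO", 260.0),
]

def rollup(readings, key):
    """Aggregate watts by an arbitrary grouping key (city, state, ...)."""
    totals = defaultdict(float)
    for reading in readings:
        totals[key(reading)] += reading[3]
    return dict(totals)

by_city = rollup(readings, key=lambda r: (r[2], r[1]))   # group by (state, city)
by_state = rollup(readings, key=lambda r: r[2])          # group by state

print(by_city)   # → {('TX', 'Austin'): 605.5, ('TX', 'Dallas'): 280.0, ('CO', 'Denver'): 260.0}
print(by_state)  # → {'TX': 885.5, 'CO': 260.0}
```

In practice a time series database does this grouping server-side at query time, but the shape of the operation is the same: one raw stream, many aggregation levels.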
Tim Hall 14:24.939 One of the really interesting things—those of you here in North America might remember the eclipse from a couple of weeks ago. I actually got the opportunity to sit with one of our customers who is using solar panels for energy generation and storage. And they actually showed me the graphs from a couple of the panels that were running on the day of the eclipse. And you can definitely see the noticeable dip in terms of the energy generation. So, it was really cool to see our tech used in that particular way.
Chris Churilo 14:52.949 So we’re going to now just focus in on some of the more specific challenges that our customers face when it comes to monitoring. And I think we’re probably all in this kind of a situation where we have lots of different groups that are monitoring different things—networks, servers, their own applications. In some cases, there are some applications that are very specific point solutions to monitor those things. In some cases, they might be using solutions that can actually look at kind of the full stack of the application if you will. But in any event, the big challenge that our customers tell us time and time again is that there are so many solutions within their organization. And trying to understand: which ones should they continue to keep? Which ones should they just abandon? Is there a way where they can kind of bring them together? It’s a really big issue. And they’re a little bit reluctant to drop things, but they know that they’ve got to in order to make sure that they can scale.
Chris Churilo 15:53.314 In addition, there are a number of limitations that they’re faced with in some of these solutions. One is being able to create aggregated views of the data across all these different solutions. In some cases, they use the free version, and so there might be some restrictions on the data retention policy—they can’t read them, they can’t set them, or maybe it’s locked to one particular policy. And then the third thing is sometimes the metrics that are collected are locked down, and they’re not able to pull those metrics out. And the reason that they want to pull those metrics into maybe a different system is—sure, it’s great, you have those metrics in a solution to monitor the health of the systems at this given moment in time. But you want to then be able to pull that data out and start looking at it from a historical perspective, to then determine: where should we go going forward? What are some changes we need to influence based on this data? And having this data locked down is not making it possible for them to do that. Another thing is capacity planning, which is something that we are always being bugged about by our respective vice presidents or our finance team. What are we actually going to do to make sure that we can service our customers appropriately? What stuff do we have to buy to make sure that our service is running really well? And I think we can all agree that it is still a really manual process. I think there are just a few customers that are starting to get out of that, and the rest are held back by the challenges we talked about before and on this slide as well.
Tim Hall 17:28.212 And then understanding what lead time you need to meet those changes in capacity needs. Right? So, if folks are adopting cloud infrastructure, your ability to make these decisions quickly is pretty easy. But for those that are still purchasing and deploying hardware and equipment on-prem, you’d better build in that six-to-eight-week, furniture-like lead time to buy, rack and stack, and get everything going. So you definitely see some changes there. And I think one other thing Chris mentioned, which is maybe not technical but organizational, is there’s a real sea change going on in terms of who is doing the monitoring. Right? That’s the whole rise of DevOps. Right? So the fact that we have people who have been sort of data center operations—many of you are probably involved in that sea change now—which is, as you build and/or deploy applications, you are getting more of the responsibility to monitor, instrument, and maintain those things in production in ways that we didn’t have in IT before. And I think that’s one of the important factors here in terms of taking a step back and looking at the existing solutions and how they fit with trends like agile delivery through to continuous integration and deployment. And then who is actually doing the monitoring? What do they need? How do the—?
Chris Churilo 18:40.930 Yeah, yeah. In fact, that mirrors some conversations that I’ve had with our customers, where they’re excited that their development team is monitoring their own code, but then they were surprised—and they’ve now realized they shouldn’t have been surprised—that how these components are being monitored, and how they’re being presented to the team, is not going to be consistent. It’s all going to be different. So great, you put the keys in the hands of everybody, but then they’re basically locking and unlocking the door all very differently. And then, finally, I think a lot of ops teams don’t want to be the bottleneck, especially when it comes to trying to determine what are some of the new metrics that we need to start looking at. We described some of the metrics and some of the components that a lot of our customers need to start looking at, but there are also many other organizations within any company that has an online service that are seeking to see if they can start tracking different things to get that competitive edge. And sometimes, collecting new metrics can be time-consuming. It’s not as obvious or easy as it was in the old world, like, “Oh, I just want to check for latency,” or “I just want to check for disk usage.” You have to kind of start to look at a combination of metrics and determine, “Okay, maybe I need to look at something else.” So one example that we’ve heard from our customers is that they want to be able to talk to the product manager about how many times a particular function is being called within their code. And they believe that that could determine if that’s the bottleneck in their software. So it’s a totally different kind of metric that they’re gathering. And you really have to start to work together to determine what are those things that we need to look at to make sure that our service is performant.
Chris Churilo 20:32.119 All right. So now that we’ve done a good job of scaring everybody [laughter] with all the work that we’re faced with, we want everyone to kind of step back and take a look at how we can avoid getting stuck in that interrupt-driven mode of constantly firefighting. And so Tim and I sat down. We thought about what makes sense for all of us to follow to make sure that we stay out of that firefighting mode. And we came up with what we call a “monitoring maturity model,” where we broke it up into five different steps. So you can see here—the first step is just collecting. And then you’d be able to take that data, correlate it, and triage it. From there, hopefully, you can then start to identify trends so that you can go back and maybe collect more or change things. And then, eventually, you can use the trend information to notify someone so they can fix something. And then, of course, our nirvana: we all want to have this awesome ability to predict the future, the “what if” analysis. And so I think at first blush everybody probably thinks, “Hey, yeah, we do all that stuff today. What’s the big deal here?” Well, what we believe is that we all do some of these things at a probably very superficial level. But once you start to dig in a little bit more, you start to realize that maybe we can do a little bit better job. Or maybe there are things that we hadn’t considered. Or maybe we’re really actually stuck in that kind of collect, correlate, and triage stage. So when we took a step back we started to ask, “What’s the purpose of monitoring any of these DevOps items anyway?”
Chris Churilo 22:12.926 And so we came up with a list of six things. So the first thing is just being able to reduce risk by closing the visibility gaps that you have between the different systems. Another thing that is really important to remember is that you want to make sure that we, as staff members supporting these solutions, become a lot more efficient, and also help to eliminate human error. I wish human error was completely gone, but I know just from our experience in the last couple of weeks that, unfortunately, is not the case. But we want to make sure that that is the purpose, and understanding that is going to help us make sure that we collect and do the right things to achieve it. A third thing is—of course, our finance people are always getting on our case about decreasing CapEx and OpEx. Or at least making it so that they can understand what to expect going forward. And does it fit the business model?
Tim Hall 23:09.086 How does that tie in with the business goals and objectives? They’re okay if you’re increasing revenue and sort of keeping your operational costs low as a factor of that revenue creation. But of course, they want to know what is the impact. So if you double your customer base, does that mean you have to double what you have deployed as IT infrastructure? Right? And how are you going to know that? Do you have monitoring and telemetry data that will tell you that for the systems that you’ve deployed?
Chris Churilo 23:35.530 That’s right. Do you want to go over some of these steps?
Tim Hall 23:37.037 Sure. Yeah. And other things—as you think about the goals of what it is you deliver as a business. Right? Lowering the impact of performance issues on customers. How can you identify those early and ensure that your customers, whether they’re internal employees or externally-facing folks, can continue to leverage the applications and services that you’re providing? Another big one from a sales and a finance perspective is churn. So how do you know whether what you’ve deployed is impacting your customer churn? Have you identified that in terms of visibility? And what data are you collecting that will tell you that? And again, for a lot of you who are thinking, “Oh, I thought we were going to talk about CPU and memory monitoring”—what we’re trying to encourage you to think about is, what are the business goals or the things that you’re trying to deliver? And then how do you work backwards from there when you start thinking about what data you’re collecting? Yeah. At some level, CPU, the network traffic—all of those things are important, but how are you going to use them in that second step that Chris has outlined here around correlate and triage? How does that information allow you to understand these purposes and goals so that you can get a better handle on controlling your infrastructure performance, and how that impacts the business goals and objectives that you’re delivering?
Chris Churilo 24:50.793 Yeah. Absolutely. I mean, the example that I can give everybody is that I remember going into the ops team at a previous SaaS company that I was working at, and they were really proud of some new dashboard that they made for me. Because they wanted to prove to me, “Oh, great. Look, we’re monitoring the network performance. Isn’t it doing spectacularly?” And I’m like, “That’s wonderful. However, I have this support guy yelling at me because customers can’t log in.” So clearly, it’s not a network issue. It’s not just about, “Great. The network is working.” We needed to make sure that we instrumented our services so that we could understand when a customer would possibly have an issue before it actually turned into a support case.
Tim Hall 25:31.657 How do we avoid the customer telling us that there’s a problem with the service itself [laughter]—that is the ultimate goal. And especially, if you’re being asked to take on more responsibility from a DevOps perspective for your code and what you deployed, you’d better be thinking about what your thing does and how you can identify its health early in the process.
Chris Churilo 25:49.608 That’s right.
Tim Hall 25:49.734 So that leads us right back to the collection thing.
Chris Churilo 25:51.787 Right, right. And so in that particular example, what we were able to do is then have a conversation about, “Okay. What is it that we really should be doing here?” And then we started to take a look again at all the metrics that we were collecting to make sure we had the right combination, at least for that time period, of stuff to determine: can people log in and do something?
Tim Hall 26:11.849 And even on the solar example that I gave. Right? So if you look at the stages of the maturity model, the collection we’ve sort of focused on here, and many of us are familiar with collecting individual metrics. How do you put them together? And how do you understand what business goal they’re helping you to understand and achieve with the trend analysis? Right? So are you trending in a healthy way? Your password reset requests—if you’re getting a lot of those, why are you getting them? Right? Is it because you have a performance issue somewhere else? Are you even monitoring the fact that your users are trying to do a password reset, in terms of the number of accounts? And then, are you notifying on that—is that a thing that you would like to be told about before it shows up in your support queue? And then last but not least, back on the solar projection, it’s looking at things like: is the efficiency of the solar panel steady? Or can you predict the rate at which the power it’s generating is degrading over time? And at what point would you want to send out a service tech? Right? So I want to be able to predict—I don’t want to just send out somebody on a scheduled basis, like every 30 days they need to go and wipe the solar panel down. I want to know when to send somebody out, and I can build a schedule based on the predictive information that you’re gathering out of these things. But that’s the power of collecting the right information, correlating it together, identifying that trend, and then notifying whoever you want to take action. But also then saying, “Hey, I can actually optimize the routes for my service personnel,” for example, if you’re working in that kind of environment. Or how about just looking at things like disk space: I can predict when I’m going to run out of disk space if I see these trends occurring, and then you can be prepared for those things a little bit better to make sure that you’re not getting a service interruption. So I think that’s critically important. So hopefully that ties together those five different stages of the model, and I’m sure many of you are at various points of maturity on each of those things. But what we’re going to do now is take a step into each stage and really sort of highlight what we think the critical points are, in terms of taking that step back that Chris mentioned, as we’re contemplating achieving nirvana for DevOps monitoring.
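The disk-space prediction Tim mentions can be sketched with a simple least-squares fit over recent usage samples. This is a toy illustration of the idea, not any particular product’s implementation (sample values are made up):

```python
def predict_exhaustion(samples, capacity_gb):
    """Fit a line to (hour, used_gb) samples and estimate the hour the disk fills.

    Returns None if usage is flat or shrinking (no exhaustion predicted).
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    # Ordinary least squares: slope = cov(t, u) / var(t).
    cov = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var               # GB consumed per hour
    intercept = mean_u - slope * mean_t
    if slope <= 0:
        return None
    return (capacity_gb - intercept) / slope   # hour at which usage hits capacity

# Usage growing ~2 GB/hour from a 100 GB baseline; a 500 GB disk fills at hour 200.
samples = [(t, 100 + 2 * t) for t in range(24)]
print(predict_exhaustion(samples, capacity_gb=500))  # → 200.0
```

An alerting rule built on this would fire when the predicted exhaustion time drops below, say, the procurement or re-provisioning lead time, rather than waiting for a static 90%-full threshold.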
Chris Churilo 28:26.665 So, the first step, of course, is just collect data. We’ve been doing this forever. We all know you need to collect data in order to understand how your systems are performing. You guys get it. We all get it. But one of the things that we recommend you do is take inventory of what you’re collecting today, and have a frank conversation with your colleagues: “Are we collecting the right stuff, or is there stuff that we don’t need?” And really pair that with the second point: “Is the data that we’re collecting going to help us achieve our goal?” And work with the line of business or even your executive team to determine: what is the goal for our services? And they may look at you like that’s such an obvious thing that everyone should just know, but once you start to dig in with your exec team on that, they start to recognize that actually, just having a service that’s performant is not a goal—that’s way too high-level. Digging deeper—does that mean that people can log in? Does it mean they can download their reports? Is the search capability performant at whatever metric you assign to it? Laying those things out will really help the business side understand what we’re trying to achieve together, and it will also help us understand whether we are collecting the right things.
Tim Hall 29:49.834 And you’ve really got to interrogate that inventory of metrics. Right? Interrogate it, “Why are we actually collecting this?” Have you ever seen a failure in X? Is that really important? It’s great that you can collect it, and this goes back to Chris’ example of all the dashboards she was shown. Right? It’s great that they could collect it, but they were missing the whole thing. They were missing the fact that users were struggling. So that interrogation is important. Why are you collecting it? And how does that fit into the goal?
Chris Churilo 30:14.422 And then the next thing that we recommend everyone do is look at how you’re actually collecting that information. Of course, if you know what you want to collect and you know that these things exist on various servers, or wherever they are, that’s pretty easy. But there may be times when you don’t know, and so this is where we also recommend that you consider both sides of the metric collection spectrum—whether you’re pulling data or pushing data—because there may be instances where you just need help determining what metrics are out there and available to you that you can potentially use to hit your goal.
Tim Hall 30:54.063 There’s been a lot of debate, I would say, in the monitoring industry, if you will, around push versus pull metric collection, and the answer simply is both. And we say both because, for example, when we’re talking about Docker containers and getting telemetry and the ephemeral nature of those things, a pull model may make way more sense, where essentially the container itself is registered and makes itself available, and then you pull from it. Right? And given the ephemeral nature of those things, they’re appearing and disappearing on a fairly regular basis, so having a static configuration that’s trying to find things that are appearing and disappearing is not the way to go. Right? However, on the flip side, if you’re trying to do a pull style across various firewall boundaries, that’s not going to work. Right? And so if you’re trying to integrate your DevOps infrastructure between cloud and sort of on-prem solutions, or you’re working with a business partner on a process integration, you may want a push style, right, which can more easily traverse those technological barriers and boundaries. And so the answer that we’ve come to at InfluxData is that both of these are valuable approaches, and it’s about using the right approach to solve the problem based on the kind of things that you’re dealing with. So there’s no religion here about one versus the other. The answer is both.
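The pull-versus-push tradeoff Tim describes can be sketched in a few lines of Python. This is a toy model with invented class and metric names, not InfluxData’s or any collector’s actual implementation; it only illustrates why ephemeral targets suit pull and firewalled agents suit push:

```python
# Toy model of the two collection styles. All names here are illustrative.

class PullCollector:
    """Collector polls registered targets on a schedule (Prometheus-style)."""
    def __init__(self):
        self.targets = {}  # name -> zero-arg callable returning a metric value

    def register(self, name, read_fn):
        # Ephemeral things (e.g. containers) register themselves on startup...
        self.targets[name] = read_fn

    def deregister(self, name):
        # ...and disappear again without breaking any static configuration.
        self.targets.pop(name, None)

    def scrape(self):
        return {name: read() for name, read in self.targets.items()}


class PushCollector:
    """Agents send metrics outbound to the collector (StatsD-style).
    Useful when the collector cannot reach the agent, e.g. across firewalls."""
    def __init__(self):
        self.received = []

    def push(self, name, value):
        self.received.append((name, value))


# Pull: a short-lived container registers, is scraped, then vanishes.
pull = PullCollector()
pull.register("container-42.cpu", lambda: 0.35)
snapshot = pull.scrape()
pull.deregister("container-42.cpu")

# Push: an on-prem agent behind a firewall sends outbound.
push = PushCollector()
push.push("onprem-db.connections", 118)
```

The design point is the same one made above: the pull side tolerates targets appearing and disappearing, while the push side only needs an outbound connection, so a real platform wants both.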
Chris Churilo 32:33.314 And then ask yourself the question of what else we’re missing. Right? Now that you understand what the goals are and you have the inventory of what you’re collecting, what are some other things that you should start considering collecting? And I think it’s a question we have to ask ourselves continuously. Right? This thing is a moving target, unfortunately. Keeps us employed, keeps it interesting as well. Okay. So now that you’ve collected the data, then, of course, you want to be able to correlate that data to understand: is there an issue? And if there is an issue, to triage it, address it, and make sure that things are working just great. But one thing that we have to warn everybody about, and you probably know this from your experience, is that the data comes in at all kinds of different frequencies and timeframes, and, of course, it comes from different sources. So sometimes there’s going to be some work to normalize, if you will, the data, so that you can actually understand it and do some comparisons across the different metrics that are coming in.
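That normalization step can be sketched very simply: bucket every series onto a common interval before comparing them. The function and metric names below are invented for illustration (and the averaging choice is one of several reasonable ones), not a specific product API:

```python
# Minimal sketch of normalizing metrics that arrive at different frequencies
# by bucketing them onto a common interval (here 60 s) before correlation.

def bucket(points, interval=60):
    """points: list of (unix_ts, value) -> {bucket_start: mean value}"""
    sums, counts = {}, {}
    for ts, value in points:
        key = ts - ts % interval
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# CPU sampled every 10 s; request latency reported irregularly:
cpu = [(0, 0.2), (10, 0.4), (30, 0.6), (70, 0.8)]
latency = [(5, 120), (65, 300)]

cpu_1m = bucket(cpu)          # roughly {0: 0.4, 60: 0.8}
latency_1m = bucket(latency)  # {0: 120.0, 60: 300.0}

# Both series now share the same time buckets and can be compared directly.
shared = sorted(set(cpu_1m) & set(latency_1m))
```

Once the series line up on shared buckets, “did latency spike when CPU spiked?” becomes a straightforward question instead of an apples-to-oranges one.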
Chris Churilo 33:32.791 And then another thing, kind of at the entry level of this step, is you definitely want to make sure that you understand what the baselines are for just basic service availability. Lay that out with the team, and make sure that you actually have the right kind of data and the right understanding of the data so you can establish that baseline. And this is just at that kind of very superficial level. Of course, you then want to take that to the next level and go beyond just basic availability, and try to make sure that we can offer a very performant solution to our customers.
Tim Hall 34:06.834 You don’t want this to be a report where you find out 30 days later that your service was down for a month. That’s a failure [laughter]. Right? But going back first to the “What are you missing?” question: at every stage, ask, in order to achieve the goal and get that visibility, what are we missing from a data perspective, and how does that fill in the picture as we’re trying to correlate up to things that everybody can talk about consistently? Right? Users are successful logging in. Users are successful getting access to the information and capabilities that you’re attempting to expose. Every time you deploy any microservice, how does that fit in with the landscape of what you already have? And so asking that question, “What are we missing? How does this fit in? How does this fit in with the goals?”—both from the collect and now the correlate-and-triage stages—is becoming hypercritical.
Chris Churilo 34:57.827 So now that we’ve established this baseline, what we want to do is get smarter, and of course, we want to be able to identify some trends so we can get ahead of things. Nothing is worse than getting a ticket from a customer telling you things aren’t going well. We always want to get to the point where, based on historical data, we can forecast when things are going to go sideways. So if we know that a bunch of people are going to be logging onto a server at 8:00 AM on a Monday to download their reports, we should look at that and determine, “Is there enough capacity to handle all that?” Because during the rest of the week maybe we just have a few customers logging into our solution, but we should definitely look at historical data to establish, “What are the trends? Are we going to be okay? Let’s get ahead of this before the customers even complain.”
Tim Hall 35:45.583 What is normal? I mean, that’s really the question. Right? What does a normal day look like? And then, do you have any examples of what an abnormal day looks like? Right? That’s how you want to decide what you’re going to alert on and who you’re going to notify—which we’ll come to in the next stage—but you’ve got to establish those baselines.
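One simple way to make “what is normal?” concrete is a statistical baseline: treat the historical mean as normal and flag anything more than k standard deviations away. This is a sketch of the idea only; the sample values are invented, and real alerting thresholds need tuning per metric:

```python
# "Normal" = historical mean; "abnormal" = more than k standard deviations away.
from statistics import mean, stdev

def find_abnormal(history, recent, k=3.0):
    mu, sigma = mean(history), stdev(history)
    return [x for x in recent if abs(x - mu) > k * sigma]

# Last week's request rates per minute looked like this...
history = [100, 102, 98, 101, 99, 103, 97, 100, 101, 99]
# ...and today we see these readings:
recent = [100, 104, 250, 98]

alerts = find_abnormal(history, recent)  # [250]
```

Real services often need seasonality on top of this (the Monday-morning report-download spike Chris mentions is “normal” at 8:00 AM and abnormal at 3:00 AM), but even this crude baseline beats alerting on fixed magic numbers.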
Chris Churilo 36:02.723 And I think that goes right into that third point of managing your inventory. And we’re not talking about the inventory of the stuff that you’re selling to your customers—this is your infrastructure. Right? So do you understand what the safety thresholds are for each of the components that could potentially impact the service offering?
Tim Hall 36:20.291 Yeah. Particularly, folks are, again, as we mentioned, adopting more container and container orchestration. Right? You have some flexibility to move things around a little more dynamically, or if you’re working in a cloud infrastructure, you have the ability to spin up more. So your safety levels can be a little bit more dynamic in terms of what you know you can sustain. But do you know how long it will take you to do an elastic expansion of your services? Again, you’ve got a microservices architecture deployed. You’ve got tons of Docker containers. You’ve got lots of hardware infrastructure at your disposal. How long will it take you to do that—will it happen dynamically? What are the metrics that you’re collecting that would drive that kind of expansion? One of the things that we did recently was a webinar on Kubernetes Pod Autoscaling using the InfluxData platform. And if you haven’t had a chance to look at that, it really is sort of enlightening. Kubernetes offers you one way in which you can scale, based on CPU. But there may be other metrics that are indicative of how the business is running, and that may change your mind: you can’t just monitor CPU as the only factor in determining when and how you want to scale out your microservices architecture using these sort of container approaches. So that’s what we’re getting at in managing your inventory. Think of it like physical inventory, but from a compute resource perspective. How long is it going to take you? Do you know where your safety zones are? Do you know what normal looks like? That trend analysis is critical to getting this under your wing.
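The “don’t scale on CPU alone” point can be sketched as a scaling decision driven by several signals at once. The metric names and thresholds below are invented for illustration; this is not the Kubernetes autoscaler API, just the shape of the logic:

```python
# Toy scaling policy combining an infrastructure metric (CPU) with
# business-facing metrics (queue depth, tail latency). Thresholds invented.

def desired_replicas(current, cpu_pct, queue_depth, p99_latency_ms,
                     max_replicas=20):
    """Scale out if ANY signal says we're hot; scale in only if all are cool."""
    if cpu_pct > 80 or queue_depth > 1000 or p99_latency_ms > 500:
        return min(current * 2, max_replicas)
    if cpu_pct < 20 and queue_depth < 100 and p99_latency_ms < 100:
        return max(current - 1, 1)
    return current

# CPU looks fine, but the work queue is backing up -> still scale out.
n = desired_replicas(current=4, cpu_pct=35, queue_depth=5000, p99_latency_ms=420)
```

A CPU-only policy would have left this service at 4 replicas while the queue grew; the combined policy doubles it, which is exactly the failure mode Tim is warning about.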
Chris Churilo 37:59.561 I think another example that we had to face firsthand—well, mostly Tim—is that our customers are monitoring a bunch of stuff, and sometimes they don’t realize that they’re going to hit the storage limits of the solutions that they bought from us, because all of a sudden they’re monitoring even more stuff. And so the support team had a couple of tickets letting us know that some customers actually got themselves into a bit of a pickle because, all of a sudden, they were going way over their storage levels. And I think it’s very typical to get into that mode of trying to resolve the customer’s issue by panicking, opening a bunch of tickets internally, yelling at Tim. But what Tim and the team were able to do was actually take a step back and have a conversation about, “What can we do as a business to help alleviate or minimize that?” And so he was able to look at the pricing structure and come up with a new offering that basically allows customers to bump up their storage without having a negative impact on their collection efforts. And the price was very reasonable. It was just a nice insurance policy. Right? And so I think this is something that’s important to remember: you can go back and talk to your business line managers and have that conversation so that we can prevent these kinds of horrible things from happening.
Chris Churilo 39:28.185 The next step is notifying and taking action. And I have to say that a lot of us are still in manual mode. We get a notification that something’s not right, and then a team will go and address it. They get a page and try to take care of it. And there’s some automation that’s also happening. Tim’s talked about spinning up containers automatically to make sure there’s enough resources. But we’re always asked to do things in a more automated fashion, in a faster way than ever before. And the only way to get to that point is to take a look at this landscape and start to understand where we can add more automation, and not just add it in there for the sake of adding it.
Tim Hall 40:14.254 And going back to the notion of collection. Right? If you’re going to notify and act and you want to get a human involved, sure, they can do a lot of processing and analysis on top of the information you’ve gathered. But if you want to go to the automation phase, now the question is, how do you gather the right telemetry that gives you a consistent answer that the machine can operate rules against? And so, again, know who you’re going to notify: are you going to notify an automated process, or are you going to notify a person? The desire, in terms of moving up that maturity model curve, is to shift from manual to automation, and of course, everyone wants to do it faster. So those are the steps.
Chris Churilo 40:56.073 And then, of course, five, where we’re all trying to get to, is the ability to have a much more automated way of doing that “what if” analysis, where we don’t have to manually collect all that information. And I think there are very few people at that optimal level. But if you don’t go through those four previous steps in a methodical way, it’s going to be very difficult to get to this point without always being in manual mode. And I think it’s worth having a conversation with your executive team about this: “Look, all that firefighting is great. Yeah. Sure. We’re helping customers in the short term, but really what we all need to do collectively is work out how we can do this so that we can do a much better job of predicting how much our services are going to cost the business.”
Tim Hall 41:44.725 Let me go back to the storage option that you mentioned, Chris, for InfluxDB Cloud. One of the things that we can do for customers now. Right? Once a new customer onboards in InfluxDB Cloud, they’re buying a particular subscription of the service. That subscription is allocated a certain amount of storage. And one of the things I’d like to do is predict when it is that they’re going to run out of that storage space. After they’ve been running for 30 days, I’ve collected the telemetry information on their instance. I can create that baseline. I can actually notify someone if they’re in plan or out of plan, but I can also now make predictions. And the prediction I’d like to make is: based on their current rate of ingest, if that remains constant, when will they run out of disk space? And in some cases, customers are like, “Oh. I didn’t realize that, number one. And I would like, from a cost threshold perspective, to stay where I am today.” And with that prediction, we can go out and engage with the customer and have that conversation: “Hey, we predict you’re going to run out of your storage in another 30 days. We want to make you aware of that because you have an option, A, to buy more storage as you get there. Or, B, if you want to maintain that current cost threshold, you may want to look at your retention policy ahead of time.” And so that’s the predictive power here. Right? How do you figure out what it is you’re doing and when it’s going to happen, and make the right business decisions based on what kind of results you’re driving?
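At its simplest, the prediction Tim describes is a linear extrapolation: assume the observed ingest rate holds and solve for when usage hits the plan’s limit. A sketch with invented numbers (the real service presumably fits the trend rather than using just two points):

```python
# Linear extrapolation of storage usage: days remaining at the current
# (assumed constant) ingest rate. All figures below are invented examples.

def days_until_full(used_gb, capacity_gb, daily_ingest_gb):
    """Days remaining before used_gb reaches capacity_gb."""
    if daily_ingest_gb <= 0:
        return float("inf")  # usage flat or shrinking: no runout predicted
    return (capacity_gb - used_gb) / daily_ingest_gb

# 30 days of telemetry: the customer grew from 40 GB to 70 GB -> 1 GB/day.
daily_rate = (70 - 40) / 30
remaining = days_until_full(used_gb=70, capacity_gb=100,
                            daily_ingest_gb=daily_rate)
# remaining == 30.0: time for the "buy more, or trim retention" conversation
```

Thirty days of headroom is exactly the window where the “option A: buy storage, option B: tighten your retention policy” conversation is still proactive rather than a fire drill.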
Chris Churilo 43:09.745 And what’s cool is it ends up being really great for customer satisfaction as well, because the customers appreciate that we’re seeing them and notifying them early enough, before it becomes a problem, and giving them the option to either stay at their cost levels and just drop data, or upgrade and continue to go forward.
Tim Hall 43:29.092 Yeah. Proactive support’s way better than reactive [laughter] support, as it turns out, so.
Chris Churilo 43:34.558 So we’re going to talk a little bit about InfluxData, and then we’ll go into some of these customer use cases for the last 15 minutes. Do you want to go—?
Tim Hall 43:43.751 Yeah. Again, from a company perspective, we set out to build a platform to address these new workload requirements. We realized that to gather all the telemetry information, you need metrics and events at an extraordinarily high volume. We know that you’re going to be dealing with sort of regular time series, which is collection at standard intervals, but now everyone is dealing with irregular events as well. And those occur whenever they care to, and usually at a very inopportune moment. So being able to deal with both regular and irregular series is something we set out to solve. The ingest is just half the problem, as we mentioned with the storage side of things. Setting retention policies: when do we want data to exit the system to continue to maintain performance and availability? If you don’t need to query high-precision data over the course of a year (that’s usually not done from a real-time monitoring perspective), you need to be able to get rid of that data too. Right? So we definitely contemplated those three areas. We really focused the platform on delivering query performance based on time, delivering specific functions to allow you to do aggregation and summation using time-based functions directly in a SQL-like language, and allowing for large-range scans of many records very, very quickly. And the way that we’ve done that is by ordering, ranking, and doing limits within the query language itself, and within the storage engine that supports those queries, to deliver these answers very, very quickly back to you. And of course, there’s a big focus on scalability and availability. The Enterprise edition, again, is a distributed, multi-instance deployment to allow you to have no single point of failure. And the result is a fast, consistent platform for dealing with all the metrics and events that you’ve got, and hopefully always available—again, based on monitoring it and ensuring that you’ve got the resources available to run it.
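The downsample-then-expire pattern behind retention policies can be shown in miniature: keep raw points only for a short window, but keep coarser aggregates longer. This is a toy sketch with invented function names, not InfluxDB’s actual retention machinery:

```python
# Retention in miniature: hourly aggregates are kept long-term, while raw
# high-precision points are expired after a short window.

def downsample_hourly(points):
    """points: list of (unix_ts, value) -> sorted list of (hour_start, mean)."""
    buckets = {}
    for ts, v in points:
        buckets.setdefault(ts - ts % 3600, []).append(v)
    return sorted((h, sum(vs) / len(vs)) for h, vs in buckets.items())

def expire(points, now, max_age):
    """Drop points older than max_age seconds (the 'retention policy')."""
    return [(ts, v) for ts, v in points if now - ts <= max_age]

raw = [(0, 10.0), (1800, 20.0), (3600, 30.0), (7200, 40.0)]
hourly = downsample_hourly(raw)                   # survives long-term
recent_raw = expire(raw, now=7200, max_age=3600)  # only the last hour of raw
```

The tradeoff is the one described above: you give up year-old millisecond precision you were never going to query in real time, and in exchange ingest and query performance stay predictable.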
Tim Hall 45:53.856 So here’s a quick view of the platform for those who may not be familiar with the way we contemplate things. We break up what happens into three different and discrete steps. An accumulation of data coming from all the normal sources: IoT sensors, systems, applications, message queues. Those come in the form of regular and irregular metrics and events, and we provide the means to gather those in a number of different ways; we’ll get into that in a minute. Then the analysis. Right? You want to be able to write queries to interrogate the data that has been collected. And then take action. That action can be notifications, driving automation like the Kubernetes Pod Autoscaling example we gave, and providing visualization for people to actually understand what’s happening, establish those baselines, and do the trend analysis. And again, Enterprise is the Enterprise edition; we also run that in Cloud. Those are the commercial offerings of the platform.
Chris Churilo 46:58.591 Okay. So we’re just going to go over a couple of customer use cases. It was after talking to these customers, and also looking at our own experiences, that we were able to come up with that maturity model I mentioned earlier. So Nordstrom is one customer, and they have a very important event every year: their annual sale. Probably most of their revenue actually gets generated from that annual sale. And one of the good things that they’ve done is they have really rallied behind that to make sure that they think about what could actually go wrong from a service availability perspective, and use that to understand, “What are the things that we should do? What are the things that we should collect from a monitoring perspective? How should we react to some of those situations?” And also, what can they do to predict going forward—because this is a once-a-year event that is always going to happen—to make sure that they can maximize the revenue from that opportunity as much as possible.
Tim Hall 48:06.495 And I think it goes back to what we’re saying about the goals of the business. How to maintain health of it, and then making sure that they’ve got the right instrumentation and the ability to ensure that those goals are being met. So that’s a great example of—they’ve moved up in terms of maturity from both collection and in terms of correlate and triage. And now they’re really investing heavily on the baselining and the notifying and action layers. I think prediction is probably next. And again, I think there are a lot of customers that that’s where they want to get to. But they’re having to establish best practices sort of in those earlier stages of the maturity model before they can get there.
Chris Churilo 48:46.103 And they have a pretty large development team, so they have a lot of people managing their own metrics. But having all those teams now rally around a single goal has really helped them make sure that things will be a little bit easier during that time. Another customer is a company called Coupa, which basically has a SaaS platform for helping companies with their spend management. So if you have a marketing team that’s buying a bunch of ads, or you have a manufacturing team that’s buying a bunch of components or equipment, etc., it’s about being able to understand the spend and basically unify that business buying process across the company. So for them, even for the simple act of buying a pen, you want to make sure that that system is up and running all the time. And Sanket is the VP of cloud and security at Coupa, and he had the foresight to really stop the business from going into that 100% ad hoc mode of just making sure they can hit their SLAs. What they did was take a really deep inventory of all the different things that they were collecting. They looked at all the different monitoring solutions that they had internally, and they realized that there was actually a lot of work that they needed to do. They started to realize that maybe a lot of collection services were just being used for ad hoc purposes without the company goal in mind. They’re really trying to hit 100% uptime. So looking at that, taking that inventory, and then refactoring their entire approach was pretty critical for them to be able to take that breath and do things in a much more systematic way.
Chris Churilo 50:36.365 The other thing is, every year, just like any company, they wanted to do their forecasting for the following year. And it probably took them a couple of months just to do the data collection for that forecasting, and he recognized that that was just such a big waste of time. And it wasn’t until he really had that conversation with his team about that, and what they could potentially do to end that manual collection process, that they were able to start to get out of that ad hoc process and do things more systematically. And then the final customer that I want to share with you is a company called Vonage, which basically provides online call centers. So let’s say that you want to build a support team and you want to have a bunch of phone banks ready, you would call Vonage and you could buy that from them. And the interesting thing about their business is that most of their customers are support centers with phones, and you know how we all feel about support. When we’re calling support, it usually isn’t a good thing, so we’re already kind of in a bad mood. And then we’re on a phone, which puts us in an even worse mood, because we’d rather do everything in a browser online. And then, on top of that, we’re put on hold for a couple of minutes. So we’re really not feeling that great. And the moment that call drops, we are really cranky. And any time a call dropped with their service, that was a pretty negative impact, and they would get a lot of complaints. People, of course, would call back yelling at them, or, unfortunately, we’ve all seen this, they started hitting the Twitterverse with a bunch of complaints.
Chris Churilo 52:23.956 And for a while, unfortunately, the way that they were actually monitoring the health of their system was through Twitter. And there’s nothing worse than having to look at that to determine, “Oh yeah, there’s some kind of a performance issue.” And when they took a step back to try to understand why that was the case—because they were monitoring all their solutions—it turned out that they were actually gathering metrics from their solutions at a one-minute interval, and that was just not cutting it. One minute was enough time for somebody to get dropped from a call and quickly tweet out something nasty about the company. So, as a company, they took a step back and looked at what numbers they needed to support to really have a great service. And you can see that they decided to go with a five-nines approach, which is pretty aggressive and a pretty big goal, but given their service, I don’t think they had a choice—they really had to make that their goal. They also took inventory of everything that they were collecting, including the time intervals at which they were collecting the metrics, and then determined whether that was going to work. And clearly, it didn’t. They realized after they did that inventory that one-minute intervals weren’t going to cut it. They actually had to go down to, in some cases, millisecond intervals to really determine when things were not working.
Chris Churilo 53:48.136 In parallel, of course, they were also told by their dev team that they needed to go into more of an agile, continuous delivery mode, which the engineers all supported. But after they took inventory of all their systems, they also realized that they had a lot of legacy systems. And so they were a little bit concerned about how they were going to start collecting metrics from these legacy systems. How would they be able to bring those guys forward? You obviously can’t just re-engineer them into brand new shiny systems overnight—it’s going to take some time. And so they realized that they needed a solution that could be, basically, a plug-and-play mechanism for collecting metrics on these legacy systems to bring them into the future. And then eventually, when they had the bandwidth to update those systems, they could plop those into their existing monitoring solutions. But I think if you talk to them, you can hear very quickly that they had to take that step back. They were doing a lot of scrambling. And they are now—actually, I think they are on a once-a-day cycle for updating their services from a software perspective. So they’re pretty proud of that, having come from a very typical waterfall approach with three-month deployment cycles just a couple of years ago. They’ve done a really fantastic job. And they also have some proof to offer their customers and their exec team that they are able to uphold that five-nines guarantee, based on what they’ve been doing with the collection of all of these metrics.
Chris Churilo 55:33.381 So, in conclusion, as far as where I would put our customers in the maturity model, I think they’re all striving to get to that predict stage, but there is still more to do in that identify-trends and notify-and-act area. I think they can get to that final nirvana, though, because they have recognized that there are still some shortcomings in their solutions, and they’ve set goals on what they need to implement in order to get there. They also have the business buy-in to make sure that everything isn’t just in that kind of ad hoc mode. And I think even us, Tim—of course, we want to be there. We’re probably also only at that midway point for our own cloud service. Of course we want to get there, and sometimes we do fall into that fifth step, but I think knowing that [laughter] this is the path we want to take helps us do a lot better in our own monitoring efforts for our services.
Tim Hall 56:38.962 Yeah. Absolutely. We just introduced, as part of the InfluxDB Cloud subscription, access to a hosted version of Chronograf for that visualization and administrative access to the instance. And we needed to make sure that we were collecting metrics off of it, and that the alerts were in place to notify us if those instances went down. And so now the next step will be—well, do we want to predict when the customer will need to upgrade, based on volume of usage, the number of dashboards that they’ve built, the number of users that are accessing those instances? And we’ll need to make sure that we’re collecting the right data to do that.
Chris Churilo 57:15.965 So we ask you to come in and take a look at our service. We are open source, so you can download the TICK Stack for free. Also, these customers that we’ve talked about, we do have some very technical reviews—case studies on our website. So it’s not just high-level conversations about why they chose Influx. It really goes deep into the conversation with their developers and their SRE team about how they actually implemented some of those things. So I do recommend that you do take a look at those things on our website. And as I mentioned, you can also download the open source bits for free. You can also try InfluxDB Enterprise. We have a free 14-day trial. And then, feel free to ask us any questions at our community site or on this call in the remaining two minutes.
Charlene O’Hanlon 58:05.503 Right. We do have about two minutes left [laughter]. Thanks for noticing. We have time for maybe one or two questions from the audience. We’ve gotten a couple in here. So, let’s go ahead and dive right into it. The first question is, how do you recommend we should go about consolidating monitoring?
Tim Hall 58:24.965 Good question. So, that comes down to role and responsibility. What are you responsible for? Start with that. The nice thing that we’ve done, again from an agent and collection perspective, is we have mechanisms in the platform for both push and pull. And gathering the data into a single datastore like InfluxDB is really the starting point. Right? So, like we said, understand the goals. Interrogate your inventory of what you can collect today. See how those fit together in terms of satisfying the questions that are going to get asked by the business folks about the health and availability of the thing that you’re attempting to get telemetry from. And then you can build dashboards—once those things are pulled together—to identify those trends. But really, data consolidation is almost the first stage of this.
Chris Churilo 59:22.453 Yeah. In fact, that’s exactly what the three customers that we talked about did. When they looked at all the various solutions they had, they said, “Let’s start with at least taking the data from those systems and consolidating it into InfluxDB, so we can have a single pane of glass”—I know it’s a bit of an overused term—“into all the data that’s being collected.” And from there, they were able to start to determine, “Okay. Do we need all these other solutions? Can we scale back on them? Are they duplicating efforts?”
Charlene O’Hanlon 59:53.635 Okay. Great. Well, unfortunately, we are at the top of the hour, so we’re going to have to stop taking questions for this webinar. But please note that Tim and Chris will actually get your audience questions, and hopefully they’ll be more than happy to follow up with you offline. Please do check DevOps.com for other webinars that we have coming up. We’ve got a lot of them, and hopefully there will be at least one or two that catch your eye. And again, as we said at the top of the hour, today’s webinar has been recorded, so if you missed any or most of today’s webinar, please know that you’ll be receiving a link to view it on demand. Tim and Chris, thank you both for joining me today. I really do appreciate it. It was a great webinar, and I hope you guys had a good time.
Chris Churilo 60:48.661 Yeah. Thank you.
Charlene O’Hanlon 60:49.746 Awesome, awesome. All right. Well, again, this is Charlene O’Hanlon, the moderator for today’s event, and I’m signing off. Have a great day.
Charlene O’Hanlon 60:57.442 Bye-bye.