Many companies are working to define their expectations for the SRE role and the SRE toolchain, which, like the role itself, continues to evolve. The tools SREs use at any given time will depend on where an organization is in its SRE journey. Less mature organizations tend to use more specialized operations tools, while more mature organizations see more convergence between the SRE and software engineering toolchains. So, while it’s certain that there’s no “one-size-fits-all” set of tools, teams will experiment with and adopt the right tools as they seek new, efficient ways to bring greater reliability to everything they do.
Join VictorOps, Grafana and InfluxData as we explore a couple of the basic tools your SRE team can set up to drive a culture of innovation and uptime.
Attendees can expect to learn:
– Industry expectations around service reliability and availability
– How to create simple and lightweight representations of your systems for everyone in the organization
– How Grafana and VictorOps work together to create a system of engagement
Watch the webinar “Safeguard Your DevOps Transformation: Choosing the Right Tools for Cross-Functional Teams” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Safeguard Your DevOps Transformation: Choosing the Right Tools for Cross-Functional Teams”, provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Jason Hand 00:00:01.078 Hey, everyone. This is Jason Hand. Thank you for joining us on today’s webinar. We’re going to give folks just another maybe 60 seconds or less to join us while we’re sort of getting things set up. So just sit back and relax, and we’ll get things going here very quickly.
Jason Hand 00:00:37.775 All right. Well I think we’ve got a number of people already joining us, so we’ll go ahead and get started. Thank you again for signing up and joining this webinar. I’ve been excited to put this one together for quite some time. We’ve got some really awesome people on the presentation with us. I’m going to let each of them sort of introduce themselves later when they get to the portion that’s specifically about the different services that they represent. But we’ve got Margo from InfluxData, and we’ve got Leonard Graham from Grafana Labs joining us. And of course, I’m Jason Hand at VictorOps. It’s a pleasure to be able to share sort of the presentation stage with these two. So with that, let’s get right into it.
Jason Hand 00:01:19.481 So one of the things I wanted to outline very early on is, what exactly are you going to get out of this webinar? We kicked around a bunch of different ideas, and how can we really share an interesting story that really helps people understand sort of the basics of the SRE Toolchain. So the first thing we wanted to make sure we cover is let’s level set on the expectations that the industry now kind of has around reliability and availability, and what are the new challenges that have sort of come from that, and the new approaches to answering those questions of, how do we be more reliable? How do we be more available? but at the same time also deliver a really awesome service, a really awesome product or application. The next thing we’re going to do is kind of go through a little bit of the story of how you can create a simple, lightweight representation of your system. How do we get into this whole observability thing where you really are understanding the health of your system, the health of your business, the health of all of the infrastructure and applications in a much more realistic way so that you can make smart decisions moving forward? And then last, we’re going to dive into all of our tools in terms of kind of outlining a real live scenario in terms of the SRE Toolchain. It’s just a very small part of it, obviously, with just the InfluxData and Grafana and VictorOps, but it at least gives you an idea of how teams are starting to think about understanding their systems. And we’ll go through the whole lifecycle of an incident because there really is more to an incident than just that sort of initial alert and that initial incident creation. So we’ll touch on that as well.
Jason Hand 00:02:50.121 One of the things I like to talk about early on whenever I’m discussing SRE with folks and how you get started and how you kind of set things up is, this is a very important realization that VictorOps had a little over a year ago as we started to really explore how we want to address Site Reliability Engineering. How do we want to make VictorOps stand out as the best on-call incident management platform? Well of course we need to be very reliable. People kind of look at us in terms of, what is the best way to stay reliable? They rely on us to let them know when they’re having problems. So reliability is something that we take very, very seriously, and it’s also something that a lot of companies have had to take very seriously as we’ve sort of shifted into the digital world with applications and mobile and Cloud and whatnot. There’s now these expectations that I can go in and I can do my banking, or I can go in and I can order a pizza, or I can go in and I can do anything I want with the services that I use, and it doesn’t matter what time of day, and it doesn’t matter where I am. That’s the new expectation.
Jason Hand 00:03:49.912 So this, in a sense, is what we mean now when we’re talking about reliability and availability, and there’s a few things that kind of come with that. From a technical standpoint, when it comes to reliability and availability, there has to be correctness. There has to be this sense of, well the tooling and the application and the services that I am relying on, they just work. And the data is correct, and it’s consistent. And there’s also innovation. A year from now, when I go back to use a tool, it should have improved. It should have new functions and new features and things that really address the problems that I’m trying to solve and how those change over time. And so this has sort of pushed us into this new realm of, well we have some new challenges that we’re faced with. We can’t just do it the way we used to where we delivered software over to an operations or a release team, and then they sort of got it out and running into production, and they managed all that type of stuff. We really have to look at our systems and the way we build and manage and support all of this stuff in a much more holistic way. And you can already kind of see this happening across the industry. I mean, if you show up at any DevOps-related industry event around the world, you’ll start to see tools and vendors and lots of people giving talks about how they’re now addressing complex systems and how they’re building into this complexity new functionality, but also being able to understand, what’s the reality of our systems? As we create these more complex scenarios, it becomes a little bit harder to really keep track of what’s going on.
Jason Hand 00:05:22.723 If you think about how our systems have been built over time, now we’re into this new complex distributed model. We’re using the Cloud, using APIs, having all of our different tools kind of stitched together. That’s sort of the Toolchain that we’re talking about, where VictorOps relies on Twilio, and it relies on lots of other services to be able to provide the functionality that we promise. And so we are starting to rely on other services just to be able to deliver on what VictorOps promises. And that’s sort of the world that a lot of us are in now. We’re just creating these very large and complex distributed systems. And so that’s just the new normal. The thing is, though, that as we’re creating these new complex systems and sort of adjusting to what the new normal is, we are running into problems. We’re running into walls where we’re hitting some sort of friction, and things just aren’t moving quite as well as they should, and that’s really what DevOps has kind of promised a solution to for a lot of folks.
Jason Hand 00:06:19.766 So for those of you who aren’t familiar with DevOps, I know there’s a million different explanations and descriptions of what DevOps is, but let’s just real quick kind of cover those because when we talk about SRE and when we talk about the SRE Toolchains and a lot of the different stuff that’s related, people will, of course, want to know, well isn’t that DevOps? And what’s the difference between the two? And there’s actually a really great article that our friends over at Google put out not too long ago that you should go check out. I’ve got a link there. But what they’ve done is they sort of point out some of the key things about DevOps. The first thing that DevOps has set out to do is reduce those silos. We do a really poor job, especially as our companies grow, of staying in touch with each other and really disseminating information and knowledge and making sure that everybody’s aware of what’s really going on within the organization. Who’s working on what, and how’s it coming? What’s in trouble, and what’s working really well? So being able to visualize work and just understand more has been sort of the focus on reducing those organizational silos so that much more information is being shared. Another big part is that when we deploy new functionality and new code out into production, we want to do this in very small, incremental blocks. We want to introduce incremental change rather than these large changes because incremental change is a little bit easier to wrap your head around, and you know what actually changed any time there’s a deployment. It’s much easier to go back and dig into that, as opposed to some sort of large change that maybe comes after some sort of code freeze. It gets a little bit more difficult to understand what’s taking place in our production environment if we’re just always dumping out these really, really large changes.
Jason Hand 00:07:55.823 The next thing that we need to do is level set that the fact is the systems we’re building are very complex, they’re distributed, and they rely on other services themselves. And so there’s going to be failure. That’s just sort of the new norm, and that’s something that we at VictorOps try to coach a lot of companies on. You’re never going to avoid failure. You’re never going to avoid outages and incidents. It’s just not possible in the systems that we’re building. And so we have to really understand that failure is just a normal part of it. But that’s okay, because we are getting to a point where we can change and adapt and respond very quickly, and as soon as we detect that there’s some sort of problem, we can swarm to that, and we can make the necessary changes to recover and get things going again.
Jason Hand 00:08:35.652 The next thing is that we really do like to automate and try to use the tooling that we have, especially the tooling that we’re talking about today. We want to try to automate some of the stuff that we do. We realize we have a lot of interesting ideas and things that we want to deliver on, but if we get sort of stuck in the grind of doing manual work over and over again, well that’s very difficult to scale. It’s difficult to scale as a team, it’s difficult to scale as a service. So anywhere we can automate and use tooling to sort of ease our job, we want to do that. And then the last thing I’ll mention, although this is definitely not an exhaustive list of some of the things DevOps tries to approach, is that we want to measure. We want to capture and collect. You may not need that information all of the time. You may not need it just sort of sitting around so that you can stare at dashboards all the time, but when something goes wrong, you will want that information. You will want that data so that you can go dig into it and try to understand what’s happening. So we try to capture and measure and collect as much as possible.
Jason Hand 00:09:34.259 Now, if we switch sides and we talk about SRE a little bit, some of the things that they set out to really kind of communicate is that we want to create shared ownership. We want to make sure that, as developers, we do have empathy towards our end users and towards the people who use that service or who use that value. And so we have to start owning that service, and we are now, as a developer, I’m now the one who pushes my code out into production, so I understand that process, and I’m also the one who responds to any trouble with that code or with that service that I deployed out. So there’s this sort of shift in mindset where it’s not so much the development team writing code and then handing it off to someone else along the delivery pipeline. It’s a little bit more about collaboration, co-creation, and shared ownership.
Jason Hand 00:10:26.191 The next thing that I want to talk about is that SRE also understands that if you find a good balance of understanding your problems and your incidents and that kind of stuff with the releases, once you sort of understand those correlations, then you can actually balance those deployments in terms of failures and anything that’s going to go wrong. You can give yourself a little bit of wiggle room because you know now what happens when you deploy. You have certain expectations about what you’re going to see, so it just becomes a little bit easier to understand and release more frequently. Also kind of related to this is that the faster and better we get at deploying, and the more confidence that comes along with that, well, that’s going to actually reduce the costs of those failures. Like I mentioned, you can’t avoid it. You’re never going to outrun problems in your systems. But if you can get really, really good at knowing about that problem, get really good about making sure that the right team gets involved and is notified about that problem, you can dramatically reduce the costs of those failures. The longer there’s a service disruption that just kind of goes unchecked, obviously there’s costs associated with that. So we’re going to try to attack that.
Jason Hand 00:11:34.591 Encouraging the idea of, let’s take whatever it is I do today and automate that stuff so that I can move onto something else. And that kind of speaks to the automation point of DevOps. There’s a lot of folks within SRE organizations who have maybe been trained more traditionally in the development world. So when you take some sort of sloppy problem that is generally something the operations team is trying to tackle, and you give that to a developer, well they start thinking in different ways and they come up with really interesting solutions to solve those problems, which then frees up more time for research and moving onto other interesting processes and technologies. And then last, they believe that operations is a software problem, and that if we define prescriptive ways of measuring availability, uptime, outages, toil, all that kind of stuff, then what we can do is we can take a very pragmatic approach to how we manage our operations, how we manage infrastructure, and that kind of stuff. So really take a little bit more, like I said, pragmatic and measured approach to the infrastructure that sort of runs these applications that we’re building.
Jason Hand 00:12:35.840 If you kind of put these two in contrast with each other, you can see there’s almost a direct relationship between DevOps and SRE in terms of what they’re trying to do. At the end of the day, DevOps and SRE, while not perfectly the exact same things, there’s a lot of overlap here. So you can see that really some companies and teams and organizations who focus on SRE, they really are just adopting some of the basic principles of DevOps. Now with that said, one of the things that I like to make sure people understand is that when we’re talking about DevOps I don’t mean continuous delivery, I don’t mean just automating and measuring. It’s not just the technical stuff, because DevOps isn’t something that’s only for a SaaS company or any of the companies that are on this presentation. DevOps is something that really applies to an entire organization throughout all of the departments. So I like to say that it’s really just an approach to our work where we’re looking for ways to continuously improve the technology, the process, and the people as they relate to how we build, deploy, operate, and all that kind of stuff. Not just an application, but actually you have to think of it in terms of the value. What is the thing that customers come and use your product to do? That’s the value. So you have to understand what they’re trying to accomplish and actually put in efforts to improve the delivery of that value, because at the end of the day we may have this really shiny awesome application, but if people can’t do what they need to do and solve their own problems, then we’re not delivering on value, and we have to go back and figure out how we can do that better.
Jason Hand 00:14:07.217 A lot of this, if you just really want to distill it down to something very, very simple that most people can understand, we’re just talking about continuous improvement. And I’m talking across the board. Everything that you do today is fine for today, but that’s today’s best practice. Next week’s best practice is going to be something else. We’re always kind of revisiting how we do things on the tooling front, on the process front, and also on the people side.
Jason Hand 00:14:29.842 So SRE. Hopefully that cleared things up for you a little bit. I want to touch on the fact that sometimes people get a little bit confused about that. Hey, isn’t SRE a role? Site Reliability Engineer? Or isn’t SRE a team? Is it the SRE team? Well there’s some truth to that, but I also want to point out that for a lot of companies, VictorOps included, we actually feel like there’s a little bit of an inherent danger in taking that approach, in that one of the things that we talked about, the very first bullet point of DevOps, is that we’re trying to reduce organizational silos. So what often happens is teams sort of realize, hey, we should take this SRE stuff a little bit more seriously, and maybe create this team or hire a Site Reliability Engineer. What they inadvertently do is just create a new silo. They just give that team or that person the sole responsibility, and they actually did nothing to improve the overall reliability of their service. They just gave responsibility to another team or another individual. So for us, we wanted to create at VictorOps a culture of reliability where everyone from the engineering team all the way up to people like me, we are all responsible for the availability and the reliability of VictorOps. I think that that’s actually something a lot of companies and a lot of teams really embrace, and it’s helped them with their SRE journey, rather than saying, well this is the responsibility of one team or one group.
Jason Hand 00:15:52.295 You can’t have an SRE presentation if you don’t mention observability, and the reason is that, whether SRE is a cultural shift or a new department, observability is the key to everything. You need to be able to go in and ask questions. You need to ask your system questions about what’s happening right now, what happened last week. There’s a lot going on within a complex system and there’s no way to know it all. No single senior developer is going to have a pulse on every corner of their system. In actuality, you just need to be able to go in and really observe more information about your system and ask those questions about how things are actually running in reality. And a big reason is that within complex systems there are so many unknown unknowns. We really have to start reducing those so we have a better, more clear picture of what’s taking place.
Jason Hand 00:16:47.726 So to start the journey, to start most companies’ SRE journey, what I have historically seen is that you really have to start asking yourself, what are we concerned with? What is really the main thing that might impact our reliability or our availability? If you go around, you get a bunch of people involved with this type of exercise and this type of conversation, and you ask them each individually what keeps you up at night, you’re going to get a lot of answers, because depending on what people work on and different roles and responsibilities, lots of people have different ideas about what is a real threat to the availability of our system. So it’s really great to ask that question, sort of round robin throughout your engineering team, and even your product team and other areas of your company. When you think about the value your system is delivering on, what is possibly something that’s going to derail that and prevent that from happening? So you have to go out and you start collecting data so that you can dig into that data and really understand.
Jason Hand 00:17:50.268 Now one of the things that we wanted to point out today is that even though we’re collecting a ton of data and we’re going to store that into someplace that we can go in and query it and sort of understand it in a little bit more holistic method, data means different things for different teams. Even though we’re collecting it and we’re going to set some triggers and some thresholds so that people get alerted about problems, we need to make sure that we’re very clear about defining the action that needs to be taken. One of the things we harp on people a lot at VictorOps is, you never want to page someone if it’s not actionable. Defining what needs to take place, what is the action, is something that you really need to spend some time focusing on very early on.
Jason Hand 00:18:32.820 When it comes to the journey, you’ve probably heard lots of people say, there is no silver bullet. There is no script. I can’t just tell you how to do SRE or even how to do DevOps. It’s kind of different for everyone, and everybody’s journey is a little sloppy and a little all over the place. But one thing we certainly see a lot of the time is that very early on, sort of less mature organizations within their DevOps and SRE journey, they tend to rely on more specialized operations tools. Then as they evolve, as they sort of get more comfortable with some of the tooling out there, some of the things we’re talking about today, then you start to see this convergence between SRE and the software engineering tools, and you start to see more historically operations-minded things now starting to show up with our developers. So for example, developers are now going on call, whereas just maybe five or ten years ago that wasn’t something that we saw very widespread. But now developers are signing up for VictorOps accounts, and they’re also signing up and creating Grafana accounts, and so on, that kind of stuff. This is a new shift for them. These are tools that they have not necessarily been exposed to. The point is that everybody’s journey is going to be a little bit different, but you will start to see this kind of convergence of tools as they mature along their journey.
Jason Hand 00:19:48.068 I always get the question, well that makes sense, but how do I do it? Where do I start? Where do I begin? The answer is always that it depends, because everybody’s journey is a little bit different. For us it started with that question, what keeps us up at night? So I think that’s kind of my piece of advice that I give a lot of companies, let’s go around and have this discussion. But once you’ve had that discussion, then you need to start talking about what does our toolchain look like to actually capture this information and verify the concerns that we have, or verify the areas of improvement that we’d like to make. So from there, I want to kind of hand it off to Leonard from Grafana, and he’s going to tell us a little bit of a story to help us illustrate what we’re talking about here. So yeah. Leonard, I’m going to hand it over to you, and why don’t you take it away from there?
Leonard Graham 00:20:35.400 Thank you. So this is Leonard from Grafana, one of the core developers of Grafana. In a not-so-distant life I was responsible for implementing SRE and DevOps at my previous job. I’m going to set the stage a little bit and talk about a team. They work at a medium-sized company, and the company has decided it’s going to go more into a kind of DevOps, SRE journey. Well, things might not go as they should, and things are a bit shaky. All of a sudden this team, who has been developing services for quite a long time, is suddenly responsible for deployment and reliability and being on call for a service that, well, they’ve developed it for years, but to be honest they don’t know that much about it or about operations tools. So this is going to be a bit of a challenge for them.
Leonard Graham 00:21:31.260 Can we go to the next slide? What happens is something that you guys can probably recognize yourselves. Things that you never wanted to happen, but they happen anyway. Support contacts the developers with a problem that their users have come to them with. The service seems to be quite generally unavailable, and the developers are dumbfounded. They know that there are some systems in place to kind of look at the applications, but they’re not that used to looking at them. But thankfully, the old ops team that is now responsible for the infrastructure has left behind a setup with Telegraf and InfluxDB to kind of gather and collect metrics, and they have some basic dashboards in Grafana to be able to look at the requests going on within their system. They even have some alarms for VictorOps. Unfortunately, none of those contacted them this time, so they heard about the problem from support instead of hearing from their on-call, incident management system.
Leonard Graham 00:22:44.676 Can we go to the next slide? It turns out that they have an increase in requests to their system, so they go to their phone and they find this. But they also see that while the requests are increasing, which is generally a good thing, there’s nothing in there regarding logins going up or purchases going up or anything like that. If you see an increase in visitors, you generally hope that you would also see an increase in business activity, and this gets them a bit confused.
Leonard Graham 00:23:25.162 So on the next slide, thankfully they talked to one of the other teams, and the other team kind of suggested that they should look at the RED method and use that as a tool to get a bit of a better grasp of what’s actually happening with their system. So the RED method is basically a way to look at how to handle the basic metrics that you want to look at for a request-based system. First you look at the request rate, so the rate of requests over a period of minutes or seconds or whatever you choose as your period, and thankfully they’re already tracking this. But you also look at the rate of errors. How many of these requests are actually causing problems in the system, and what’s the duration of these requests? As it turns out, for this team, thankfully, while only the request rate was graphed in their Grafana dashboard, both errors and duration were already tracked. So after some fiddling around, they set up the new dashboards and they could start looking at what was going on within their system and get a little bit more of a grasp.
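To make the RED method a little more concrete, here is a minimal Python sketch of the three signals computed over a batch of request records. The status codes, durations, and window size are purely illustrative, not the team’s actual data; in practice these numbers come out of InfluxDB queries rather than application code.

```python
from statistics import mean

# Hypothetical request records: (timestamp_seconds, status_code, duration_ms)
requests = [
    (0, 200, 120), (1, 200, 95), (2, 500, 30000),
    (3, 200, 110), (4, 502, 30000), (5, 200, 130),
]

window_seconds = 60

# R -- Rate: requests per second over the observation window
rate = len(requests) / window_seconds

# E -- Errors: fraction of requests that returned a 5xx status
errors = sum(1 for _, status, _ in requests if status >= 500) / len(requests)

# D -- Duration: average latency (a real dashboard would use percentiles,
# since timeouts like the 30-second ones above badly skew the mean)
duration_ms = mean(d for _, _, d in requests)

print(rate, errors, duration_ms)
```

Note how the two timed-out requests dominate the mean duration, which is exactly the kind of skew that made this team’s duration graph light up.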
Leonard Graham 00:24:41.459 On the next slide, we can learn that the reason for these problems was client-side failures. Well, to backtrack a little bit, what they saw on the other graphs was that failure rates had gone up quite dramatically. Looking back, failure rates had previously been around 5% or something like that, but now they had gone up to maybe 50%. And the average durations for all of these requests had also gone up a lot. This kind of led them to look at the web service, and they found out that they were dropping requests on the client side. Due to a faulty configuration within their backends, those requests weren’t closed and they would go on until they timed out after 30 seconds, and that’s what caused all these retries, these new extra requests. So having the RED method kind of helped them figure out where they should, in the next step, go and look, and they went to the logs and they found these errors in the web server logs.
Leonard Graham 00:25:51.598 So this team, I mentioned they had Telegraf set up. So Telegraf is for collecting information. This was then sent to InfluxDB, which was storing the information. That’s a time series database that can handle events and everything. And then they had Grafana for actually looking at this information. Well now, no alerts came to VictorOps, but after ending up with these kinds of problems they set up just the basic, basic alerting, looking at their error rate and making sure that if it went too far above the normal error rate they would actually see that, okay, it’s a 30% error rate. Now we want an alarm and we want to be able to go and look at that, potentially.
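The “basic, basic alerting” described here boils down to a threshold check on the error rate. Here’s a hedged sketch of that decision logic in Python; the 5% baseline and 30% threshold are the illustrative numbers from the story, not recommended values, and in a real setup this check would live in a Grafana alert rule rather than application code.

```python
def should_alert(error_rate, baseline=0.05, threshold=0.30):
    """Page someone only when the error rate is far enough above the
    historical baseline to be actionable -- unactionable pages just
    erode trust in the alerting system."""
    return error_rate >= threshold and error_rate > baseline

# Normal operation (~5% errors): no page.
print(should_alert(0.04))
# The 50% failure spike from the story: page the on-call.
print(should_alert(0.50))
```

This mirrors the point Jason made earlier: never page someone if there is no action to take, so the threshold must sit well above normal background noise.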
Leonard Graham 00:26:45.106 So that’s all for my story for now, setting the stage. But let’s head over to Margo from InfluxDB and learn about how to use that software.
Margo Schaedel 00:26:57.013 Thank you, Leonard. So as he said, I’m Margo. I’m a Developer Advocate with InfluxData. I do apologize if there’s some noise behind me as I’m speaking. I’m chiming in from GlueCon Denver. Just before we start diving into InfluxDB and Telegraf, I wanted to kind of reiterate what Jason and Leonard have been saying as kind of the focus of how we built the InfluxData platform. In order to have a successful DevOps cycle, we always need to ensure that everyone participating in that cycle has a responsibility for performance and a good experience for the users. And that’s kind of the focus that we had while building out the Influx platform. It’s not enough to put the burden of performance management just on the shoulders of the SRE teams. The developers and builders are also responsible for collecting and reviewing metrics so that you can overall lessen any chance of performance issues in production. When we were building out the InfluxData platform, we wanted to focus on making it really easy to use for everybody. It has an open source core, so it’s pretty easy to pick up and start running with, and we just wanted to make it very easy to deploy. So with that, we’ll jump into the next slide where we’ll talk a little bit about the Influx platform.
Margo Schaedel 00:28:27.985 So here you can see that InfluxData is a platform to pull all different kinds of metrics and events. It’s not only pulling system metrics. We can pull IoT metrics, infrastructure metrics, business metrics, really anything that is timestamped is suitable for InfluxDB, since it is a database that is purpose-built and optimized for time series data. Then from all that data, the most common use cases, which you’ll see on the right-hand side, are infrastructure monitoring, structured logging, and tracing. It’s often used for forecasting and automation, and machine learning. There are so many different use cases where you can pull value from all that data that’s stored in InfluxDB. So InfluxData is actually a platform with four different components, but today we’ll only be talking about two of the open source projects. The first one is Telegraf, which is our data collection agent, and the second is InfluxDB, which is our purpose-built Time Series Database.
Margo Schaedel 00:29:52.858 We’ll jump into Telegraf first. All right. So what is Telegraf? It’s a plugin-driven server agent and we can use it to collect and report metrics and events. There are both input plugins for pulling our data, and then output plugins to push our metrics elsewhere. I think there are about 160 different plugins, all open source, that are available to use to pull from different sources or push to different places. Some of the input plugins can pull directly from the system that we’re running on, so that will pull system metrics, and I believe that’s the default configuration. We can pull metrics from third-party APIs. We can also listen for metrics via a StatsD or Kafka consumer service plugin. And then we can also send our data elsewhere. You can have the data from Telegraf sent to InfluxDB, but you can also send it to any other database. So here we’ve listed Graphite, OpenTSDB. You can send it to other message queues or services. So Telegraf is really versatile in what you can do with it, and I just want to stress that it’s all open source. Telegraf has really been driven by our community. We at InfluxData maybe wrote the first six plugins to start things up, and then over a number of years we’ve seen it grow to about 160 now, and that’s mostly from community involvement. Okay. And one more thing to add with Telegraf and InfluxDB is that you don’t necessarily have to use Telegraf to use InfluxDB. They are optimized to work together, but you can actually write data directly into InfluxDB through an HTTP API, or using a number of client libraries that are also available.
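To give a rough picture of what the plugin-driven setup described above looks like, here is a minimal Telegraf configuration sketch. The plugin names (`inputs.cpu`, `inputs.mem`, `outputs.influxdb`) are real Telegraf plugins; the URL and database name are illustrative placeholders for your own environment.

```toml
# Input plugin: collect CPU metrics from the host Telegraf runs on.
[[inputs.cpu]]
  percpu = true
  totalcpu = true

# Input plugin: collect system memory metrics.
[[inputs.mem]]

# Output plugin: write everything to a local InfluxDB instance.
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
```

Swapping the output for Graphite or a message queue, or adding a StatsD listener input, is just a matter of declaring additional `[[inputs.*]]` or `[[outputs.*]]` blocks.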
Margo Schaedel 00:32:06.385 So let’s jump into InfluxDB. This is our purpose-built Time Series Database. It’s built from the ground up to handle high volume data and high write and query loads, and it is a data store for time series data, so timestamped data, which is why we see it often used in DevOps monitoring, the IoT space, application metrics, and also in real time analytics. Overall the entire data store is pretty lightweight and you can configure it according to your use case. If you need to keep the data for only a defined length of time, we offer that feature through our retention policies. We can downsample the data automatically, so that you have automatically expiring data and you’re not overloading your system. And then you can also manually delete any unwanted data from the system as well. All that is possible through a very SQL-like query language, InfluxQL, for interacting with the data. As I said before, you can also write your data or send your data onward through any number of client libraries, such as Node, Ruby, Python, or Go for example. Then once you have all that data stored, it’s really well-suited to mix with other projects, such as being able to visualize your data with Grafana and kind of pull the value out of that data. So Leonard is actually going to be going through that with us. So I’ll pass it back to you, Leonard.
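To make the downsampling idea concrete, here is a rough Python sketch of what a mean-based downsample does conceptually. In InfluxDB itself this would be expressed with InfluxQL (`mean()` with `GROUP BY time()`, typically inside a continuous query) rather than application code; the function below is purely an illustration of the idea.

```python
def downsample(points, window):
    """Average (timestamp, value) points into fixed-size time windows,
    roughly what an InfluxDB continuous query with mean() produces."""
    buckets = {}
    for ts, value in points:
        start = (ts // window) * window  # start of the window this point falls in
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals))
            for start, vals in sorted(buckets.items())]

raw = [(0, 1.0), (5, 3.0), (12, 10.0)]
print(downsample(raw, 10))  # [(0, 2.0), (10, 10.0)]
```

Combined with a retention policy that expires the raw series, this is how you keep coarse history around without the storage cost of full-resolution data.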
Leonard Graham 00:33:51.225 Thank you. So Grafana is a tool for visualizing data, and we’re especially focused on time series data. It’s very easy to use to create beautiful dashboards that show whatever kind of data you want to show, all the time. It’s open source. We have support for over 40 data sources. Of course we have support for InfluxDB, but also Graphite, Elasticsearch, and many, many more. Which means that Grafana is a very good place to kind of collect all your different data and add it together. You could add [inaudible] information and the information from your internal system from InfluxDB on the same dashboard and add all these numbers up, and get some business value out of it. If you want to take a look at Grafana, you’re more than welcome to go to our play site at play.grafana.org, where you can both look at some nice dashboards, but also kind of play around with them and see how they work on the backend and fiddle around.
Leonard Graham 00:35:01.337 Next up, I’m going to talk a little bit about time series, because time series are actually basically everywhere. If you have a data point and you have a timestamp, and then you collect a few of those, hopefully with a fairly even duration between them, you have a time series. So that could be temperature information, or perhaps information about your infrastructure or your applications, or even a power plant or beehives; basically you could use Grafana to visualize this kind of data and play around with it. Grafana is very focused on dashboards. It’s kind of what everything builds from in Grafana. The first thing to do, and it actually is very easy to do with Grafana, is to create these dashboards that show your data.
Leonard Graham 00:35:55.664 On the next slide I’m going to show you how you would go about creating your first dashboard. So we go to the plus up on the left side and we create a new dashboard. And when we do that we get this kind of panel that we can configure. As I think I mentioned before, the graph panel is kind of the most common panel, but there are also single stat and text-based panels and so on. That means that we can create the kind of beautiful and rich dashboards shown on the previous slide. There are many more plugins than the ones shown here that you can download and install as well. But of course just creating a dashboard like this isn’t super interesting. The really interesting part is when we start adding a graph and we start adding data. And that’s also really easy to do. So here we start from the ground up. We have an empty dashboard and I have a query editor. So this is an InfluxDB data source and we add CPU. As you can see, it’s called CPU mean right now, but we change the alias to idle, as it’s CPU usage idle. We also add CPU usage user to the graph and name that as well, and now we have a dashboard with both of these values at the same time. We can also rename this panel. There are many, many more settings than the ones we show here; we just wanted to give you a very short, quick introduction on how to create your first graph.
Leonard Graham 00:37:48.143 But there’s more than that, especially when we talk about what we do right here with SRE and DevOps. It gets very important to be able to react to what you see in your data. And when you look at data over time, it can get very easy to see patterns and see where you have your thresholds. So in Grafana you can actually create these alerts on any graph panel and set thresholds, and specify for how long they have to be above a threshold, or below a threshold, to go into an alarm state. And you can even use the graph panel, as we just saw, to drag your thresholds to a specific point in the graph. That’s something that the team we talked about in the story before could have used to make sure that they set the threshold for the error rate at a reasonable level, where they knew that generally we don’t go over this error rate. Also you can add even more things to the dashboard.
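The “above a threshold for a given duration” condition Leonard describes boils down to a very small piece of logic. This is a hedged sketch of that idea in Python, not Grafana’s actual implementation; the sample values are invented for illustration.

```python
def should_alert(samples, threshold, min_consecutive):
    """Fire only if the most recent `min_consecutive` samples are all
    above `threshold`, so a single noisy spike doesn't page anyone."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])

error_rate = [0.01, 0.02, 0.09, 0.08, 0.07]
print(should_alert(error_rate, 0.05, 3))  # True: last three samples exceed 0.05
print(should_alert(error_rate, 0.05, 5))  # False: the first two did not
```

The duration requirement is what separates a sustained problem from transient noise, which is exactly why Grafana lets you configure how long a value must stay past the threshold before the alert fires.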
Leonard Graham 00:39:05.704 So next up, we’re going to try to add some annotations as well. Okay, you have alerting, but now something has happened and you want to share with the team that something has happened, something you’re investigating. You can actually annotate your graphs and show that in your dashboard, so that it becomes very easy for the others on your team to know when something has gone wrong and if someone’s looking at it. Or for example, you could connect your deployment system with Grafana, so that all the deploys you do on a system will show up as annotated lines in the graphs. That way, when you see that right after a deploy the graph veers very quickly up or down, you can see that the change you see right there is most certainly related to the deployment you just did. So by very small means you can actually get a lot of value and help in understanding your data.
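Connecting a deployment system to Grafana can be as simple as a deploy hook that posts an annotation. As a hedged sketch: the payload shape below follows Grafana’s annotations HTTP API (`POST /api/annotations`), while the host, auth details, and build label are assumptions invented for the example.

```python
import json
import time

def deploy_annotation(text, tags, epoch_ms=None):
    """Build the JSON body for Grafana's annotations HTTP API:
    a millisecond timestamp, a list of tags, and a text label."""
    return {
        "time": epoch_ms if epoch_ms is not None else int(time.time() * 1000),
        "tags": tags,
        "text": text,
    }

payload = deploy_annotation("Deployed api-server build 1234",
                            ["deploy", "api-server"], 1600000000000)
body = json.dumps(payload)
# A deploy script could POST `body` to http://<grafana-host>/api/annotations
# with a Bearer API-key Authorization header; the annotation then shows up
# as a vertical line on any dashboard that queries those tags.
```

Tagging the annotations (here with a hypothetical `deploy` tag) is what lets a dashboard filter to only the events it cares about.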
Leonard Graham 00:40:18.049 So on the next slide I just want to show a few different dashboards. As you can see, we have a world map where you can show data, kind of what’s going on in different places in the world. Then on the left-hand side we have a very rich dashboard. We have a few different plugins that show you different information from a system called Savex. And I would definitely recommend, if you have data that you want to look at, throw it up on a TV. Do a nice dashboard, it doesn’t take you that long, and throw it up on a TV, and two things will happen. First, you will walk by that TV, hopefully, if you put it in a good place, every day, and you will start to learn to recognize the patterns within your systems, so that just by casting a glance at the TV while you fetch coffee, you can see that something’s going wrong. Or perhaps going well. Perhaps sales are up, way, way above what you would expect, and now you want to talk to the business people and find out what’s going on. Second, by visualizing your system in a very open and shareable way, it makes it a lot easier to have these discussions with other people that wouldn’t necessarily look at the internals of your system normally. So at previous jobs where I’ve been, it’s been very common for me to set up the TV, and then all of a sudden the product owner and one of the designers and one of the sales guys are standing by our monitor and having interesting discussions with us about the internals of the systems, discussions that we would never have been able to have previously. Well, that’s all from me for now. I’m going to hand it over to Jason, who’s going to talk to you about incidents and how to deal with them.
Jason Hand 00:42:18.078 Cool. Thank you, Leonard, and thank you, Margo, for both of those quick introductions to Influx as well as Grafana. I want to mention something that Leonard just talked about because it’s super important and I don’t want to skim over it. It’s this notion that you put these dashboards up onto a TV and suddenly conversations start popping up; people from different areas of the organization come up because they’re curious and they see data that looks interesting. It definitely helps with that first bullet point of both the DevOps and SRE philosophies, in that we want to reduce those organizational silos. We want to show more about what’s happening, not just with the infrastructure. You know, some of the demos that Leonard was showing were things like CPU load that are more related to the underlying infrastructure, but he also mentioned things like, what do our sign-ins look like at the moment? At any given moment within your system, you’re going to have X number of people in there. If that dips below a certain number, then maybe somebody needs to know about that. And if you don’t make that stuff visual and you’re not setting alerts and thresholds and that kind of stuff, then the wrong people are aware of information, or they’re not fully comprehending the data that’s in front of them. It kind of goes back to that comment we had earlier that data means different things to different people. I can tell you recently we actually have a giant TV here at VictorOps and we have a Grafana dashboard that kind of rotates through a couple of different ones, and there was one day when one of our dashboards had some metrics on there that were all in the red.
It definitely looked like some sort of big problem, but as soon as I raised it to someone in our engineering team, they pointed out that that is something we noticed and we want to make sure people see, but the way our system is designed for degradation and sort of shifting resources around, this is really just to let us know that something is happening, but it is not an indication that VictorOps is experiencing some sort of problem. So it’s always important to understand the data and what it’s really telling you, and Grafana combined with Influx is a great way to sort of pull that off.
Jason Hand 00:44:22.462 I want to kind of wrap this up. We’ve talked about how you can collect this information, how you can store this information, how you can represent it in a way where you can really truly understand more about how your systems work. Leonard showed us how you can go in and create some thresholds so that when something does sort of get out of bounds, you make sure somebody is aware of it. One of the things that we talk about a lot in terms of the DevOps space is where is the technology going? You hear things about artificial intelligence and all these sort of fantasy ideas about robots taking our jobs and really just automating so much of the problems that we see in our lives. I think it’s fair to say that for any reasonable amount of time in our foreseeable future, there’s always going to be humans on the other end of that page. So when something goes wrong with your systems or your business, or whatever it is that you’re trying to observe and monitor, there’s going to be somebody who is on the other end of that. You have to really understand that the people part is kind of that squishy, harder part compared to some of the technology things. So we have to think about how do we create a humane experience for our human response and our human responders?
Jason Hand 00:45:37.513 So I’m going to close up with a few ideas about how you can improve that whole thing. The first thing is that it’s really important to understand that when something goes wrong, it’s not simply just an alert that’s generated, and then someone comes in and maybe looks at something and then solves it, and then goes back to what they were doing. Now of course we know that that is how it works for a lot of folks, but there’s more to the story that, in those scenarios, kind of gets left on the floor; you didn’t even really dig in. So for us, we like to remind people that detection is sort of the first phase, although it’s circular, but it’s certainly the early part of the problem. We detect that there’s an issue. We kind of showed you that earlier. Then we’ve got this response phase where we’ve got the right people involved. They have the context, kind of like what Leonard was talking about, where we can see annotations and we did a deployment here. And although we can’t say for absolute certain that this service disruption is caused by this deployment, there’s a strong correlation there, so we should probably go back and dig into that a little bit. So we’ve got the response phase where we’re trying to triage and understand what’s happening and how bad it is. Then we get into the remediation phase, and this is where people are actually doing things that are recovering the system or restoring service. You know, running scripts, or whatever it may be, to actually bring the service back. So that’s the remediation phase.
Jason Hand 00:46:58.613 A lot of teams will stop after that. As soon as they’ve resolved the issue, they’ve got service restored and recovered, they kind of just go back to what they were doing. And the unfortunate part is that it’s this next phase, the analysis phase, that holds the gold under the rubble of whatever big problem you just had, and most people don’t take the time to really sit down and analyze that. A lot of teams will do root cause analysis. I could give a whole talk about how that’s not the right approach. But the point is you really have to go in and analyze the data in a much more holistic way, including the human response. How well did we swarm to the problem, how well did we dig into the data and the context, and is there any way to improve each of those phases in terms of time? Maybe it took us two minutes to detect that. Can we somehow cut that in half? And the same with the response and the remediation phases. All that feeds into readiness. We’re never going to be able to prevent problems, but we can certainly be ready, we can be poised, we can be responsive, and make sure we have our runbooks in place, and make sure we have our dashboards in place. Just really make sure that when something goes wrong, not if, but when something goes wrong, somebody, whoever is that first responder, has everything they need. It’s something we’re just prepared for. So that leads us back into detection, because that’s going to improve the detection phase, and so on and so forth. So it’s important to really understand that incidents are a much bigger deal than I think a lot of us initially like to think.
Jason Hand 00:48:21.746 Another thing I want to point out, and I’ve already kind of mentioned this, but once you’ve prepared yourself and you’re more ready for things, your posture switches away from reactionary to more responsive, responding to problems as though they’re a known thing. If you think about a fire department, they have all their equipment ready to go, the fire trucks are pointed towards the door. Everything’s ready to go because they know something at some point is going to happen. It’s not going to be a surprise; we’re going to be ready for it.
Jason Hand 00:48:51.965 Another thing we’ve kind of talked a little bit about here, too, that VictorOps tries to make very easy and part of the core services, is who owns what. Who’s responsible for these problems when they happen? So it’s very easy to come in, create your teams, create your roles, and establish just a few rules in terms of who should be alerted about what types of problems. You want to go in and set that up as is appropriate for the services that you’re building and the teams that you have.
Jason Hand 00:49:23.053 Some of the other things that you can do to make the on call experience a lot more humane and reduce each of those phases, especially in the response and remediation phase, is adding some escalation policies. In terms of, what do you want to happen when Grafana tells us that this type of a problem may be looming or is going on right now? If somebody’s paged, but they don’t respond within, let’s say, 60 seconds, what do we do next? What is the escalation policy? And having that clearly defined in terms of what should happen when under certain circumstances will make the whole process so much easier. A lot of that’s done through alert rules that you can put in so that when certain things happen the system just automatically pages the right person. Maybe I’m not supposed to be on call right now, but the way I’ve designed it, I’ve set it so that if this one thing has any kind of problems, just ignore the on-call rotation. Don’t go page Dan, he doesn’t know how to fix this problem right now. Send it to me. So you can customize how you want to automate stuff and how you want to escalate things. You can even start to dip your toe into a little bit of self-healing with some web hooks and using some chat bots so that when something fails, if the way you’re going to solve that problem—let’s say maybe your Apache web server has stopped working. If you’re going to page me so that I can just log into that instance and then restart that service, well a much more humane way to approach that is maybe, rather than page me, let’s just ask one of our bots. Or let’s put in a rule that says go out and restart that service. And of course notify people in terms of just information rather than waking somebody up in the middle of the night.
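The escalation logic Jason describes, page someone, wait for an acknowledgment, then move up the chain, boils down to a lookup over elapsed time. Here is a minimal Python sketch of that idea; the policy shape and contact names are invented for illustration (VictorOps configures this in its UI, not in code like this).

```python
def who_to_page(seconds_since_alert, policy):
    """policy: list of (delay_seconds, contact) sorted by delay.
    Return the contact whose escalation window covers the elapsed time."""
    current = policy[0][1]
    for delay, contact in policy:
        if seconds_since_alert >= delay:
            current = contact
    return current

policy = [(0, "on-call engineer"), (60, "secondary"), (300, "team lead")]
print(who_to_page(0, policy))    # on-call engineer
print(who_to_page(90, policy))   # secondary
print(who_to_page(600, policy))  # team lead
```

Writing the policy down this explicitly is the point: everyone knows in advance exactly what happens when a page goes unacknowledged for 60 seconds.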
Jason Hand 00:51:06.721 But all of this is sort of reliant on the ability to communicate. And communicating under stress, you know, when there’s a really big problem happening and your systems are down, becomes really, really hard. People tend to focus on what they’re doing. They don’t always elaborate on what they’re seeing and what they’re thinking. But if you’re familiar with ChatOps at all, this is sort of another new introduction into the DevOps space, where we’re trying to move all of the conversations that we have and all of the actions that we take into chat, creating a detailed, almost time series reproduction of what’s happening from an engineering side and from an operations side in terms of real activity. We put that into chat, because now that’s our interface into who’s doing what, and what the results were. Maybe I did go in and restart a server, and I did it from within Skype, or from within whatever chat tool I’m using, say Slack. Everybody on my whole team got to see how I did that and what the result was. So ChatOps is this really cool thing that a lot of teams are starting to look into. I actually did a book for O’Reilly a couple years back on that, and there’s a link where you can download that for free. But communication is always really tough during an incident, so having ways to make sure everybody is communicating very clearly is important.
Jason Hand 00:52:19.193 Then like I mentioned with this incident life cycle, all the different phases, most people kind of just drop off and don’t give much second thought to the analysis phase, and there’s so much interesting information in there about how systems are built and how services maintain their availability. I’ll tell you right now, if you’ve got your entire engineering team or your operations team basically on call all of the time, there are going to be problems with your availability moving forward, because those people can’t sustain that. They’re going to become much slower to respond to problems, because they’re just being paged all the time. Especially if you have a very noisy infrastructure or something within your system that’s flapping. So being able to visually understand in a report what it’s like to be on call for our organization, and who’s responsible for this service, can shed a lot of light on how you have your on-call teams and escalation policies designed, and really who’s struggling. Burnout is a real thing for a lot of us in the engineering world, especially those who feel responsible for keeping systems up and going. So this will tell you a little bit more about what that experience is like for those on-call engineers. And you can also see things about frequency, like what your noisiest alerts are, and you can go in and drill in and really tackle some of the areas of your systems that are causing problems.
Jason Hand 00:53:34.929 We talk a lot here at VictorOps about doing post-incident reviews. Many refer to these as postmortems. We don’t really love that term. We think that kind of insinuates something slightly different than what we’re actually talking about. This is a learning review. It’s not a root cause analysis. We don’t actually care about the root cause. We actually want to see if we can speed up all of the different phases of the incident. So let’s talk about what really happened during the detection phase. Why did it take so long for us to know the problem? Then let’s talk about the response phase. Jason was paged, but it took him five minutes to acknowledge that page and then dig into what’s going on. Five minutes isn’t too bad, but can we cut that down a little bit? So a post-incident review is designed to be a learning review. It’s to understand more of what took place, especially on the human side. It’s really less about what’s the thing that broke, because quite honestly in these large complex distributed systems that I keep going on and on about, your system is different tomorrow. So whatever you think is your atomic nuclear root cause to a problem is likely not part of the system tomorrow, or it’s changed in some dramatic way. So looking at failure and addressing failure in a different way is something we promote big time here at VictorOps with the post-incident reviews. And I also did a book for O’Reilly, another one of those free ones, too. So download that there with the link as well. And that’s going to help you sort of reduce the time it takes to understand problems and recover from problems.
Jason Hand 00:54:54.463 So if anybody’s familiar with the State of DevOps Report that comes out every year from DORA and Puppet, they talk a lot about MTTA and MTTR especially, the mean time to acknowledge and the mean time to resolve. They actually have a formula for the cost of downtime, and they use MTTR as a variable of that cost. So if you can drive down the average time it takes to respond to and recover from a problem, you’re actually going to drive the cost of your downtime down as well. As many of you know, when our systems are down we’ve got a lot of engineers sitting on their hands, a lot of people sitting on their hands, and there are some extra costs associated with that downtime that sometimes we don’t even think about.
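Measuring those averages is straightforward once each incident records when it was opened, acknowledged, and resolved. A minimal sketch follows; the field names are assumptions for illustration, not a VictorOps or DORA schema.

```python
from statistics import mean

def mtta_mttr(incidents):
    """incidents: dicts with 'opened', 'acked', 'resolved' epoch seconds.
    Returns (MTTA, MTTR): mean time to acknowledge / mean time to resolve."""
    mtta = mean(i["acked"] - i["opened"] for i in incidents)
    mttr = mean(i["resolved"] - i["opened"] for i in incidents)
    return mtta, mttr

incidents = [
    {"opened": 0, "acked": 60, "resolved": 600},
    {"opened": 100, "acked": 160, "resolved": 400},
]
print(mtta_mttr(incidents))  # mean ack of 60s, mean resolve of 450s
```

With MTTR in hand, a cost-of-downtime estimate is just MTTR multiplied by incident frequency and a per-hour downtime cost, which is why driving MTTR down drives the cost down too.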
Jason Hand 00:55:33.639 That’s pretty much all we wanted to talk about today in terms of the three products and how these SRE tools all kind of stitch together. We’ve got Telegraf collecting the data first of all, sending that over to InfluxDB, this awesome Time Series Database store. We’ve got Grafana that’s allowing us to build some really amazing dashboards and set thresholds, and there are lots of different ways we can go into Grafana and establish what those thresholds are. And I’ll sort of push on you and say, you don’t want to go in and set a threshold and then not really come back and revisit it. We’re always kind of coming back and saying, hey, what is the best threshold for this? I’m sure if Leonard had more time he could have gone into different ways to have those thresholds move in variation with the system as it grows and scales. Then of course the humane on-call practices was the last bit there.
Jason Hand 00:56:32.615 So that’s pretty much it in terms of how we want to share a very basic SRE toolchain, and I think we’re going to move on to some of our questions here. I know we asked—or we didn’t ask, I meant to encourage you to put some of the questions in the little Q&A field. So I think we have a few here I’m going to read, and then we’ll go from there.
Jason Hand 00:56:54.768 So our first question. Actually, Leonard? If you’ve got your ears on, this will be a question I think may be specifically for you.
Leonard Graham 00:57:01.697 Yep.
Jason Hand 00:57:02.435 What should we be looking at for application metrics?
Leonard Graham 00:57:10.174 So basically I would try to go back to what Jason talked about in the beginning. Or what you talked about in the beginning. What are the things that keep us up at night? In my previous job I was working on a login system for a very popular game, and what was keeping me up at night was two things. It was basically, are we able to sell the game? And are our players, when they bought the game, able to log in? So we had measurements for those two things. Were we able to sell the game and were our users logging in at the rate that we were expecting them to? And if they weren’t, we were going to send alarms for that. So that’s kind of—generally I’d say application metrics, if you work with applications. Of course, if you’re in infrastructure and you work with Kubernetes, then you kind of look at what’s keeping me up at night in regards to Kubernetes for example.
Jason Hand 00:58:07.266 Awesome. Thank you so much, Leonard. Hopefully that answers our question there. And then let’s do one more question here. I can probably answer this one. Well actually, Margo, maybe you and I can tag-team this one. But what are your thoughts on self-healing systems, and can this suggested toolchain address that? Margo, I know in terms of the Influx universe you have a little bit of some ideas, some thoughts and feels on artificial intelligence and machine learning. You want to approach that question?
Margo Schaedel 00:58:42.772 Sure. Sorry. Okay. Yeah, so Influx is optimized for a lot of predictions, forecasting, and machine learning. That’s designed more to be used with Kapacitor, which is our processing engine, which handles a lot of that kind of trying to see how your data’s going to look in the future and predict possible issues. We didn’t really have a chance to talk about the Kapacitor component today, but it is definitely optimized for that. I don’t know if you want to add anything more about the self-healing process.
Jason Hand 00:59:30.455 Yeah, well I think in terms of self-healing, I think the machine learning stuff is going to be a huge lever for us to pull to help us move into this self-healing area. I guess I’m like a lot of people in operations; I’ve been burned too many times with automation, so I have trust issues with a lot of self-healing stuff. So for me, there’s a tool within VictorOps called the Transmogrifier that a lot of our folks use, and I kind of enjoy using it. Like I said, using webhooks or using a chat bot, you can start to do a little bit of executing outside scripts or outside actions to do something on your behalf. For me that’s kind of the sweet spot right now. It’s very exploratory. It’s very, let’s see what we can do to take some of the very simple problems and automate those, but at the same time I want to make sure I’m automating the right thing. Automating the bad things is even [inaudible]. So hopefully that answers that question.
Jason Hand 01:00:34.158 I know we’re getting short on time here, so I think I’m going to go ahead and wrap it up. Thank you so much to InfluxDB, or InfluxData, Grafana. A little bit more, if you want to learn more from each of these we’ve got some links here. Go to influxdata.com/download. They’ve got all of their different open source tooling there. Grafana is at grafana.com/get. You can go check that out. Leonard also shared the play link earlier. And then if you want to go try out VictorOps, go to try.victorops.com/trial. That’ll kick you in there and get you started really quickly. A couple other things, I mentioned some of the books I’ve written earlier. The most recent book I wrote is this one here called Build a Resilient Future Faster. It’s essentially our story here at VictorOps from the very beginning, like I said about a year ago, a little longer, when we had our first conversations, all the way to our very first chaos engineering exercise. It’s kind of a chronicle of what we did, the questions we asked ourselves, how we sort of set up internally. We took the more cultural approach. We did not want to create an SRE team or an SRE engineer. So it’s kind of our story. It’s not a right or wrong and this is how you do it, but it’s how we did it. So check that out.
Jason Hand 01:01:50.981 And with that, thank you so much again, Margo and Leonard, for being a part of this. If you have any questions, reach out to us on Twitter. We always love to engage with people in the community. I guess that’s pretty much it. So thanks again, and we will talk to you next time. Bye.
Leonard Graham 01:02:09.662 Bye.