Webinar Date: 2018-10-16 08:00:00 (Pacific Time)
John Hall and the team at Swiss Re were looking for a solution that could help them with proactive monitoring and trending for both cloud and on-premises environments, as well as Linux and Windows. In this webinar, John will share how they evaluated several solutions while building out the requirements for determining what kind of data to collect and how long they should keep the data. In the end, they realized that they needed to collect a mixture of business process and infrastructure metrics to ensure they had a solution that could do proactive monitoring and trending.
Watch the webinar “How Swiss Re Went Agentless with InfluxDB” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How Swiss Re Went Agentless with InfluxDB”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• John Hall: IT Architect, Swiss Re
Chris Churilo 00:00:00.429 Let me introduce you to John Hall. And John is going to be reviewing the solution that he created with InfluxDB. Don’t be shy. John, as you heard earlier, is a super nice guy. And if you have any questions, just post it either in the chat or the Q&A. So just go to your Zoom app, you should see those icons in there. You can type your questions in there. And so we’ll answer those questions if there’s a natural pause in the presentation, otherwise, we’ll reserve all the questions to the end of the presentation. And with that, I’m just going to pass the ball over to you, John.
John Hall 00:00:36.540 And so, good morning, good afternoon, good evening, depending on your time zone. I’ll run through the sequence of questions that we kind of mapped out as part of this. So if we have other questions at the end or midway through, just feel free to add them in. So first of all, I’ll do a quick introduction. You’ve probably read, I work for Swiss Re. I work a lot predominantly in the database area, although I do touch on quite a number of other subjects as well. And we’re going to look at how we address some certain, let’s say, issues along the way.
John Hall 00:01:18.274 So for those of you who’ve never heard of Swiss Re, and there’s probably quite a number in the audience, it’s a reinsurance company. We sell insurance to insurance companies effectively. And recently, we have started to break out of that trend a little bit, in the US in particular, where we start selling direct. But if you’re interested in the business in detail, I’m sure you can go to the Swiss Re website and find out more. But overall, one of the things, just to set out here, is Swiss Re is not an IT company. So a lot of the things that we need to do from our side are not industry wide, going through a revolution, or change the world. They’re for us in-house. So our solution, in most cases, is something specific to us, but potentially could be reused by any of you.
John Hall 00:02:13.912 Now to set the stage. There was a couple of key things that kind of hit us with why we went this route in the first place. The existing solution was super expensive. I’m not going to tell you how much, but let’s just say it was a six-figure number. Solution was slow to adapt which is kind of a problem with the leasing market that things do change on a regular basis. And it was not really designed for a PaaS offering. So when you have Platforms as a Service in the cloud, since it was agent based, it was an impossible solution for us to maintain because you can’t get the end cloud vendor to install something for you on a service that effectively they’re maintaining. And while dashboards and reporting information can be provided by a lot of these vendors, it doesn’t really integrate into one place very easily. And it meant a lot of work if we were to take reports from different places and then integrate them. That sets the stage as to where we were at the beginning and why we went this route.
John Hall 00:03:20.772 So why Influx? Well, a couple of simple reasons. And I don’t work for the marketing department, just in case you ask. It was fast, it was scalable, it was stable, and it was easy to use, and we had all the existing knowledge because we have guys that worked in a mixture of backgrounds, some in DevOps, some in Linux. So the majority of the crowd that was not up for immediate adoption would be your Windows admins because for them it was something completely new. And consequently, that’s probably one of the sticky areas that we’ll touch on later in terms of where we had the adoption or lack of adoption. But for, let’s say, the majority of us who knew the environment and knew what it was capable of, it was particularly a no-brainer. And we said, “Yeah. That will do what we need.”
John Hall 00:04:19.600 So we have made a list of what the solution had to do. It had to cost less. It had to be easy to use. It needed to work on Windows, and it needed to work on Linux. It needed to store data for auditing purposes, and it needed to store performance data. What we wanted to do with it—we wanted something that we could use our existing skill sets because a lot of the tools that are out there, they force you to learn a new application or change the way that you work. We didn’t want to do that if we already had the skills available. Must work on both public and private cloud, and that way we kind of come back to the PaaS offering scenario where we don’t always have the option to install things though. And it must be really close to real time, as much as possible.
John Hall 00:05:12.218 Another requirement for us kind of moving forward is we want it to be as much as possible a container-based environment because we want to work with the principle of Infrastructure as Code. And to do that, we knew we kind of needed to go for the container base, the ability to spin up repeatable instances without massive work underneath. This also improves our speed for deployment and scalability. It also means that we can move between cloud vendors relatively easy if we needed to. And these were some of the key requirements that we had from an IT perspective going into it.
John Hall 00:05:55.132 Now, some of the things that we encountered from the container side potentially being a problem was first of all getting people to understand what’s the difference between a container and a VM. And the way that this moves in terms of going toward a container versus a VM also meant that a little bit of change in the mentality of the IT department. And a lot of admins are kind of traditional in their way that, “Oh, I run my scripts and I’m done.” Yeah. The concept of checking them in and out of Git was something new to them. And there was a little bit of, let’s say, push back and forth in terms of, “Hey, I’m not writing code. I’m not a developer.” So these were, again, more of a mentality thing, but it was something that was relatively easy to overcome.
John Hall 00:06:45.742 Now, this is where the hard part starts. Whenever you start with collection of data, you need to work out what data you need. And for us, the biggest problem was actually defining what data we needed and how long we needed it. So do we need log data? Do we need performance data? Do we need business process data? Do we need JMX data? And this kind of became the driving factor for a lot of the development of how we collected because everyone had their own set of data to collect. Some people which were focused particularly on websites would tell you that, “This is the log data I need. I need all these error codes,” how long the string was, who was the connection, etc. DB admins would tell you, “I need to know how many sessions there are, and how frequently, and what the performance is.” And then you would have other applications where there’s something simple like, “Hey, is the service running? Nothing super complicated, I just need to know, is it running?” And then the retention of that data. Some people have hard fixed audit requirements that say, “I need to keep it for several years.” Others would say, “Yeah. I only need it for a week. I just need to see how it performed during our latest deployment.” So that forced us to go down the route of realizing amongst other things that we needed to deploy potentially multiple databases with different retention schedules for each application or at least a group of applications that had similar requirements.
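The idea of one database per retention need can be sketched as below. This is only an illustration under stated assumptions: the database names and durations are hypothetical, not Swiss Re's actual setup, and the code only builds InfluxQL `CREATE DATABASE ... WITH DURATION` statements (InfluxDB 1.x syntax) without sending them to a server.

```python
# Hypothetical retention groups: database name -> how long its data is kept.
RETENTION_GROUPS = {
    "deployments": "3d",   # only the current state matters
    "performance": "1w",   # short-term trending
    "audit": "156w",       # roughly three years, for audit requirements
}


def create_database_statement(name, duration):
    """Build an InfluxQL statement that creates a database with its own
    default retention policy (InfluxDB 1.x syntax)."""
    return (
        f'CREATE DATABASE "{name}" '
        f'WITH DURATION {duration} REPLICATION 1 NAME "{name}_rp"'
    )


statements = [
    create_database_statement(name, duration)
    for name, duration in RETENTION_GROUPS.items()
]

if __name__ == "__main__":
    for statement in statements:
        print(statement)
```

Applications with similar retention needs would share one entry in the mapping, and a new need becomes one more entry rather than a change to an existing database.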
John Hall 00:08:29.545 We’ll look into an example of how one started. So this was kind of a conversation around a web admin type scenario where he said, “Okay. I need these logs, and I need to know if I get a 404 or a 500 response.” And the conversation kind of went along the lines of, “Well, that’s great. And but what else do you need?” And then he goes, “Oh, no. That’s all I need.” We’re like, “So you don’t care how fast it is?” And he’d say, “Oh, well, that’s important, but maybe not right away.” So we looked back at how he was currently monitoring it, and sure he had performance data for the server, but he didn’t have any performance data for the website itself. And as you scale out the web farm, you can’t really, let’s say, test off the load balancer because you’re going to get a generic response, and you don’t know which host it’s necessarily coming from. So we needed to dial back and say, “Okay. We need to test this locally, and see what the responses are, and then we can send those responses on.” So we looked at, “Okay, we tagged the environment, and based on those tags, we know whether it’s a dev or a prod instance and what application it’s running.” So this kind of evolved from our initial example into getting things like status codes and response times.
John Hall 00:09:55.595 So you can see here the example where we say, “Okay. This is an app and its prod in the tagging.” But we were retrieving both the time in seconds it took to load and the status code. So if you look at it from the point of view, if the status code isn’t 200, then we consider it to be an error, and we have the ability to track and look at it, and say, “Okay. This is a problem.” But equally, we can use the BPM side of it and say, “Okay. We also look at the time that it’s responding.” And if the response time is a measurement, that we say, “Okay. It should respond in 10 seconds or less,” and it goes over that number, again, we can develop a threshold. And this is where really a time series database comes in really useful for this because we can constantly take those measurements. And if the measurement goes above a value say, “Okay. At this point in the day it went up,” and then look at why.
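A rough sketch of the check described here, assuming a measurement called `web_check` and field names `status_code` and `load_seconds` (all hypothetical; the slide itself is not reproduced in the transcript). It formats one result as InfluxDB line protocol and applies the two rules mentioned: any status other than 200 is an error, and a load time above 10 seconds breaches the threshold. Tag values are assumed to contain no spaces or commas, which would otherwise need escaping.

```python
def web_check_line(app, env, host, status, seconds):
    """Format one web check as line protocol: measurement,tags fields."""
    return (
        f"web_check,app={app},env={env},host={host} "
        f"status_code={status}i,load_seconds={seconds}"
    )


def evaluate(status, seconds, max_seconds=10.0):
    """Return the list of problems found for one check result."""
    problems = []
    if status != 200:
        problems.append("error_status")   # anything but 200 counts as an error
    if seconds > max_seconds:
        problems.append("slow_response")  # loaded, but slower than the threshold
    return problems


if __name__ == "__main__":
    print(web_check_line("shop", "prod", "web01", 200, 1.25))
    print(evaluate(200, 12.5))
```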
John Hall 00:10:53.452 Equally, performance metrics for regular servers became a similar kind of thing. Most people started with the, “Oh I want my memory. I want my CPU, etc.” And these were driven primarily by what they had already as tools. So they were looking to kind of do a one-to-one replacement. And that was the starting point, really, where we said, “Okay. Well, replacing what you’ve got is fine, but if we’re at the point of potentially looking into this and doing a bit more granularity, what else can you have? What don’t you have today that will allow you to do your work better? And how can we drive that down to numbers?” So what we ended up with is a performance metric that started here with just two values. And it evolved into a metric which doesn’t even fit on the slide. This is only part of it. I had to cut out the rest because it really, truly went off the slide several times. And you see how it evolved into how much data can I get and what is relevant.
John Hall 00:12:04.652 So these are examples where we look at performance metrics. And we say this data probably is only kept for a day or two because we’re interested in what is the current state while other ones are kept potentially for several months, weeks, and even years. Because you also have things like, “Is the system compliant?” So we have a set of internal policies that says, “I must meet these criteria.” And you build up the checklist and then you push that checklist into a value. It’s either compliant or it’s not. It’s a one, or it’s a zero, or whatever status code you choose to give it. And in some cases, it can be multiple status codes. So you have as an example, instead of giving it a text value, you give it a number value. So if it’s compliant for first part of the policy, but the second part of the policy not, the return value may be 5,001. Or it may be 5,027 depending on what it is. But being able to create these internal code numbers, you know which part of the policy failed without needing the text, which allows you to very quickly and easily push it into a time series database as well and report effectively on your environment if it’s consistent with what your policy is or is not every time it runs.
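One way to turn a policy checklist into a single number is sketched below. The numbering scheme is invented for illustration: the transcript does not say how values such as 5,001 or 5,027 were derived, so `POLICY_BASE` and the "first failed check" rule are assumptions.

```python
POLICY_BASE = 5000  # hypothetical code range reserved for one internal policy


def compliance_code(check_results):
    """Return 0 when every check passed, otherwise POLICY_BASE plus the
    1-based position of the first failed check."""
    for number, passed in enumerate(check_results, start=1):
        if not passed:
            return POLICY_BASE + number
    return 0


if __name__ == "__main__":
    # First check passes, second fails: the code identifies which part failed.
    print(compliance_code([True, False, True]))
```

Because the result is an integer rather than text, it can be stored as an ordinary numeric field and charted or alerted on like any other metric.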
John Hall 00:13:32.838 Now, this all resulted in an evolution of data over a period of time. So we started off with a very simple, minimal principle of creating one, two databases with different models of retention schedules. And it became very clear quickly that we were going to need more. And at one point, I think we lost track, it was around 10 or 12 different retention policies, but it’s okay because the data is per department or per business need. And you can have as many databases ultimately as you need. So don’t try to do what we did at the beginning which was force everything into one and then realize that that’s a really bad idea. Think about the retention and see if it fits together, if you have multiple groups that will fit in the same retention policy, or if it makes sense to just split them out and create as many as are needed.
John Hall 00:14:37.057 We also chose to go with Grafana as our front end. Really, it was simple choice for us. We happen to have the skills in-house. We already knew how it worked, and we had a couple of dashboards for different things already available. So for us, this was really a go-to choice. It wasn’t intentional at the beginning, and we did look at the whole TICK stack of using the Influx product as well, but since we already had Grafana and we had kind of a desire to not relearn skills at the time, that was our given choice.
John Hall 00:15:20.563 So I guess that kind of brings me to the end of the immediate section on this. Do we have any questions?
Chris Churilo 00:15:33.014 Looks like we’re good right now. If anybody does have any questions, please feel free to type them in the chat or the Q&A. Why don’t we just keep going, John?
John Hall 00:15:50.955 Okay. So realistically, from the slides point of view, I’m done. But let’s talk a little bit more in detail. So what we have ended up is a very flexible solution whereby every time we decided that we want to onboard a new metric—and this is one of the things that I do love about Influx. If I need to capture a new metric, I just point it to the DB and say, “Okay. This is what I collect.” And maybe one of the best examples of this—if I flip back a couple of slides, and we look at the last part on this, where we say, “Okay. We’re writing a measurement to Influx.” And we say, “Okay. We’re using a tag.” By the way, for those of you who are not familiar, this is the PowerShell equivalent. You can do the same in Bash with curl [inaudible]. And all we’re doing is saying, “Okay. We capture a tag, which in this case is a server name, and we’re posting it through.” Now the tag itself can also be an environment. It can be an application.
John Hall 00:16:59.009 What we found works best over a period of time, and I think it’s better captured on this one—yes, it is—was to use an environment and a machine name. So we have an application, and we say, “Okay. This is the prod environment, and this is the host server in the prod environment.” So in this case, we were measuring Web, so the first part is the type. So the first tag in this case is Web. And then after that we’re telling it, these are our prod environments. So this is our dev environment, and then finally, this is the host it’s coming from. This allows us to create dashboards we can group together quite easily. So we say, “Okay, it’s a web server. It falls into this group. It’s a prod, so it falls into that group.” And then you have the granularity of being able to drill down to the individual machine in the dashboard if you really need to. And you can collect any metric that you want. So in this case, we’re collecting just the two. But in some cases, we found that we were collecting 30, 40 metrics, really depending on the application.
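The slide shows a PowerShell call that is not reproduced in the transcript. A comparable write, sketched in Python against the InfluxDB 1.x `/write` HTTP endpoint, might look like the following. The server URL, the database name `webmetrics`, and the field names are assumptions; the request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request


def build_write_request(base_url, database, line):
    """Build (but do not send) a POST to the InfluxDB 1.x /write endpoint."""
    url = f"{base_url}/write?" + urllib.parse.urlencode({"db": database})
    return urllib.request.Request(url, data=line.encode("utf-8"), method="POST")


# Type first (web), then environment, then host, then the two collected values.
line = "web,env=prod,host=web01 load_seconds=1.8,status_code=200i"
request = build_write_request("http://localhost:8086", "webmetrics", line)

if __name__ == "__main__":
    # urllib.request.urlopen(request) would send it; a 1.x server replies
    # with 204 No Content when the write succeeds.
    print(request.get_method(), request.full_url)
```

Keeping type, environment, and host as separate tags is what lets a dashboard group by any of them and still drill down to a single machine.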
John Hall 00:18:04.165 So the good example of this was where we have two runs. So you would actually have a collection that runs once, that goes and collects general performance data like this and has a counter in it. So every third or fourth run, it will then dump additional data, so in this case the log data, to a separate database. And that’s for the long-term collection. So that would be things like the three years’ worth of auditing because you don’t need a second-by-second capture for that kind of data. Every few minutes will do. So they created some nested loops in it.
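The counter idea can be sketched as follows: every run writes performance data, and every Nth run also writes log data to a separate, longer-retention database. The database names and the value of `LOG_EVERY` are hypothetical.

```python
LOG_EVERY = 4  # hypothetical: ship log data on every fourth run


def plan_runs(total_runs, log_every=LOG_EVERY):
    """Return, for each run, the list of databases that run writes to."""
    plan = []
    for run in range(1, total_runs + 1):
        targets = ["performance"]      # short-retention database, every run
        if run % log_every == 0:
            targets.append("audit")    # long-retention database, every Nth run
        plan.append(targets)
    return plan


if __name__ == "__main__":
    for number, targets in enumerate(plan_runs(8), start=1):
        print(number, targets)
```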
John Hall 00:18:45.326 And the other part of this is a lot of the Bash scripts and PowerShell scripts and things that we used. At the end of the loop—sorry, if you see here, it’s a while/true statement. So if the collection fails for any reason, the last part would be to actually restart the server because in a lot of cases where you have things like the audit collection, if the audit collection fails, there’s something wrong. So it would initially try to restart, and you’d have a count loop which was put into the local environment. And if it tried three or four times and was not able, it would just shut it down and would force someone to go look at it. You could even use the collection statements to kind of force a compliance on your environment as well as collect that information also into Influx. So we could say, “Okay. If environment restarted more than X number of times, you could actually push that to the Influx.” So you can simply have on your dashboard a nice little number that says, “Server’s restart count.” And obviously, during patching windows and other maintenance, that’s expected to go up. But if it went up during a normal week, you can easily say, “That’s unexpected behavior. I need to drill down, go have a look.” It takes a lot of the reactive nature out of it and pushes you into a very proactive nature because you know where and what’s going on in your environment, even though you have nothing deployed in terms of agents. You’re literally using a couple of scripts that are nested into your Git repository and are simply copied and run on the machine the moment you fire up a new instance.
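A minimal sketch of that loop, assuming hypothetical `collect` and `restart` callables supplied by the caller: it collects forever, tries a restart when collection fails, and gives up after a fixed number of restarts so that someone has to look at the machine. The returned count is the kind of number that could be written to Influx as a restart counter.

```python
import time


def run_collector(collect, restart, max_restarts=3, interval_seconds=10.0):
    """Collect in an endless loop; on failure attempt a restart, and stop
    once max_restarts restarts have already been tried."""
    restarts = 0
    while True:
        try:
            collect()
        except Exception:
            if restarts >= max_restarts:
                return restarts       # shut down and force a manual look
            restart()
            restarts += 1
        else:
            time.sleep(interval_seconds)  # wait before the next collection
```

A steady restart count during a normal week means the loop is healthy; a rising one outside a patching window is the signal to investigate.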
John Hall 00:20:25.593 And this also meant that for release purposes it was also very quick and easy too. So whenever we have a new release, so you would have maybe a patching cycle as an example, it was very easy to say, “Okay. Well, we want to include this information as well.” You do your quick test in Dev and QA. And then when a production machine gets replaced, so effectively you spin up new containers, you would start collecting that new information. Now, in your traditional environment where you’d have VMs or others, you need to go back and touch every one of those agents and say, “Oh, I need to collect this.” And in a lot of cases, if it wasn’t a, let’s say, easy to manage environment—and I can think of a few. So let’s say that you’ve got something like Datadog or SCOM or some others where you have templates, you would need to deploy those templates. And that’s really a lot of hassle when you compare it with the option of simply saying, “You know what? I’m going to restart these every periodic end period anyway because I’m going to refresh the container. And when I do, it’s going to immediately be right.” That’s a lot of flexibility comparatively. And you don’t need to worry about, “Is it up to date? Is it not up to date?” We can very quickly check.
John Hall 00:21:48.624 And we even got to the point where we said, “You know what? We can do one better. We can push the value. We can actually push if this container is up to date, what version of container is it?” So we would say, “Okay. The build number is X. We will push it into the environment variable.” So you could see how many containers in the environment were on which build number. And as you refresh them, you could see the numbers going down. So you could count the total number of, let’s say, build three, and when build four came out, you count the number of build four, and then you see them gradually actually reduce. So even if you had one that you missed, you would actually know that it was there, able to say, “I still have one, my dashboard, that says it’s on build three. I probably should find out why.”
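The build-number tracking could be summarized along these lines. The container names and build numbers are made up, and in practice the reports would come from the values each container pushes to Influx rather than from a hard-coded list.

```python
from collections import Counter


def containers_per_build(reports):
    """Count how many containers report each build number."""
    return Counter(build for _host, build in reports)


# Hypothetical (container, build number) reports.
reports = [("c1", 4), ("c2", 4), ("c3", 3), ("c4", 4)]
counts = containers_per_build(reports)

# Anything below the newest build seen is a container that was missed.
latest = max(counts)
stragglers = [host for host, build in reports if build < latest]

if __name__ == "__main__":
    print(dict(counts), stragglers)
```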
Chris Churilo 00:22:39.746 So John, I’m going to interrupt you here because William has a question for you. And he wanted to know if you had any issues with network overhead?
John Hall 00:22:50.274 A little. Initially, the network overhead, the problem that we suffered was not internal or even the cloud per se. It was mostly with the interconnects. So what we found the best approach was to have a local repository as much as possible. So as an example, if you’re ever using Azure and you have on-prem servers, pushing the Azure data back down the wire was kind of tricky on occasions because you can’t guarantee that connection. So it was easier to actually have the DB in Azure next to where you were doing the collection and then just pull for the dashboard purposes the data directly. In that way, only the dashboard is doing the WAN collection.
Chris Churilo 00:23:43.778 Awesome. So if anybody else has—William says thanks—any other questions, please put them in the chat. Otherwise, I will take up all of John’s time because I always have lots of questions. So John—
John Hall 00:23:57.042 I love questions.
Chris Churilo 00:23:58.933—this is such an elegant and it feels like a very calm approach and very systematic approach, but often a lot of us when we start to collect this stuff, we are in the middle of some kind of firefighting. We’re just in a hurry to just grab, as you mentioned, the metrics that we’ve already been using whether or not they were right or not. When we’re in that mode, we don’t even think about those things. So I mean, was your environment always like this? Are you guys always just good at taking that step back, thinking about it?
John Hall 00:24:38.429 No, no, no, no. Definitely not. I’ll tell you that I arrived to a relatively on fire service when I started. And it very quickly became apparent to me that we want to take control of it because you don’t want to be spending all of your time firefighting. So and this is a kind of a rule of thumb that I’ve come to over the years, you need to take time in order to have time. So I sat down with a whiteboard and said, “Okay. What do we need, and what do we like to have? What will help us going forward, and what will make it easier?” Now always, ideas evolve over time, so you kind of—you get to the point where you’re thinking, “Okay. That’s nice, but if I also had this.” And it’s not an easy Day One exercise ever, but you can get a very—let’s call it relaxed—process once you start thinking about it and you get out of the firefighting mode. So it’s very nice to get the data first of all, but if you can realize that it takes you maybe an extra two minutes to write an extra line to collect some more data, that you might be interested as well, then it’s totally worth it. So you say, “Okay. If I have this, I can also do that and vice versa.” So once you start with the lines of code and you have the basic one working, it’s very easy to add to it and say, “This is also useful to collect. How often do I need it? Do I even need it to go to the same DB? It might be a different DB.” And as you can see here as an example, we have one set of metrics. If I was to add a second or third point of metric, I can add in another line and say, “Point it to a different DB.” So it doesn’t even need to go to the same place. I can then have different retention policies based on the information I’m collecting. So you can say, “Okay. I have one that’s specific for deployments that collects only deployment versions and keeps an idea then of what’s running in my environment.” My retention period there might be like three days because I don’t care historically what was there. I care what’s there now.
Chris Churilo 00:27:00.561 Yeah. That totally makes sense. I mean, I know in the past—I remember sitting down with various teams and they would—I remember specifically asking a DBA, “What do you need?” And he gave me the longest list of metrics. And I just looked at him and I’m like, “Tell me what this does for you. What does it actually help you to understand?” And I remember him looking at me like I was crazy, but I was just like, “There’s no way that all this stuff is actually going to help you resolve an issue. You’re just collecting everything for the sake of collecting everything.”
John Hall 00:27:36.294 And that’s actually a very good point because we had a really long debate on that subject too. Because one guy who was just like your DBA was like, “But I need all this data.” And I go, “No you don’t. What you need is, ‘Is it okay? Is it not okay?’ Because when it’s not okay, what’s the first thing you’re going to do? You’re going to go look at it anyway.”
Chris Churilo 00:27:58.758 Right. Exactly. Exactly.
John Hall 00:28:00.779 So think about what data you’re collecting. And then maybe a good example of that is actually this one, because this is actually a DBA collection. And that’s great because he’s able to generate graphs, and it shows him historical trends. So he can see: did this problem start last week or is it today? But this is information that he doesn’t actually look at day-to-day. What he looks at is whether these values equal an okay status. So if, and this is one of the great things about having it scripted out—is you can have these values already set with an okay or not okay status within the script. So if you’re expecting them to go higher or lower than a certain value, you can always say, “Okay. This is a warning threshold. Change my one to a two or whatever,” and he gets an immediate alert on the dashboard that something’s not right, and he can go start taking a closer look. And he still has these metrics available in a different DB where he can say, “Okay. Well it was okay earlier, but it’s not okay now. What happened?” And that’s the bit when he can go take a look and see, “Oh, the accounting team is running a monthly process that’s consuming 100%.”
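Setting the okay or not okay status inside the collection script might look like the sketch below. The numeric statuses and the thresholds are hypothetical; the only detail taken from the talk is that a value crossing a warning threshold changes the status from one to two.

```python
OK, WARNING, CRITICAL = 1, 2, 3  # hypothetical numeric statuses for the dashboard


def status_for(value, warning_at, critical_at):
    """Map a raw metric value to a status number using two thresholds."""
    if value >= critical_at:
        return CRITICAL
    if value >= warning_at:
        return WARNING
    return OK


if __name__ == "__main__":
    # A CPU reading of 85% with an 80% warning threshold becomes status 2.
    print(status_for(85, warning_at=80, critical_at=95))
```

The status would go to the database the dashboard watches, while the raw values go to a separate one kept for the later question of what happened.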
Chris Churilo 00:29:14.150 Which is a legitimate thing to do. So yeah.
John Hall 00:29:17.279 Exactly.
Chris Churilo 00:29:18.508 It’s not like that everything’s on fire.
John Hall 00:29:21.894 And being able to go from that, getting a ticket, or getting an end user response. And actually, that’s one of the things, for the web servers as an example. We looked at how long does the process take. And we ended up changing some of the website code to include output to the log that would tell us how long that thread took to load. Now, to give you an example of what that means in real terms, if you are on Facebook and it takes you two minutes to load a page, it still loaded. So you still get a 200 code returned, but no one is going to sit there and wait the two minutes. So you already have a problem, but you’re not aware of it. So what we looked at was when you have a thread, you can output how long that thread took. And we set an arbitrary value. We said, “Okay. This should all respond in five seconds.” And if it’s higher than five seconds, again, it pushes a value and says, “I have a problem here.” And that actually allowed developers to go back and say, “Okay. This is a thread where it’s running slow.” And they could find out where in their application and what the user was doing that was causing it without piling through tons of logs, which, again, trust me, nobody wants to do.
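Flagging slow threads from the log could be sketched as follows. The log format (`thread=... duration_ms=...`) is invented for illustration, since the transcript does not show what the website actually wrote; the five-second limit is the value mentioned in the talk.

```python
import re

# Hypothetical log format: "... thread=<name> duration_ms=<milliseconds>"
LINE_PATTERN = re.compile(r"thread=(?P<thread>\S+) duration_ms=(?P<ms>\d+)")
MAX_SECONDS = 5.0


def slow_threads(log_lines, max_seconds=MAX_SECONDS):
    """Return (thread, seconds) for every logged thread above the limit."""
    slow = []
    for line in log_lines:
        match = LINE_PATTERN.search(line)
        if not match:
            continue  # line carries no timing information
        seconds = int(match.group("ms")) / 1000
        if seconds > max_seconds:
            slow.append((match.group("thread"), seconds))
    return slow


if __name__ == "__main__":
    sample = [
        "GET /cart 200 thread=cart duration_ms=800",
        "GET /checkout 200 thread=checkout duration_ms=7300",
    ]
    print(slow_threads(sample))
```

Both sample requests returned 200, so a check on status code alone would have reported no problem.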
Chris Churilo 00:30:43.288 Yeah. But that’s what we have been doing, right? I think we’ve all had that experience, where—
John Hall 00:30:48.549 Well, we’ve all been in the reactive state, and we need to kind of get out of that defensive mode and into the proactive one as much as possible.
Chris Churilo 00:30:57.814 Yeah. I mean, first of all, it feels better to be ahead, and we want to get ahead. But I think, listening to you, it also probably unearths a lot of—this might seem harsh, but a lot of useless activity that we go through. We’re just collecting a bunch of stuff, doing a bunch of stuff, not necessarily helping us to get to the root cause, it’s just things that are—
John Hall 00:31:21.983 Well, that actually was another good benefit of this. If we look at the data that we collected with the previous application sets that we were using, it was substantial. I mean, we were talking about two, three hundred GB a month of data we were collecting. And since there was only one retention policy you could set, so it was global for the tool, that was really problematic. Because even when you did need the data, but you only needed it for a short term, you couldn’t get rid of it. So we were eating storage constantly for data that we maybe only used for a week or two. So being able to separate out those data collections actually saved quite a bit in storage. I won’t tell you how much. But I’ll tell you it’s in the four to five GB mark.
Chris Churilo 00:32:19.614 I believe it. I believe it. Yeah. It’s almost like we have a hoarding problem, and it’s both for just our past because we just feel compelled to keep everything, but I think the other side is also because the tools don’t help us to separate these things as you mentioned. Can you talk a little bit about—so you talked about how you use some of the off-the-shelf products, and you mentioned that they were a little bit too strict. Can you talk a little bit about that?
John Hall 00:32:53.590 Yeah. So let’s paint a couple of examples. So in an on-premise environment, you can do whatever you want, right? And most of us are quite familiar with the whole—and probably one of my pet hates, we occasionally give too much access to ourselves, but we can do what we want, and therefore we’re used to it. And when you migrate across to a cloud environment, it becomes a very different set of rules. Because suddenly, you are not working within the environment that you control, but someone else owns. So you start to realize, “Oh, I can’t do this because I don’t have that access.” And unless you are having a private built cloud, it’s very difficult to change anything. And particularly, if you’re going for public cloud, there is no option to change it. No Amazon, or Microsoft, or anyone else is going to change their service offering for you. So you look at things and you start thinking, “Okay. Well, how can I get what I need differently? I can’t install stuff on a machine that I don’t have ability to install, but I can get certain bits of information.” So we were able to grab many bits of information differently. So as an example with databases, we couldn’t grab CPU utilization in a lot of cases. But we could grab memory utilization and processor time that was spent on a task. So we said, “Okay. Is this relevant to what we’re doing?” We’re like, “Yes. We can see how much memory’s used. We can see how much is being spent on these processes. That’s good data and good metrics.” And we didn’t need access to the platform necessarily to get that. Equally, even when you do have access to the platform sometimes there are limits to what you can do as well. So you can’t play around with some of the networking functions or build bindings into settings. So even though you have some level of access, sometimes it’s a reduced access.
John Hall 00:34:59.188 So we have as an example, we’re able to log on to the machine. We’re able to look at counters, but we can’t install anything there because it’s, again, maintained by someone else. So depending on your cloud environment, you need to work out what you can and can’t do and kind of reverse engineer that with that information that I can log on with—can I get the performance counters. If yes, then I can do it agentlessly, either remotely or running a script on there. And that’s the other thing. Historically, we were very single-minded in this. We were like, “Okay. Go install an agent,” because that’s what we did. And we didn’t think twice about it because we had full admin on the machine. But when you can’t install stuff on the machine, you’re forced to think about, “Okay. Can I query it remotely? Or can I just run a batch script that will output that information for me?” You need to look at it slightly differently and that forces a whole new mentality, shall we say.
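To make the "script that outputs the information" idea concrete, here is a minimal sketch of how an agentless script might format a reading as InfluxDB line protocol. The measurement, tag, and field names are made up for illustration and are not taken from the Swiss Re setup. The sketch only handles integer and float fields, and it does not show how the resulting line is shipped to the database.

```python
import time


def _escape_tag(value: str) -> str:
    # Line protocol requires commas, equals signs and spaces in tag keys
    # and tag values to be escaped with a backslash.
    return value.replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one point as: measurement,tag=value field=value timestamp."""
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()

    # Measurement names escape commas and spaces.
    name = measurement.replace(",", r"\,").replace(" ", r"\ ")

    # Tags are sorted by key, which is the recommended ordering.
    tag_part = "".join(
        f",{_escape_tag(str(k))}={_escape_tag(str(v))}"
        for k, v in sorted(tags.items())
    )

    field_items = []
    for key, value in sorted(fields.items()):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError("this sketch only supports int and float fields")
        if isinstance(value, int):
            field_items.append(f"{_escape_tag(key)}={value}i")  # integers carry an 'i' suffix
        else:
            field_items.append(f"{_escape_tag(key)}={value!r}")

    return f"{name}{tag_part} {','.join(field_items)} {timestamp_ns}"


if __name__ == "__main__":
    # Hypothetical values; a real script would read them from performance
    # counters or a remote query rather than hard-coding them.
    print(to_line_protocol("mem", {"host": "db01"}, {"used_percent": 42.5}))
```

A script like this could be run on a schedule on the machine, or on a remote collector that queries the machine, which is the kind of choice John describes when an agent cannot be installed.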
Chris Churilo 00:36:02.790 Agreed. I completely agree with you. And yeah, I feel like I’ve gone through the same kind of a path. I’m sure a lot of us have, unless you’re just 23 years old and never had to deal with any on-prem stuff then.
John Hall 00:36:16.961 In which case, don’t worry about it. It’s not your problem [laughter].
Chris Churilo 00:36:22.076 So I just want to ask just a little bit about in your world, what are business process metrics or business process data? What do you guys—what’s the definition for you guys?
John Hall 00:36:35.926 Well, we’re in a nice environment where we don’t need to worry too much about uptime because it’s got a reasonable layer of redundancy. But if you look at it from our point of view, a lot of the stuff that we do is document processing. So imagine that you have, let’s say, 40, 50 resellers that sell to a customer, and they send you back a copy of, “Okay. This is the policy that was taken up by X.” So we have a lot of document ingestion whether that be PDFs or scans or various other formats that arrive. And sometimes it’s system to system. So it might be a JSON or an XML format. Doesn’t really matter how it comes in. So our systems need to then process that information. So keeping a track of, let’s say, how many documents or how much data per second we’re handling becomes kind of our business criteria. Because if it gets too slow, we’re going to get a backlog. And equally, we need to make sure the system is running. So is it up? Is it performing to an adequate standard? We set these values of, “Okay. We’re expecting X performance. And if it drops below that, we need to know: Is it because we don’t have that many documents coming in, or is it because we’re really slowing down for some reason.” And I give you an example, maybe our analytics department during a peak season can easily ingest 100,000 documents an hour. So quarter closing can be busy.
Chris Churilo 00:38:20.817 Yeah. Yeah. That’s a lot of documents. And so then, do those teams also get access to dashboards that have these metrics? Or is this more for the IT teams?
John Hall 00:38:33.180 Well, actually, that was one of the reasons we drove down this lane. You have the IT teams which are happy to know how the systems are running, but you also have the business process managers who want to know if their system is running well. So a bit like probably for anyone who’s had experience with SAP, you have those BPM probes that will tell you, “Is your workflow running okay?” We kind of started to develop those for each application, where it’s like, “Okay. Your values, is it running okay or not okay? Or based on this—” and then give you, “It’s running smoothly. It’s not running smoothly,” and sometimes the number of documents and other things. We can see it’s a transaction thing. So we just count the number of transactions in the last 20 minutes, an hour, whatever. We just output that as a number and say, “Okay. This is how many documents you’re going through per minute.” So they can then see real-time numbers or semi-real time numbers how it’s progressing.
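The counting John describes can be sketched roughly as follows. This is an illustrative sketch, not the probe Swiss Re built: the function names, the 20-minute default window, and the expected rate are all assumptions, and the transaction timestamps would in practice come from the application's own log or database.

```python
from datetime import datetime, timedelta


def documents_per_minute(timestamps, now, window_minutes=20):
    """Count transactions inside the trailing window and average per minute."""
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [t for t in timestamps if cutoff < t <= now]
    return len(recent) / window_minutes


def status(rate, expected_rate):
    """Map a measured rate to the simple okay / not okay signal on a dashboard."""
    return "running smoothly" if rate >= expected_rate else "not running smoothly"


if __name__ == "__main__":
    now = datetime(2018, 10, 16, 8, 0, 0)
    # Hypothetical data: one document every 30 seconds.
    seen = [now - timedelta(seconds=30 * i) for i in range(100)]
    rate = documents_per_minute(seen, now)
    print(rate, status(rate, expected_rate=1.5))
```

Note that a low rate on its own does not say whether fewer documents arrived or processing slowed down, which is the distinction John raises earlier; telling those apart would need a second metric such as queue depth or incoming volume.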
Chris Churilo 00:39:34.526 Right. I think that’s important for these business process managers to see. Because sometimes, probably at the quarter end, you’re a little bit stressed, and you want to be able to look at a dashboard that’s giving the reassurance that, “Oh, this is normal,” or, “This is better than normal,” or worse than normal, and then open a ticket. Versus when you’re in that kind of a stress mode, it’s really easy to just kind of overreact and just say like, “Oh my God. Everything’s slow,” and just panic the IT team. There’s no data, right?
John Hall 00:40:07.920 It’s actually quite nice because you can have a process that’s running terribly slow and someone opens a case like that and goes, “Everything is failing.” And you show them the other 30 dashboards that are all green and go, “No. It’s just yours [laughter].” It doesn’t make them happy, but at least it reduces the escalation level where you’re like, “Nope. It’s just this one. We need to work out why yours is not performing.”
Chris Churilo 00:40:30.299 Yeah. Yeah. But then, now you’re starting to set some expectations with them, right? Because now they know like: “Okay. We’ve got a place to look. We can see: Is there a real problem or not? Is it just us or not?” And I feel they feel like they’re actually being paid attention to. Because for a lot of the business process people, they feel like anything behind the screen is a black box. They have no idea what’s going on.
John Hall 00:40:56.138 Very much so. And it’s also kind of nice because when you have, let’s say, an improvement. So we can see this with development cycles occasionally where developers have put in the effort into this sprint and say, “Okay. I’ve now got a better version of this. We’ve done some performance tuning.” And you can actually look at the previous week and look at the current week and say: “Yep. I can actually see the difference. It’s 1,000 to a minute previously, and now it’s 1,200 a minute. Congratulations guys. There is a difference.”
Chris Churilo 00:41:25.493 Yeah. What I’ve actually heard of, users that actually gather that information before it goes to prod so that can prove to the scrum masters or the project managers, “Hey, this is going to be an improvement.” Or maybe the work that they’re working on hasn’t shown any improvements, so we need a little bit more time before we can push it to prod.
John Hall 00:41:50.326 Yeah. And actually I’ve worked in a couple of companies beforehand. One actually was a software house. And that was exactly—one of the problems we experienced there was they had a lovely document processing application. I can’t fault it for that. But the unit testing said, “Okay. We’ll test it with 1,000 documents.” And at the time, I ended up coming back going, “The customers are complaining that it’s slow, horribly slow.” They’re like, “But it works fine.” I’m like, “Yes. With 1,000 documents. Now put 100,000 documents in it and see what happens.”
Chris Churilo 00:42:23.755 Exactly. That’s the typical ‘works in my environment’ kind of response, right?
John Hall 00:42:29.025 Exactly. And that was also one of the beautiful things about going to a container-based environment. It was easy and repeatable. So whenever you have a scenario that, “Oh, this machine is different,” you don’t have that with containers. They’re exactly the same. So if it didn’t work in one, it didn’t work in the other. And it was as simple as that. So there was a lot of benefit with flipping over to this kind of way forward.
Chris Churilo 00:42:56.459 Yeah. I mean you’re basically eliminating any differences, right? So we can just really focus on, okay, what actually did change? Or what are some of those differences that are causing good or bad things? So in the beginning, you also talked about collecting data in near real time. And that’s always a funny term for us or for us, a collective in IT. What is real time for you guys, because I noticed that you said near real time? And I mean, I often do that as well because I always want to make sure that people know it’s not at your [crosstalk].
John Hall 00:43:34.262 It’s [crosstalk]. I don’t know. But that’s also funny, because depending on the tooling you’re using, it can never be real time. A lot of applications out there will either have a buffer of some kind, or alternatively, will put the information into the DB, but the refresh of the GUI doesn’t necessarily happen fast. So with us, we had the principle of, okay, the dashboard refreshes on a periodic time frame. So let’s say every 30 seconds or 10 seconds. But the data collection is maybe every 5, or longer depending on what the specified value by the team was. So on occasions, it would be 5, 10 seconds delay. On other times, it would be a minute. But if I look at that compared with previous tooling where we would have a five minute delay at least before the data would get to the DB, get to the GUI, and then on some cases another five minutes before it would make its way through the various integrations and generate a ticket for IT to look at, you could easily lose 15 minutes, by which time a user has found a problem and phoned up and already started complaining about this is not working as expected. Now we’re down to sort of a 10, 15 seconds and say, “Oh, my dashboard just turned red. I better have a look at something.” And it’s nice to be able to turn around and say, “You know what? Don’t need to do that.” And we actually started integrating this into the help desk system. So they get notified when the dashboard goes red as well. So they are aware that we’ve already started working on it, which is funny because when you have a customer phone up and say, “It’s not working.” They go, “Yes. We know. The ticket number is X-Y-Z.”
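The arithmetic behind "near real time" can be written out as a small worked example. It assumes the worst case at each stage: an event occurs just after a sample was taken, and the data lands just after the dashboard refreshed, so the stage intervals simply add up. The stage names and the specific numbers are illustrative choices within the ranges John mentions.

```python
def worst_case_delay_seconds(stage_intervals):
    """Sum the per-stage intervals; each stage can add up to one full interval."""
    return sum(stage_intervals.values())


if __name__ == "__main__":
    # Newer setup: 5-second collection plus a 10-second dashboard refresh.
    new_pipeline = {"collection": 5, "dashboard_refresh": 10}
    # Older tooling: about five minutes to reach the GUI, then about five
    # more minutes before a ticket is generated.
    old_pipeline = {"to_db_and_gui": 300, "ticket_integration": 300}

    print(worst_case_delay_seconds(new_pipeline), "seconds")
    print(worst_case_delay_seconds(old_pipeline) / 60, "minutes")
```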
Chris Churilo 00:45:29.620 Yeah. It feels really good for the support desk to be able to do that. That’s pretty cool that you guys have implemented that.
John Hall 00:45:35.228 It also changes the perspective for the customer. If you imagine the customer’s perspective is, “IT is down,” but they phone up and they’re like, “Yeah. But we’re already investigating it.” They now realize that you’re proactively going after it. And for them it gives them that warm, fuzzy feeling that they can trust you to be reactive. They’re not needing to report it anymore. You will take care of it.
Chris Churilo 00:46:00.107 Right. Right. Yeah. It’s setting those expectations. And that’s a great way of setting expectations. Yeah. We’re on it. Got it. We’re working on it. We all want to be able to say that, and we don’t often get the chance. But it’s cool that you guys have kind of laid the groundwork, so you can start to not only say it yourself, but give the support team the ability to say that.
John Hall 00:46:23.473 Now, I’m going to tell you the scary thing. Guess how long it took to implement.
Chris Churilo 00:46:28.368 Oh, tell us.
John Hall 00:46:30.450 Total implementation time—if I discount the meetings of agreeing what we were going to monitor, which took a while by the way. That was probably the largest chunk of time. The actual time to implement and roll out was a little over a month.
Chris Churilo 00:46:45.083 Oh, wow. That’s not bad at all considering that those poor support team probably had like three years of heartache or longer, right? They must love you guys.
John Hall 00:47:02.455 See the problem was never the getting the data because it’s always there. The problem is deciding how to use it constructively and agreeing what data to use. So if I count all the meetings that we had back and forth agreeing. Do we need this bit? Do we need that bit? That probably took four or five months. But the actual implementation was like a month.
Chris Churilo 00:47:27.942 Yeah. We do get a little bit bogged down with a lot of conversations. And sometimes it’s just do it. I have a couple more questions, but before I take up all of your time, I’m going to reach out to the audience. If you have any questions, please put it in the chat or the Q&A. And looks like we have a couple questions. Can you describe your—what is this? Do you have any plans for using any other components of the TICK Stack or any other technologies for the future implementation of this?
John Hall 00:48:05.145 Yes, we do. I can’t say too much about it, but we do actually have a dedicated monitoring team. And they are looking at how many more of these we can implement. I mean, just to set the stage, we haven’t completely replaced our original tooling yet. It’s kind of being gradually phased out. And what they look at is they’ve got about four or five different tools. So we actually do use Datadog, as an example for some stuff. But we’re looking at how much we can switch over to something like TICK Stack to do an equivalent. Because the advantage some of the teams went for Datadog was because for them, it was a simple installation and run. And then we said: “Well, actually, TICK Stack can do that too.” And it became an education one because it was easy for a manager to whip out his credit card and say, “We’re going to do this.” And then we come back to them and say, “Why? Why are you doing this?” So there is a little bit of a disconnect sometimes. You see the same with cloud environments where someone just whips out a credit card and starts firing up stuff.
Chris Churilo 00:49:13.129 Right. And I think there is also the impression that as quick as it is to throw out a credit card, it’s also—the impression is, “Oh, it’s just as easy for my team to just set this up, and everything else is hard.” I’ve heard that a lot with some of these solutions.
John Hall 00:49:31.118 Yeah. And this is the fun one because I’m looking at it from the point of view where I’m going, “Okay. You got out your credit card as an example. You still need to set the templates for Datadog. It’s not going to just collect stuff, at least probably not what you want. And then you need to go to the dashboard and you need to configure it.” And I’m like, “That’s exactly the same as we’re doing.” So the time investment is the same, and the only difference is you’re getting a monthly bill at the end of it for an external party.
Chris Churilo 00:49:59.627 Right. That you can’t really modify, or you can’t reset the retention policies and all the kinds of great stuff that you described earlier.
John Hall 00:50:08.216 Yep. And you’ve also now lost the support of the internal IT because you can’t monitor or change it. So the beauty about this is because everything here is basically set up internally, we know how it works.
Chris Churilo 00:50:22.813 Right. Right. Exactly. Exactly. So you have much more control over basically your destiny.
John Hall 00:50:30.375 Yeah. And we can say, “Okay. This is an example of what we did for this other department. Is it something similar to this? And can we reuse it?” So the more we do, the faster it gets.
Chris Churilo 00:50:41.135 We have another question in the chat. Just wondering about the architecture diagram that shows the use of InfluxDB in your solution. So you talked about you have multiple versions of InfluxDB with various retention policies. Is this all just hosted centrally? Or can you describe a little bit about what that architecture looks like?
John Hall 00:51:08.600 There are quite a few actually. So for us we needed to put in a proxy layer to make sure that we could consistently get it updated in case we were doing any maintenance work on the InfluxDB. So we do actually have multiple servers from that point of view with the same databases. Equally, we split out some of the, let’s say, development and prod tiers as well, not really for load purposes or anything else, but just so it was easier, so that we had the ability to update Influx in a development tier first before doing prod. Not that we’re expecting any issues, but it’s always nice to separate them out. And then we also, again, for some, I emphasize, some cloud providers, we noticed there was a latency delay depending on the location. So we did have—or we do have, rather I would say, some Azure stuff sitting over in US. And for a Swiss-based company, that’s kind of a little bit of latency across the ocean there. So consequently, it was easier to set up a local one. But if you’re not doing such large hops in terms of if you’re closer to the data source, then it’s less of an issue. But really, it kind of depends on environment.
John Hall 00:52:38.487 It’s something that’s worth checking out and testing yourself in terms of how much your latency is. If you’re looking at a collection that’s every five seconds, and your latency is one second, it can become an issue with large data sets, if it’s super big. But realistically, we’ve almost never had an issue where it was that large. It was just more—it was more convenient. Because when you need to get data back and forth, it’s always a matter of bandwidth. So if we were pushing stuff from the US to Europe all the time, then we’re chewing up the line for no good reason, when we could put it just locally and then just drag only the dashboard report across.
Chris Churilo 00:53:22.404 Cool. Hopefully that answered your question, Anya. She says, “Yes. Thank you.” Wow. It’s already almost top of the hour. I didn’t realize I’d taken up all your time with my questions. I apologize to the audience. I’ll just keep the lines open for two more minutes. If you do have questions, please let us know. And if you have questions later on, just feel free to send me an email, and I’ll be more than happy to forward on the questions. Looks like we do have another question. So let’s see. Do you have any advice for someone seeking to choose a time series database and struggling with costs when considering other solutions?
John Hall 00:54:11.671 Ooh, there’s an open-ended question. It kind of depends a lot on your environment—would be my honest answer. I would say Influx is very flexible, and it’s very nice. And particularly if you just need to get off the ground and started, there’s even perfectly good TICK Stacks available on the hub. So you can kind of just go out, grab an instance, get it spun up, and start from the beginning with a proof of concept practically for free. I mean, you just need a machine that’s got Docker installed. There are also other products out there. Prometheus is an example, which is an okay product. I’m going to say okay because I like it in one respect that it works, and it’s perfectly functional. It’s a little tricky in terms of learning curve. And that’s one of the things where I like Influx a little better because it’s frankly documented very well. And it’s at least for me, because I have a DBA background, the concept of just having a database that I can do a quick query against makes it very quick and easy for me. There are other products. Those are probably the two major ones. But it really does depend. Prometheus is an example [inaudible] is more Amazon based. So integration with some other services can be tricky, but there are plug-ins for that. So it kind of depends on your need. But if you want to send the question over or have more detail to share later, I’m happy to take that into more depth.
Chris Churilo 00:55:47.654 Well, so, Anya, I can do that. So if you want to just shoot an email, I can forward that over to John. Cool. Thanks. Well, John, this was really great. Thank you so much for your time today. And for everybody that’s on the call, this is being recorded. I’ll do a quick edit and then upload it, so you can take another listen to it. And I really want to thank you again, John. And thank you for your time. Thank you for sharing a lot of these really great best practices that hopefully, we can all emulate so we can have a nice calm environment similar to what you guys have. We can feel like we’re not always interrupt driven and running around with all these fires.
John Hall 00:56:30.604 Well, hopefully I didn’t put anyone to sleep. And I must admit, it is nice to have a calm environment. You get to spend lots of time on coffee talks.
Chris Churilo 00:56:39.502 Well, and then we get to do better things, right? So we all kind of are dreaming to get to this point.
John Hall 00:56:46.862 Yeah. And really the goal is to be in the state where you have some form of control over the environment, when you are in a reactive state.
Chris Churilo 00:57:00.702 Awesome. Excellent. Thanks again. And thanks everyone for joining us. And we will see you again. This Thursday we have, of course, our training. So feel free to join me again. And we’ll see you again. Thanks, everybody.