How Vonage Gains Insight into Their Stack with InfluxDB & New Relic
Jack Tench is a Senior Software Engineer at Vonage, a cloud contact center solution that helps organizations get up and running quickly and cope with fluctuating demand without needing the redundancy of traditional systems. Jack has a love for problem-solving, technology and engineering which has led to a natural passion for programming and his current career.
In this webinar, Jack will be sharing how Vonage uses InfluxData to gain insights on the performance of their cloud offering to ensure they maintain their 99.999% platform availability. Specifically, he will share how his development team uses InfluxData in conjunction with New Relic for Application Monitoring.
Watch the Webinar
Watch the webinar “How Vonage uses InfluxData and NewRelic to gain insights of their entire stack” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How Vonage uses InfluxData and NewRelic to gain insights of their entire stack.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Jack Tench: Senior Software Engineer, Vonage
Jack Tench 00:02.200 Okay. So I’m a software engineer at Vonage. You probably haven’t heard of us before, but we’re a cloud software company based in the UK. We started sort of our own 10 years ago, now, we’ve recently grown quite a lot. And what do we do? We make a cloud-based contact center solution. Now that probably doesn’t mean anything to anyone, didn’t mean anything to me when I started. A contact center is just a nice word for call center. So essentially, the main thing we focus on is both inbound and outbound phone calls, emails, social media tweets, that kind of thing we handle for customers. We can queue them, prioritize them, optimize the way they’re handled. So there’s both, there’s sort of inbound support kind of calls and outbound sales calls. And it’s something that’s traditionally very, very sort of physical across the—you pay Cisco to put an expensive rack of equipment in your basement to handle 1,000 phone calls, and you do all your emails in Outlook, that kind of thing. We aim to sort of move all that into a modern cloud application. So that you can scale up and down on demand. For example, a lot of our customers are people like postal companies like DPD or we work with universities, companies that have very seasonal flows in the amount of work they have to do. So buying dedicated hardware to handle phone calls doesn’t really make much sense. You want to be able to bring on teams that can work from home temporarily and use the system from wherever they are. That gives us a lot of power to do some really cool stuff. So we integrate with Salesforce mostly, but we can integrate a load of other things. And we do a lot of work around the optimization machine learnings. The idea is you don’t have this really rubbish—not to turn it into a sales pitch, I’m just a developer [laughter]. You see, you don’t have this really rubbish experience of phoning up, wait in a queue, telling a computer your information then having to repeat that to a person. 
The idea really is you go through to somebody who already knows exactly everything about you and it’s preferably, probably the person you spoke to last time you phoned in, for example. So really cool.
Jack Tench 02:28.624 We’re based in Basingstoke, mainly. There’s about 150 of our 400 staff are in a development department, which is based here. We also have a team of developers in Poland, in Wroclaw. We do have offices around the world. This is where it starts getting sort of complex. And this is where monitoring tools become really important for us as a company because we release globally to all of our customers around the world. So we have customers in over 100 countries now. Something silly. So the challenge around that, particularly when we are looking for really high availability because customers—phone calls are something different, unlike Facebook. If you try and view a Facebook profile, it doesn’t load immediately, you don’t really think much of it. GitHub was down earlier, that caused a bit of chaos but, again, it’s a little bit more normal. If people can’t make phone calls, they tend to get upset very quickly. We could have a couple of seconds of downtime, we could drop a couple of dozens calls but if somebody’s been waiting in a queue for 10 minutes and then the call suddenly cuts out, that results in people being really upset for one error in a million compared to more normal error rate. So we have to make sure that it’s incredibly reliable with a really high availability. And as we’re pushing to release a new version of our product once a week. We’re now trialing twice a week, moving towards daily and a lot of our newer micro services-based infrastructure is releasing multiple times a day. The one thing that’s really important to us is being able to tell the difference between spot changes between a previous version and new version and spot things particularly quickly. So older monitoring tools we had before Influx, tended to have quite low monitoring resolutions. One of our main providers there was Opsview and it was sort of common to sample data once a minute, once every five minutes. 
Again, with things like phone calls and Twitter, if you have an issue for five minutes, your monitoring software can only tell you after five minutes. It’s not going to be your monitoring software that tells you. It’s going to be your customers on the phone. So for us, that was really important.
Jack Tench 04:54.211 Just a quick look at all of our other offices around the UK. So I’m here from Basingstoke, but we do have sales bases in the US and, as I mentioned, a development team out in Wroclaw. Who am I? Just quickly, I’m a senior software engineer. Been at Vonage for three years. I’m based in the platform development team. So we are the team that sort of introduced and owns the monitoring solutions we’ve got, and that included bringing Influx into our architecture around two years ago now. Or a year and a half ago. So we’re there to sort of support all of our main feature teams in delivering the functionality. And we’re affectionately known as the Platform Sonic team, to take a character from a platform game. And we also like to use a phrase, “We’ve got to go fast.” So one of the key things is we want a monitoring solution that we can add things into really quickly.
Jack Tench 05:53.876 So I’ve mentioned before, we are a global company, and we have got six different clouds around the world. We’ve got a mix of AWS and traditional data centers. So it is not all AWS architecture and solutions. We’ve got VMware hosts, a lot of physical infrastructure to deal with as well. We have around 1,000 servers. I think recently that has just flipped to being a majority of Linux CentOS servers, but we were traditionally a Windows shop, mainly dealing with .NET and Windows servers, and we have a hell of a lot of hosts to keep an eye on. We serve around half a million requests a minute. So we’ve got a very high throughput, which isn’t necessarily a good thing, not necessarily good to brag about. But when dealing with calls and lots of people sitting at a desk waiting on live data, the request rate tends to be pretty high. And we keep a really low response time on those, again, as I say, because of the SLAs for phone calls. And when you interact with something over the phone, you expect it to be quick, not slow. Talking about phone calls, at our peak during business hours we have around 5,000 concurrent phone calls we’re handling. Which sort of totals up to around a million a day, on our busy days. We’ve got more programming languages than I can count. So, as I said, we’re traditionally Microsoft and .NET, which runs our core web servers. But as we’re branching out more and more into microservices, sort of machine learning and optimization systems, data processing for our analytics and statistics, we’re branching out to more and more languages. Yeah. So we don’t have a one-size-fits-all solution. We need something that can easily integrate with nearly any language. If we want to time how long a method takes, we don’t just want to do that in C# code, we want to do it in Go code, and Node.js code, and Python code. So that easy pluggability was a big dependency when we were choosing monitoring solutions.
Jack Tench 07:57.050 So moving on to our TICK stack. So, as I’ve said, we’ve been on it for about two years. We’re using the InfluxDB Cloud offering. So InfluxData host our stack. They host our InfluxDB instances for us and a Kapacitor instance. So we have two InfluxDB clusters. We’ve got a dev and test one. It’s a little bit smaller, more lightweight, and it allows us to sort of trial our changes and just keep the metrics for our test systems separate from production. It hasn’t been an issue recently, but in the earlier days of the InfluxDB architecture, there were a lot more upgrades with breaking changes, and this sort of separation helped us test those in a dev environment before rolling them out to production. And we’ve managed to do bad things before, writing bad queries, that kind of thing, definitely some continuous queries, taking down our dev cluster. So it’s good to have a separation there.
Jack Tench 09:02.524 What do those clusters handle? Quite a lot. So I think this is just looking at our production one. We have around 2,000 measurements with 200 million series running around 20 queries per second from the various dashboards up around the office. And writing quite a lot of data, over 20,000 points per second. And currently, with that, it runs very smoothly. We have very little issues. We’ve hit a few hiccups from earlier versions of InfluxDB We’re currently on 1.2. And with the resources we’ve got, it’s very stable and very responsive. So we use Grafana currently; mainly for our graphing and alerting. We don’t really use Kapacitor so much for the alerting. We find it is a little bit too technical for all of—to be able to sort of open it up to our entire department through our sort of 100-odd users. So Grafana’s been much better for that, and it’s also mature earlier, has a bit more fully featured around the alerting. Yeah. So Grafana gets used very heavily. It’s now become the sort of go-to—combined with Influx, it’s become a go-to way if you want to graph something, and show it on a wall, or create an alert on it. If you’ve got a metric, this is the stack we use. And traditionally, as you’ll see later on, now what we started doing is if we have data that is in another system, the first thing we do is look at how do we port that into Influx. And that’s quite an easy thing to do, as well. And it gives us this powerful, single dashboarding alerting solution off a single data source. We have a small Kapacitor node that we’re sort of only just getting into. We’re a little bit behind. But because Kapacitor’s been a little bit less of a mature product, it hasn’t had quite as much love from our dev team. But we’re using it. Starting to use it for a few more things like data roll-ups. I’ll talk about it later on. So just to prove we do use it, this is just a photo of one of the walls in the office I took. 
Out of all of the monitors there, we only have a couple that show some custom dashboards. The rest is entirely Grafana running on InfluxDB. So we’ve got that up all around the office.
Jack Tench 11:26.066 So if we go back to the start, so late 2014 was when we chose Influx. We started trialing it in just sort of a quick spin-up your own instance, which is very easy to do. Running it in AWS, managing the stack ourselves. We were comparing it to things like Hosted Graphite, Datadog, and New Relic Insights. A lot of those had their own limitations. Graphite, we found, did not scale anywhere near to the amount of metrics we want to do with, as I said earlier, sort of writing to 20 to 30 thousand points a second. Graphite quickly just did not scale to that level and also the hosting fees—because of the poor scalability, the hosting fees are going to be a lot more expensive than it would have been for the InfluxDB Cloud offering. We looked at Datadog. One of the main things that put us off a little bit at the time is their—while we were attracted to the open source nature of InfluxDB and we’ve had a lot of custom with data sources that we want to be confident that we could throw in query in our own way and have the power and flexibility to use the tool rather than getting another sort of slightly more locked down proprietary system like Opsview and New Relic which we also use, but have limitations with because we don’t have the power of having access to the database itself. New Relic Insights was great. The only problem is at the time it was nowhere near as fully featured, and it also does not scale anywhere near as well, particularly when it comes to cost. Some of our main requirements, as I mentioned earlier, was the high resolution. So, at the time, our main metric solution was Opsview and we sort of had one-minute, five-minute intervals and it was almost useless. There’s no way that we could react to change at that rate. The only thing we reacted to was customer complaints [laughter]. 
So the one-second, or around one-second, resolution on metrics allowed us to sort of very quickly spot when something was going wrong, or when the behavior was changing in the product, and react to that far quicker. And also to test out changes, so we could easily make a change to infrastructure and see the effect immediately.
Jack Tench 13:46.554 And we had a good few other issues. A lot of time things don’t happen on one-minute intervals. We had a lot of different incidents where we would have CPU spike for 30 seconds, requests slow down for 30 seconds, and you really need that high granularity to work out what caused what in our pipeline of applications; what was the first thing to start going wrong and even just seeing these really small blips. We wanted to sort of have plugin-based collectors and agents. So we use Telegraf now for that and it is great to be able to just ease—rather than having to write everything ourselves, we want to be able to easily collect data from all of the different services and tools running. And as I said we want to add custom data, we were—we had a lot of different—we have a data in a lot of different places already and we want to be able to import that very easily. Or in the case where we have a live instance, we’re investigating some weird issue, we want to be able to write a quick script to count how long, how often something happens, or time how long something is taking. I’ll talk about some of the cool use cases we’ve got later on. But with us, that’s now become a really powerful tool to have in our toolbox is we can really now within sort of 10 minutes, we can whip a script together, throw it up in to say an AWS Lambda that starts poking services and gathering metrics and firing them up to Influx. Another thing we want is control over roll-ups and retentions. That was a limitation with some of the ones above, particularly New Relic Insights. Didn’t really have the control over retention times and how the data’s rolled up.
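A minimal sketch of the kind of quick "time how long something is taking" script described above, written in Python as an illustration. It times a block of code and formats the result as an InfluxDB line protocol point. The measurement name, tag names, and the `timed` and `to_line` helpers are hypothetical; only float fields are handled, and actually sending the lines to a database is left out.

```python
import time
from contextlib import contextmanager


def _escape(text, specials):
    """Backslash-escape the characters that line protocol treats as separators."""
    for ch in specials:
        text = text.replace(ch, "\\" + ch)
    return text


def to_line(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp."""
    name = _escape(measurement, ", ")
    # Tags are sorted by key so the same series always produces the same prefix.
    tag_part = "".join(
        f",{_escape(key, ',= ')}={_escape(value, ',= ')}"
        for key, value in sorted(tags.items())
    )
    # This sketch only supports float fields, written with fixed precision.
    field_part = ",".join(
        f"{_escape(key, ',= ')}={float(value):.6f}" for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


@contextmanager
def timed(measurement, tags, sink):
    """Time the enclosed block and append one line protocol point to `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        sink.append(to_line(measurement, tags, {"duration_ms": elapsed_ms}, time.time_ns()))


lines = []
with timed("service_check_duration", {"region": "eu", "service": "queue"}, lines):
    sum(range(10_000))  # stand-in for the thing being measured
```

Keeping the tag set small and fixed (region, service) matters here, for the cardinality reasons discussed later in the talk.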
Jack Tench 15:22.106 Our first use case is quite an interesting one. So this is sort of what we built the stack out for initially, which was our database clusters. So we’ve got, I don’t know, probably something in the order of 40 or 50 different database servers around the world. Masters, slaves, reporting slaves, that kind of thing, that are all kept in sync with each other. So writes are made to masters, they replicate down to their slaves, and they then replicate across regions to our different data centers and availability zones. And we found that when we were scaling, and we were sort of doubling our customer base year-on-year at the time, one of the bottlenecks we often hit was the database. It’s a common bottleneck to have. And what that would often affect is the database replication. So we’d find slaves would start not being able to keep up, which would cause your application state to become inconsistent. And the problem is that could happen quite quickly, and we wanted to be able to react to it quickly, and our existing tools just didn’t allow us to do that. So that was our first great example. So we used a tool called pt-heartbeat, which is part of the Percona Toolkit, and essentially what it does is it has its own tables. It writes into your masters and works out how long that write takes to actually show up in the slaves. And, at the time, we wrote a quick Python daemon, which we dropped onto our database servers, that ran that script every couple of minutes or every second and sent the result over StatsD, which is a UDP protocol, to what was at the time a StatsD server. And that could be rewritten now as a Telegraf input plugin, which would be quite cool.
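A rough sketch of what such a daemon loop could look like, not the actual Vonage script. It assumes the lag value comes from a caller-supplied `measure_lag` function (for example, one that wraps pt-heartbeat), that the collector listens on a UDP address you pass in, and that the receiver accepts InfluxDB-style tags in the metric name, as Telegraf's StatsD listener does. The metric and tag names are made up for illustration.

```python
import socket
import time


def format_gauge(name, value, tags=None):
    """Build one StatsD gauge datagram, e.g. 'db.lag,host=db01:0.25|g'."""
    # Tags are appended to the metric name in InfluxDB style (an assumption
    # about the receiver; plain StatsD servers do not understand tags).
    tag_part = "".join(f",{key}={val}" for key, val in sorted((tags or {}).items()))
    return f"{name}{tag_part}:{value}|g"


def report_lag(measure_lag, address, tags, interval_seconds=1.0, iterations=None):
    """Send the current replication lag as a gauge every `interval_seconds`.

    `measure_lag` is a hypothetical callable returning the lag in seconds.
    `iterations=None` loops forever, as a daemon would.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    try:
        while iterations is None or sent < iterations:
            payload = format_gauge("db.replication_lag_seconds", measure_lag(), tags)
            # UDP is fire-and-forget: a lost packet just means one missing sample.
            sock.sendto(payload.encode("ascii"), address)
            sent += 1
            if iterations is None or sent < iterations:
                time.sleep(interval_seconds)
    finally:
        sock.close()
    return sent
```

A call such as `report_lag(read_lag, ("127.0.0.1", 8125), {"host": "db01", "region": "eu"})` would then run until stopped, where `read_lag` is whatever function produces the lag value.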
Jack Tench 17:13.133 So this is all the infrastructure we came up with initially. So the problem is, at the time, two years ago, we didn’t have a lot of our infrastructure automated. We were still in the process of rolling out Chef to do that so that we could make automated changes across those thousands of servers. So, at the time, it had been quite a burden to say, “Okay, on every server we need to install a new agent and configure it, and manage that agent.” So what we did instead—we’ve been quite happy with this and I think it’s a pattern we’ll keep for a while—is have a centralized StatsD server. So this means that our different applications—in this case, initial two was our .NET web servers and our MySQL servers running this little python script. Could send the data over the StatsD, the UDP protocol to a central fallout region, StatsD server, which would aggregate it all, and then write that to InfluxDB where we could view it in Grafana. As time’s gone on, we’ve replaced StatsD with Telegraf, so now that is the—we still have a central server in every region that’s running Telegraf, listening for those UDP packets, and sending them on to Influx. This helped fix some scaling issues we had with the old Node.JS StatsD server. And it has much better support for the InfluxDB-style tagging of metrics, and also had the benefit of now it’s Telegraf, it’s a proper metrics gathering agent. So it can also monitor the CPU disk network, that kind of health for us, on that box. Once we were comfortable with Telegraf, we realized it’s actually a really good agent and what we can now start doing—now we have nearly all of our infrastructure automated. It’s really easy to add a new service onto any one of our servers. So we can now run Telegraf locally on our—for example in this case, we’ve migrated MySQL to Telegraf. We’re running it locally there. 
It means as well as gathering a little heartbeat tool to show replication lag, we can also gather all of the different MySQL status metrics with just a couple of lines of config in the Telegraf plugin. That’s being really popular now as we’re moving towards docker and containerization. It’s really handy to be able to easily drop this agent into every single one of our services as they spin up and spin down to monitor both the local system health and easily plug into whatever service it happens to be if it’s MySQL or RabbitMQ, our queuing engine. So this is what we’re moving towards now and it’s really powerful. We’re really happy with using Telegraf as a local agent. But, as I say, we’re still keeping Telegraf as a central server to listen to StatsD from anything. So that means we can still easily pop some StatsD code into any service, whether it be legacy or experimental, fire over there, make its way into Influx.
Jack Tench 20:15.351 So the database replication delay that I was talking about, it’s quite an interesting use case. Here’s sort of an example of our dashboards. So here we’ve got, slightly blurred out [laughter], all of our different database servers in our different regions and how far behind they are from the true master in that region, or even globally. So you can see normally nearly everything is at zero. This is on a very small interval. We can see the occasional blip to a few milliseconds of delay, which is the way things normally behave. But where it looks quite interesting, we have an incident. This isn’t actually something going wrong. This is one of our staging environments, which is a little bit under-resourced. And overnight, when we’re running reporting scripts, it can fall behind. Because what we found, after we wrote this tool, is that it’s good at spotting these incidents where the databases are taking too long to get into sync with each other, but that measure is also one of the best measures of database load. As soon as a database is starting to struggle at all, rather than looking at CPU stats or network stats, the far, far easier way to tell if that DB is starting to struggle serving requests is whether this replication delay goes above zero. So here we can see, in a cluster in one of our staging environments, all of the servers at the same time start getting high load on them. They all started falling behind in their replication. And then we can see each one start to recover and catch up, and then we can see where we’re back to normal. At this point this server is probably not behaving normally, and we know that it might be causing weird state issues in the application for requests going to it, because they’re going to be, in this case, three minutes out of date, and we can then see that database server recover and be consistent with the master.
And this has been—it was a really powerful first use case of the tool and kind of sold it to the rest of the department.
Jack Tench 22:15.829 However, it’s not all roses [laughter]. There have been some growing pains. So we started using InfluxDB in November 2014, at version 0.8, and it was quite early in the life of Influx and the development pace was very rapid, which was a great thing. Nearly every issue we had, nearly every limitation that we’ve had with it, has been on the very immediate road map and has been solved very quickly. And we were aware going in that it was a young product and was probably going to change. But this has caused a little bit of friction, a little bit of overhead. We’ve had quite a few upgrades between version 0.8 and 1.2. I think there was a big database schema migration in 0.9 and 0.10 and possibly another one in 1.1. So those have caused a little bit of pain to make sure that we managed them properly. We’ve had API changes, which meant that we’ve had to make sure our tools were up to date. For example, we used to write to InfluxDB using JSON. That protocol was deprecated. Some of the reading queries and APIs have changed slightly, but the community is pretty rapid, and because we’re using open source graphing tools and open source aggregators and plugins to write there, it’s normally been pretty easy just to upgrade everything and move on. We’ve also had some changes around the InfluxDB Cloud: some of the infrastructure and the way that’s managed has changed since we started, but that’s definitely for the better. When we first had our stack hosted with them, we used to have a few issues around reliability, and particularly around monitoring. It often seemed like we were the ones to constantly raise cases and spot things going wrong. But that’s definitely turned around now it’s on the new InfluxDB Cloud infrastructure. We’re finding that we get emails from Influx when they notice things might be about to go wrong, rather than us contacting them and raising a case. So that’s massively improved. So probably worth the pain. Yeah.
And, in the end, we never really had any data loss or any heavy impact on the true end-users of our stack, the ones flowing data in and graphing it and relying on the alerts. In general, we’ve been able to keep a high availability there. And I think now the stack is relatively stable and I think it’s probably worth it.
Jack Tench 24:54.344 Moving onto sort of scaling, sort of some of the other issues. We haven’t really had too many issues around scaling. Some of the ones we’ve had have been fixed by newer versions. The main one that hit us a few times now is series cardinality. So series cardinality is the number of unique tag combinations you have in your data set. So a cardinality of one is if you’ve only got one tag with one possible value. But when you start to have more and more tags with more and more possible values on a single measurement, you end up getting the product of all of those possible values becomes your series cardinality. And the way InfluxDB is written, the higher your series cardinality, essentially the more memory usage. So it becomes memory-bound. So we’ve had issues there where we’ve essentially ran out of memory on the stack and it all starts to fall apart a little bit. So we currently do have a lot of series. We’ve got 200 million, which is sort of on the upper end for the current version. But it is a very responsive cluster. That doesn’t mean that we haven’t had issues that I talked about. The main incident has been when—and it’s an easy mistake to make because if you’ve just been introduced to InfluxDB, tagging, and these stats, how to use StatsD in tags, and you go out there and you’re like, “Oh, right. I want to see how long this piece of my code takes to run,” and you think, “Oh, what tags can I put in? It’ll be handy to put in a tag for maybe one of the parameters to this command.” And another one that seems to happen a few times now is think, “Oh, you know what? I’ll put the transaction ID or the call ID or some generated GUID in as a tag, so that when I’m actually looking through the metrics, if I see a really high one, I can grab the GUID or I can group filter by GUID and then I can then see associated metrics. 
Or go look through our logs to see what happened with that particular request.” Problem is a GUID is unique and because you now have basically an infinite number of possible tag values, your series cardinality is just going to go up and up and up and up and never come back down again. And as helpful as these tags seem, it’s happened a few times where we’ve started to use something like that and it has caused the database cluster to essentially fall over.
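To make the arithmetic concrete, here is a small Python illustration of series cardinality as described above. The tag names and value counts are invented for the example; the point is only that a bounded tag set multiplies out to a fixed number of series, while a per-call unique tag adds one new series per call and never stops growing.

```python
import uuid
from math import prod


def worst_case_cardinality(tag_values):
    """Upper bound on series for one measurement: the product of the
    number of possible values of each tag."""
    return prod(len(values) for values in tag_values.values())


def observed_cardinality(points):
    """Count distinct (measurement, tag set) combinations actually written."""
    return len({(name, tuple(sorted(tags.items()))) for name, tags in points})


# A bounded tag set: 3 regions x 10 hosts x 2 statuses.
bounded = {
    "region": ["eu", "us", "apac"],
    "host": [f"web{i:02d}" for i in range(10)],
    "status": ["ok", "error"],
}
bounded_series = worst_case_cardinality(bounded)

# The same measurement with a unique call ID as a tag: every call is a new series.
calls = [
    ("make_call_duration", {"region": "eu", "call_id": str(uuid.uuid4())})
    for _ in range(1000)
]
unbounded_series = observed_cardinality(calls)
```

Here `bounded_series` stays at 60 no matter how many points are written, whereas `unbounded_series` equals the number of calls made so far.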
Jack Tench 27:21.583 Hopefully, this should be solved, or improved, in InfluxDB 1.3. Putting random values in tags still isn’t a good idea, but the new time series index engine should reduce the memory load of high series cardinality by being able to take series that aren’t being used out of memory, and it also adds some show cardinality queries to aid debugging these. So this is sort of our biggest lesson to take, really: be careful about your tag values. Do think about them, and make sure that people are aware of the impact of putting essentially unique values into tags. So because Show Cardinality isn’t around yet, it’s coming in InfluxDB 1.3, we’ve got a little script. It’s a PowerShell script because I’m a horrible, dirty Windows developer. It essentially goes to the InfluxDB database and grabs all of the series, which will be a massive JSON file, because there will be a line for every single series. So with a series cardinality of 200 million, or a total series count of 200 million for us, it’s going to be 200 million lines back from the show series request. So it’s just a little script to group them by the measurement name. And, in this example here, we can see that at the time a new metric had been added called CallCenter Make Call Request Duration, and I think this is the one that had a GUID in one of the tags. So it was added by a team. They thought, “You know what will be handy? To have the call GUID for the Make a Call Request.” And within sort of a day of being in the product, we had a series cardinality of over a million, because this is going to go up by one, we’re going to get a new series, every single time there’s a new call. So this has been really handy for us to debug where our memory is going in the cluster. But hopefully in 1.3 that should be partially built into the query language.
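The talk's script is PowerShell and is not shown. The following is a Python sketch of the same grouping idea under stated assumptions: the input is an iterable of series keys of the form `measurement,tag=value,...` (the shape returned by a show series query), with backslash-escaped commas or spaces possible in the measurement name. Fetching the keys from the database is left out, and the sample keys are invented.

```python
from collections import Counter


def measurement_of(series_key):
    """Return the measurement name: everything before the first unescaped comma."""
    chars = []
    i = 0
    while i < len(series_key):
        ch = series_key[i]
        if ch == "\\" and i + 1 < len(series_key):
            # Escaped character (e.g. '\,' or '\ '): keep it literally.
            chars.append(series_key[i + 1])
            i += 2
            continue
        if ch == ",":
            break  # first unescaped comma starts the tag set
        chars.append(ch)
        i += 1
    return "".join(chars)


def series_per_measurement(series_keys):
    """Count how many series each measurement contributes."""
    return Counter(measurement_of(key) for key in series_keys)


sample_keys = [
    "cpu,host=web01,region=eu",
    "cpu,host=web02,region=eu",
    "make_call_duration,call_id=7f1c,region=eu",
    "make_call_duration,call_id=9a2e,region=eu",
    "make_call_duration,call_id=c3d0,region=us",
]
counts = series_per_measurement(sample_keys)
worst_offenders = counts.most_common(1)
```

Sorting by count with `most_common` puts measurements with runaway tag values at the top, which is the view that exposed the GUID-tagged metric in the talk.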
Jack Tench 29:21.557 I want to spend a little bit of time just talking about some of the other tools that we use and sort of how InfluxDB compares to those. We use New Relic and that is great for our application performance monitoring. The main reason is its automatic instrumentation of stuff. So as good as Telegraf is with its plugins, things like deep instrumentation into our web server stack, automatically being able to do things like performance profiling, working out how long particular methods are taking, automatically instrumenting every single request, getting things like response times, working out what database queries are run during that request, that is really powerful, and that’s the main thing we use it for. It does have an Insights time series database, which is great for querying that data. But one of the problems is that it’s expensive, and the Insights time series database doesn’t scale anywhere near as cost-effectively as Influx. So considering we have 200 million series, let alone data points, we can’t even keep 200 million data points in Insights. So yeah. We’ve got 200 million individual lines, so it scales a lot differently. And the alerting on that platform is quite limited, although the beta does look pretty promising.
Jack Tench 30:37.862 One of the cool things we’re doing to aid this is, because New Relic is expensive and it’s got a limited retention period, we only have a week there. And that’s pushing the limits quite a lot with the amount of data we’ve got in. And it’s nothing compared to InfluxDB. So what we do is we have a tool which reads the key data that we care about that’s been automatically collected from our web stack and stored in New Relic. We read those time series from the New Relic Insights database and then export the results into InfluxDB so we can keep them much cheaper, for much longer. And it gives us flexible retention periods for the data that we care about. And one of our dev teams at the moment is actually working on solving the automatic, sort of deeper instrumentation in .NET by writing some plugins and wrappers around our .NET code, so that it’s gathering similar metrics to what New Relic does automatically and sending them via StatsD to Influx as well. So maybe one day, we won’t even need New Relic anymore.
Jack Tench 31:47.802 Another monitoring stack we’ve got is Opsview. So basically, this is based on polling clients and running script, check scripts on those clients. It was great at that time. It’s kind of an old architecture, this sort of polling idea. As I’ve mentioned earlier, one of the main things was it’s slow. It doesn’t scale well. It has a really low resolution so we get only one data point a minute. So really, for us, this worked great. It was easy to wire into our existing infrastructure because it was polling at that time when we couldn’t automate infrastructure. We didn’t have to make changes to all of the servers because it kind of pushes out its checks and runs them. So it worked well initially but it scales incredibly poorly. It’s just a quick screenshot there. I went to one of our dashboards. It’s a very simple dashboard, only got a few graphs on and it takes around 10 seconds just to load the page. You can see it’s sort of a few seconds per graph on the dashboard, and that’s one of the faster pages we’ve got there. It’s common to sit around waiting minutes to be able to do anything. And another thing, because it doesn’t support tagging like InfluxDB does, it’s very hard to support a newer, more dynamic infrastructure where things are spinning up and spinning down on demand. You can’t really build a dashboard to show that. They’re hard-coded. Whereas with tagging, it’s really easy to filter to a particular region, automatically include every show the CPU, for every host of this service in this region becomes really easy to do. Whereas, that wasn’t possible. So we’re now pretty much entirely phasing out Opsview at Vonage. We’re looking to replace it entirely with a TICK stack. So that is great because I hated it.
Jack Tench 33:31.880 Another thing we use that just deserves a quick mention is the ELK stack. We mainly use this for our logging and, to be honest, this sort of seems to be combined with Influx for our high-level metrics. Our sort of counting things and graphing things. ELK is really being used for our log searching and it’s great for that. The downside is graphing metrics and logs is a little bit limited. It can get pretty fiddly and it doesn’t really work well with—oops, a bit too far. Doesn’t really work well when looking at trending and historical data. So actually, what I’ll quickly show you later is we’re actually now looking at graphing. So doing some of our log graphing, so actually even moving some of the logging stuff from our ELK stack into InfluxDB because it actually is even better than what you think would be the tool for the job for graphing logs. It actually can be great to do that in Influx.
Jack Tench 34:24.396 So sort of to summarize towards the end, I just go through some of our interesting use cases and, people, feel free to ask more questions about them. I’ve got a database replication delay, one which I talked about earlier. I’ve mentioned archiving New Relic Insights data so that we can sort of pick up the things we care about from a very expensive Time Series solution database into Influx. Where we’ve got more power, more flexibility. You can control retention and it’s cheaper and scales far better [laughter] and has better alerting. We’ve got log profiling. So that’s what I mentioned just now, we’re actually starting to write AWS Lambdas that read out our logs over HTTP from a central store. Use a Google Cloud library or implementation to try and calculate thumbprints of logs, so look for things like stack traces in logs and unique signatures so we can get a thumbprint for a log. This helps us then group by that in Influx, so we can do counts. And we then write that to InfluxDB. I’ll show you an example in that dashboard later. We normalize our data now using Kapacitor, so we’ve got a lot of MySQL metrics coming in. Huge number of MySQL metrics, from the Telegraf plugin. The problem you might find with MySQL metrics is a lot of them are continually incrementing counters. So they just go up. They look like a diagonal line going up and to the right, which isn’t very useful. So you can, when querying InfluxDB, use a derivative function so that you actually get a normal graph that is your rate of select queries rather than the total count of select queries that were ever run. Which is much more useful. The problem is this can get quite painful and hard to work with, particularly when then trying to aggregate them over longer time periods and that kind of thing. So now we use Kapacitor to convert those into more normal metrics rather than continually increasing metrics.
So we pre-calculate that derivative and store that in InfluxDB which makes a lot of our dashboards a lot simpler and easier to use. Our service health checks. So quite a lot of our new sort of micro services are popping up, using a pattern of pinging a heartbeat. Just StatsD which ends up in Influx every second or so when they know they’re healthy. So it makes it a really easy way to see if a—it’s a really nice quick solution. If you’ve already got this infrastructure there to do like health check monitoring around your system, so we can alert if there is no data and perform actions based on that, either in Kapacitor or in Grafana.
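To make the counter-to-rate idea concrete, here is a minimal sketch of the calculation in plain Python. It is not the Kapacitor task Jack describes, just the arithmetic behind it; skipping negative deltas (a counter reset, for example after a server restart) is an assumption about how resets should be handled.

```python
def counter_to_rate(samples):
    """Convert (timestamp_seconds, counter_value) samples, sorted by time,
    into (timestamp_seconds, per_second_rate) points."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        delta = v1 - v0
        if elapsed <= 0 or delta < 0:
            # Out-of-order sample or counter reset: skip instead of
            # reporting a misleading negative rate.
            continue
        rates.append((t1, delta / elapsed))
    return rates


# A counter that climbs, resets at t=30, then climbs again.
samples = [(0, 100), (10, 150), (20, 250), (30, 5), (40, 25)]
print(counter_to_rate(samples))
```

Storing the result as its own series is what lets dashboards aggregate it over long ranges without recomputing a derivative in every query.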
Jack Tench 36:55.419 Other things we do, so we store releases, product versions, and config changes in InfluxDB. So we can then use those as annotations in Grafana. And one example I’ll show you later is where our performance team, every time they do a performance test of a new version of a product, it gets an annotation next to it. So you can see which version it was, which type of performance test was run, and that kind of information. Graphing and alerting on data from AWS ELB logs. So I thought this was quite an interesting one. With AWS you often have the problem where things aren’t very visible to you, the internals aren’t very visible. And we had issues where our ELBs just weren’t behaving right, they weren’t scaling properly. The autoscaling, for instance, wasn’t happening properly. But even then when it seemed like we had enough instances, things still seemed slow through the ELB. And all you get is log files. They’re not massively helpful to view numbers over time. So again, what’s very easy to do is to trigger an AWS Lambda, whenever a new log file is written from the ELB, to look through the log file, pick out all the numbers and interesting metrics, insert them into Influx, and then suddenly we can have dashboards for data that you can’t normally see in AWS. And that really helped us dig down into issues with internal processing times. And tagging those with the scaling events that were happening. It’s really powerful. We’ve also written a Python tool to scrape Windows Perfmon counters. But again, as the community moves on we can probably move that towards Telegraf.
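The log-parsing step could look roughly like the sketch below, which pulls the timing fields out of one access log line. It assumes the Classic Load Balancer access log layout (space-separated fields, with the request and user agent in double quotes); the field positions should be checked against the AWS documentation for the load balancer type in use. The Lambda trigger, the S3 download, and the write to InfluxDB are omitted.

```python
import shlex


def parse_elb_line(line):
    """Extract a few numeric metrics from one Classic ELB access log line."""
    # shlex keeps the quoted request and user agent as single tokens.
    parts = shlex.split(line)
    return {
        "timestamp": parts[0],
        "elb": parts[1],
        # Times are in seconds; a value of -1 means the step did not complete.
        "request_processing_time": float(parts[4]),
        "backend_processing_time": float(parts[5]),
        "response_processing_time": float(parts[6]),
        "elb_status_code": parts[7],
    }


sample = (
    '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -'
)
print(parse_elb_line(sample))
```

The load balancer name and status code are natural tags, and the three processing times are natural fields, which is what makes per-ELB dashboards straightforward once the data is in InfluxDB.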
Jack Tench 38:26.406 Cool thing that we’ve just started doing this week, I think, is trying to use Kapacitor to trigger autoscaling. So we’re not entirely in AWS. And also it can be—some of our services, particularly when dealing with phone calls, you can say, “Oh, I don’t need this instance anymore.” But this instance could be dealing with a telephone call that goes on for a couple of hours. So it takes a very long time to shut a particular instance in a service down. Because you have to wait for that phone call to finish—all the phone calls to finish. So some of the traditional autoscaling things available in AWS aren’t quite powerful enough. But it’s quite cool, we can use Kapacitor to detect when we start to see high ramps in traffic. Because we do deal with very, very spiky traffic. AM in the morning, when all the call centers open in the area, we suddenly get a huge ramp in traffic. So we can really quickly detect when we see a certain ramp on the load on services or an SLA. We can actually fire events from Kapacitor to start managing the autoscaling, which is really cool.
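One simple way to express "a ramp in traffic" is the slope of a least-squares line through the most recent load samples. The sketch below shows that calculation in Python; it is a stand-in for illustration, not the Kapacitor logic used at Vonage, and the threshold and sample window are hypothetical.

```python
def slope(samples):
    """Least-squares slope of (timestamp_seconds, value) samples, in units per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t


def should_scale_up(samples, min_slope):
    # Fire a scale-up event when load is climbing faster than min_slope.
    return slope(samples) > min_slope


# Load climbing by one call per second over two minutes.
ramp = [(0, 10), (60, 70), (120, 130)]
print(slope(ramp), should_scale_up(ramp, min_slope=0.5))
```

Reacting to the slope rather than the absolute level is what allows capacity to be added before the morning spike peaks, which matters when instances are slow to add or drain.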
Jack Tench 39:30.108 Another thing we’ve got is we use VoIPmonitor, which has some infrastructure to answer call quality metrics. Things like packet loss, that kind of stuff. Tries to build up call quality scores. Was there packet loss, was there delays in the audio, that kind of thing. The problem is, it had poor alerting and graphing capabilities. So it’s now becoming our go-to pattern and this is a good summary of it [laughter]. Our go-to pattern is if we’ve got some new data that are numbers, is to get it into InfluxDB. And then it’s now one place where we can pretty powerfully flip on tags and alert on those. So that’s sort of summary of all our use cases and the main thing that we try and do for our programming languages, not for our infrastructure, but for our code is to make the metrics really easy and simple to write and use. Now it becomes really common. Particularly, it’s releasing quicker and quicker. If we’ve got a bug, or some issue, or we’re concerned on how often is this happening? How long is this piece of code taking to run? We write and make sure we’ve got libraries available in all of our languages. This is the example of C# making it really easy to time a certain thing and add some tags onto that measurement. This means it’s really quick just to drop in a couple of, “Oh, how often does this happen?” Drop it in, release to all of our customers the next day. Then you can have some really cool insights to actually how some detailed behavior of your application is working. So I was going to move onto dashboards. Time is getting on. I will quickly look if we’ve got any questions already.
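The C# helper shown on the slide is not reproduced in the transcript, so the following is only a Python sketch of the same idea: a small wrapper that times a block of code and emits the result as a tagged StatsD timing. The `name,tag=value:12.3|ms` tag syntax is an assumption (the InfluxDB-style tagging accepted by Telegraf's StatsD input), and the metric and tag names are hypothetical.

```python
import socket
import time
from contextlib import contextmanager


def format_timing(name, millis, tags):
    """Build a StatsD timing packet with InfluxDB-style tags."""
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{tag_part}:{millis:.3f}|ms"


def send_udp(packet, host="127.0.0.1", port=8125):
    # Fire-and-forget UDP, so instrumentation never blocks the caller.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()


@contextmanager
def timed(name, tags=None, send=send_udp):
    """Time the wrapped block and emit one tagged timing metric."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        send(format_timing(name, elapsed_ms, tags or {}))


# Usage: capture packets in a list instead of sending them over the network.
captured = []
with timed("password_check", {"region": "eu"}, send=captured.append):
    sum(range(1000))
print(captured[0])
```

Making the call site a single `with` line is the point: the cheaper it is to add a measurement, the more likely it gets added the moment a question about behavior comes up.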
Chris Churilo 41:16.109 I think you’re good right now.
Jack Tench 41:17.557 Good. Cool. I couldn’t even find where it was, so thanks. Okay. So I’ll quickly go five minutes just tour you through some of the dashboards I’ve got. Hopefully, without spilling too much confidential data [laughter]. So a good example here, we’ve got a logins dashboard where we’re sort of tracking users logging into our application either through normal username and password or single sign-on. You can see we have these massive spikes in users at 8:00 AM and 9:00 AM in the morning. So you have this really spiky behavior use of the platform. It allows us to get these—really easily get insights into sort of how often are we using single sign-on versus password-based logins? How popular are single sign-on providers? Obviously, in this case, Salesforce massively more popular than Microsoft. Gather some quite cool—and, again, it’s just so easy to add in the code. It’s sort of 10 minutes to do and then we can quickly get really interesting metrics like, you can see in this case, using single sign-on has a much higher success rate than using usernames and passwords. People are bad at remembering usernames and passwords. A cool usage for timing things is somewhere down here. No, it’s broken. That’s a bad example [laughter]. We also had a—let’s see if we can fix it quickly while we’re here. I wonder what it is? Production cloud metrics. No. So what we were doing is timing the password. I know what it is. I need to—redrawing of—changed—here we go. So, in this case, we’re measuring how long it takes to validate a password. So to have secure passwords, you don’t want to be able to calculate the hash too quickly. You want a hashing algorithm that’s slow. You want a good number of iterations for it to be secure, so they can’t be brute forced and cracked. And here you can see we’re tracking the upper, mean, and lower time it takes for us to hash a password, trying to keep it around this sweet spot of between 100 and 200 milliseconds.
If things start getting faster than that, we know we need to make that algorithm more complex, add more iterations to make sure it’s still a secure password. So that’s a cool, real easy way to keep track of—cool use case for timing how long something takes.
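A minimal sketch of the password-hash timing check follows. The transcript does not say which hashing algorithm is used, so PBKDF2 from Python's standard library is an assumption chosen only for illustration; the 100 to 200 millisecond window is the one Jack mentions.

```python
import hashlib
import os
import time


def time_hash_ms(password: bytes, iterations: int) -> float:
    """Measure how long one PBKDF2-HMAC-SHA256 derivation takes, in milliseconds."""
    salt = os.urandom(16)
    start = time.perf_counter()
    hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
    return (time.perf_counter() - start) * 1000.0


def classify(elapsed_ms, low_ms=100.0, high_ms=200.0):
    """Flag hashes that fall outside the target window."""
    if elapsed_ms < low_ms:
        return "too_fast"  # hardware got faster: raise the iteration count
    if elapsed_ms > high_ms:
        return "too_slow"  # logins are being delayed: investigate
    return "ok"


elapsed = time_hash_ms(b"correct horse battery staple", iterations=100_000)
print(f"{elapsed:.1f} ms -> {classify(elapsed)}")
```

Emitting the elapsed time as a metric on every login, rather than benchmarking once, is what shows the drift over time as hardware changes.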
Jack Tench 43:32.649 What else can I show you? So if we switch to our devorg. I mentioned moving logs into InfluxDB. So it’s something we started doing now. So just using our—it’s using the Stackdriver Error Reporting from Google Cloud Platform, the same algorithms they use to generate a fingerprint for a log message. And now we can actually—so this is just a test running on some test data. We can actually really easily graph and trend how often errors are occurring based on the log message. And, actually, it’s nearly impossible to do this in the sort of our traditional normally quite powerful Elasticsearch stack, but it becomes very easy now to be able to trend how often an error is happening over time. I’m trying to think what else I’ve got, so we also got some interesting usages just to see what we can do with alerting. I guess I’ll go back to the login dashboard quickly, we can show you quickly an example of alerting here. So this is where we can set up, for example, a failed logins alert is really easy to do. So it’s just the same as a dashboard. However, in this case, we get the alert tab and we’ve said if there’s above 30 failed logins in five minutes, then we throw a warning. So this means that if I were to go now and make this graph go above the 30 line, our security team will just get a notification email to let us know that there is a chance somebody might be trying to brute force users’ passwords.
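The fingerprinting idea can be illustrated with a much simpler stand-in than the Google algorithm mentioned above: strip the parts of a message that vary (numbers, hex values, UUIDs) and hash what remains. The sketch below is that simplified approach, not the Stackdriver implementation, and the normalization rules are assumptions.

```python
import hashlib
import re
from collections import Counter

_UUID = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
_HEX = re.compile(r"\b0x[0-9a-fA-F]+\b")
_NUM = re.compile(r"\d+")


def normalize(message):
    """Replace variable parts so similar messages collapse to one template."""
    message = _UUID.sub("<uuid>", message)
    message = _HEX.sub("<hex>", message)
    return _NUM.sub("<n>", message)


def fingerprint(message):
    """Short, stable identifier usable as an InfluxDB tag value."""
    return hashlib.sha1(normalize(message).encode("utf-8")).hexdigest()[:12]


logs = [
    "Timeout after 30 ms calling host 10.0.0.5",
    "Timeout after 4500 ms calling host 10.0.0.17",
    "Connection refused",
]
# Count occurrences per fingerprint, as a GROUP BY on the tag would.
print(Counter(fingerprint(m) for m in logs))
```

Once each log line carries a fingerprint tag, trending an error over time becomes an ordinary count grouped by that tag.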
Jack Tench 45:04.139 Another thing I can show you quickly is just doing sort of some—the power when you’ve got dynamic instances and say the power—so as we’re moving to infrastructure where we don’t have a fixed number of instances. They’re dynamic, being spun up and down on demand. You might want to compare behavior between different ones of those. And, again, like I said some of the other tools just don’t really work well with that. But here we can have a dashboard where we can quickly select and unselect different instances. So if I just select two for example, I’d see just the data for those two. If there’s a few more I wanted to compare them to, can easily add more in, compare all the different—a bunch of different services and compare their behaviors, which is really cool to be able to do nice and easily by the power of sort of having these tags. So if I just refresh things. The other thing we’ve got showing here, we’ve got some dynamic intervals which allow sort of the resolutions change. We also have multiple retention policies. So, as I mentioned, one of the things that we use Kapacitor for is to roll-up our data into different resolutions. So this means all of this data we keep for up to two years, just as it’s streaming in through InfluxDB, Kapacitor is grouping that into lower resolution data, storing it into a different retention policy. So two different retention policies at the same time. So that’s pretty cool to be able to do. And I think that’s most of the interesting ones, without showing too much. I should skip ahead right to the end. Are there any questions?
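The roll-up described here amounts to grouping points into fixed time buckets and storing one aggregate per bucket. A minimal Python sketch of that grouping is below; using the mean as the aggregate and a 60-second bucket are assumptions, and writing the result into a second retention policy is not shown.

```python
from collections import defaultdict


def downsample_mean(points, bucket_seconds):
    """Group (timestamp_seconds, value) points into fixed buckets and
    return (bucket_start, mean_value) pairs in time order."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# High-resolution points rolled up into one-minute means.
points = [(0, 1.0), (20, 3.0), (59, 5.0), (60, 10.0), (119, 20.0)]
print(downsample_mean(points, bucket_seconds=60))
```

Keeping the raw series under a short retention policy and the rolled-up series under a long one gives detailed recent data and inexpensive long-term trends side by side.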
Chris Churilo 46:53.273 Wow, that was cool. I could listen to you all day [laughter].
Jack Tench 46:56.042 Yeah. I can go on for longer. I don’t want to run over too much today.
Chris Churilo 47:00.106 But you guys are monitoring pretty much everything in your infrastructure which is just really cool to see. And it’s really great to see you using so many different tools for their intended purposes as well.
Jack Tench 47:13.548 Yeah. So I think that’s definitely a summary of where we’re going, is to try and have everything that you could count, we want to get into Influx. Even if it is coming from a different tool that’s doing data analysis or something like that. It’s porting that also so it’s in sort of the one Time Series database with our one graphing platform, alerting platform. And now we’re starting to dabble more in Kapacitor. We’re finding it is becoming pretty powerful to start things like autoscaling from that data, and this sort of more advanced analysis of it as well as cleaning the data up as it comes in. But the main downside with that is, as I mentioned, we’re not using Chronograf, we’re not using Kapacitor really for alerting, because we’ve found that in the current state, Grafana is pretty powerful for that, particularly around the alerting side. You can also just look through all of your alert lists, see which ones are broken, see which ones are reporting errors. It’s a lot more user-friendly than trying to use Kapacitor for that.
Chris Churilo 48:22.839 Yeah. No. Well, I mean, we love the Grafana tool set and I’m a big believer in you need to use whatever tool makes you the happiest. And so Grafana, alerting, go for it. So we do have one question in the Q&A section and so basically—and I’m going to change it a little bit. So what business impact did not just InfluxData but all the monitoring tools, and all the tool sets that you shared with us today have on the—on your time to market or your revenue, or customer happiness?
Jack Tench 49:03.483 Yeah. So I’m probably not the best person to ask. I’m just a lowly dev [laughter]. So as much as I probably should be caring about those kinds of things. I would say the biggest impact it’s had on those is the time to market really, for us, both around new functionality and bug fixes and things like that. We’re aiming to release more and more often. And two years ago, releasing a new version of the product used to be very manual—it used to require a lot of people sitting in front of monitors, staring at things, trying to work out where the product’s gone wrong. We’d have a bunch of people frantically testing things as we’re releasing. Whereas now, it’s a very hands-off experience and it’s a lot—we can have a lot more confidence in our infrastructure and our product every time we make a change to it. And we can be more confident that we will be told very quickly if something’s going wrong over the sort of—before there used to be a big feeling of a lot of risk involved. We didn’t have the visibility we needed and we didn’t have the response times that we needed for when things go wrong. So it’s allowed us to really make changes quicker and quicker. And as I say, we’ve moved from sort of releasing once every couple of weeks to now aiming for every day for a large product. And for our newer infrastructure, our micro service infrastructure, I think the current record is around eight full global production releases in one day. And that’s based on they have the confidence that they can just automate and release out through the platform. And the monitoring system will tell us if there’s any issues as the migration’s happening to the new version of the product. Rather than us having to sort of [laughter] watch over it all the way through, and sort of be, “Is it okay? Is it okay? Somebody check if this one’s still okay [laughter].” That has been really great for us. Yeah.
Chris Churilo 50:55.758 That’s so impressive, especially given the size of your engineering team. I’ve always been a fan of doing lots of little releases. But I know just coordinating—well, first of all, coordinating that effort. And then second—and giving such a large team the confidence that things are going to break, but you guys are going to be in control of it, is a pretty large feat on your behalf.
Jack Tench 51:21.406 Yeah. So as I say, the other thing is that—yeah. So we’ve come from this massive monolith project and infrastructure that was horrible to maintain and we had all the 100 members of the development team working on the same code base, all the changes being released at the same time. Adding the extra visibility to that has really helped speed up the process of making changes or releasing changes there.
Chris Churilo 51:51.083 Yeah. I personally remember the time, so we would have to spend hours in the conference room going over, okay, here’s going to be our roll-out plan, and here’s the rollback procedures, and [laughter]—and I’m like, “There’s got to be a better way.”
Jack Tench 52:03.045 It’s becoming much more just business as usual. It just happens, and we’ll know before something goes wrong. So, yeah, that’s been great. And it’s also vital for the newer modern sort of micro services as well because as there become more and more hosts, and more and more services and systems to monitor, it becomes impractical to do it manually. So making it really easy to instrument metrics into those. And then alerting on those is great. Yeah.
Chris Churilo 52:33.062 Oh, it’s fantastic. All right. We just have a couple of minutes left. I want to remind everybody, if you have any questions, put them in the chat or the Q&A panel. If you’re a little bit shy, that’s okay. We can also get your questions later on. You can send me an email, [email protected], and I’ll be happy to forward it on to Jack. And, Jack, I really want to appreciate—to tell you thank you for this really in-depth presentation. I learned a lot just by talking to you just a few weeks ago, and I learned even more. And I’m pretty impressed with the amount of monitoring that you guys do, and at the level of monitoring that you guys do of all your infrastructure.
Jack Tench 53:18.578 Yeah. Yeah. We’re really keen to push more. And we’re also sort of as a company becoming more and more keen for open source. So if there’s anything that anyone’s—sort of around our use cases that anyone is interested in, definitely drop me or Chris a line, and I’m sure we can share our tool set with you. And look at even open sourcing it if we haven’t already because I think that’s been the real powerful thing for us as well. InfluxDB being built on a sort of open source stack, with open source tools, it’s really easy to both contribute, or wait for somebody else that does the contributing for you [laughter]. So you gain this cool new functionality, there’s this cool stuff, new stuff we can do sort of every month. It’s great.
Chris Churilo 53:55.856 Yeah. I know. Obviously, we love open source and we agree with you completely. And if you look at our contributions you can see that, by far, our contributions for open source far exceed what we do on the closed source side. And it’s going to continue to always be going down that path. In fact, Paul Dix, our founder, even just recently blogged about how important it is that we as an industry really continue to support open source because that’s really the only way that any kind of real innovation can happen. All right. It looks like everyone else is pretty shy with their questions, but I know what would happen is we’ll get a bunch of questions later on. So once again, Jack, thank you so much.
Jack Tench 54:44.182 No worries.
Chris Churilo 54:44.081 And we will probably speak again. And if anybody on the call has any other ideas of things that you would like to hear in any of these webinars, any kind of specific trainings that you want us to go over, please feel free to shoot me a line. In today’s webinar, Jack did spend some time in Grafana, and I’m actually working on a new webinar with Grafana, we did one in January that was just about basically how to use both our products together. And we’re working on a more advanced version of that training, and we’ll be announcing that pretty soon. So once again, Jack, thank you so much, and I hope you have a good evening.
Jack Tench 55:31.330 No worries. Thanks.
Chris Churilo 55:33.590 Bye-bye everybody.