In this webinar, Tim Koopmans, CTO and Founder at Flood IO will be sharing how their On Demand testing service uses InfluxData to provide the insights into their customers’ performance tests. In addition, he will share how they use Kapacitor to help them automatically spin up test environments for their customers and provide them with a real-time view of the test that they run.
Here is an unedited transcript of the webinar “How Flood IO Relies on InfluxData for Performance Insights.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Tim Koopmans: CTO and Founder, Flood IO
Tim Koopmans 00:00:00.457 Cheers. Thanks, Chris, and thanks for inviting us. Hi, everyone. Yeah. As Chris mentioned today, I’m going to be talking about how Influx is used at Flood IO. A couple of personal details up there on LinkedIn if you want to check out my profile or get in touch after, and you can find us at https://flood.io if you want to try it out at your own time. So I guess when I was putting this presentation together, I want to go through the following points. I want to talk about what Flood actually is and what we measure because that will become important as we talk about how that relates to Influx. I want to share with you some of the early infrastructure designs that we tried before we were using Influx because I think that paints a picture of what wasn’t working so well for us, and it gives you some of the, I guess, reasons that we changed our design goals around the infrastructure that we use. I’ll also talk about some of the design considerations that we use around managing InfluxData and some of the problems and challenges that we’ve had along the way, and I’ll probably finish with some of the points that I think we can do better in the future with.
Tim Koopmans 00:01:11.715 Now, let me get started. So what is Flood IO? It’s basically a distributed load testing platform. So what we do is we take open-source tools like JMeter, Gatling, and Selenium, and we let users run those tests on multiple grid nodes or what we call grid nodes. So effectively, those grid nodes are what you’d refer to as load generators, and they run across many geographic regions. So we have customers starting hundreds of these load generators and producing results in real time, and so the key thing is what we’re trying to do with Influx is obviously collect and aggregate all of that performance data that the tests are producing in real time, store all that in InfluxData, and we use Influx cloud along the way to help us manage all that. So I think I’ll try and do a quick demo because it’s always good to sort of see the product. When it’s been too long, this is what it looks like on screen. Let me see if I can just switch to my browser. Let’s go.
Tim Koopmans 00:02:20.825 Okay, so when you come into Flood—I’ve got stuff going everywhere. Here we go. All right. So when you come into Flood, there are basically two concepts that you need to understand. So there’s the floods, which are the load tests that I was talking about, and then there are the grids, which is the infrastructure that you run your tests on. So you can go into your Grids page, and you can select whatever region that you want to run grids in. So we support 14 regions, which is all the Amazon regions and basically, we make it pretty simple for customers. So if I want to start, I don’t know, 10 nodes in US West Oregon, I want to run them for a couple of hours, and then that’s pretty much it. So we can launch that grid, and that’s just basically going to start up all the infrastructure that we need for the load test. And so what that means is you can have, like I said, hundreds of nodes distributed across multiple grids in different regions, whatever you sort of fancy or whatever your test requirements are.
Tim Koopmans 00:03:18.621 So I’ve got this 16-node grid running here. It’s been running for half an hour and it’s going to stop in a few hours, and we can find generic information about how it’s running, CPU, memory, performance, all that sort of stuff so that we know that we’re not kind of smashing the grid nodes into submission. And then it gets us to the other part of Flood, which is actually running tests. So this test is actually finished, but it gives you an idea of what sort of data that we collect. So some of the measurements that are common to performance testing are things like concurrency or the number of active users, response time in milliseconds. We’re interested in transaction rates. We’re interested in network throughput, latency, passed transactions, failed transactions, that sort of stuff. So that gives you an idea of the kind of measurements that we’re going to store in Influx, and we report all of this sort of average data via fancy-looking charts. I guess we report tabular data, and we also give customers a bit more insight into each individual transaction as it’s running in the test. So we give some descriptive stats around response timings, and we also collect all of the error and status codes that might be present for that particular label. So in this particular transaction called individual ranking, we’ve got a pretty high percentage of failures, and what that does is it lets customers sort of zoom in and have a look at what was going on in the test at that point in time. So this is like an HTTP 500. We collect the trace information, and it basically gives people sort of actionable data that they can then go and solve performance problems on the application under test. So that’s Flood in a nutshell. Let’s get back to here. And I should check. I don’t know how to run everything in the same screen. There’s no questions, I hope—hopefully. Maybe, Chris, you can let me know if something pops up.
Chris Churilo 00:05:20.832 Absolutely.
Tim Koopmans 00:05:21.812 Yeah. So like I said, I wanted to share what our early design was, sort of a naive design. This is going back around five years ago, and what we effectively had—actually, ignore the Docker containers. We didn’t have Docker back then, but we had essentially a bunch of application servers. We had database servers, Memcached, Redis. We had a bunch of workers running asynchronous code, and what we would do is—well, we didn’t have Influx at that time, so we were using Elasticsearch in its pre-v.1 sort of days. And basically, each node would operate in a cluster of up to 30 nodes and, basically, each node would be running Elasticsearch, and that would operate in this kind of Elasticsearch cluster using all this sort of clustering mechanisms from Elasticsearch itself. And we would pull the information from those grids via Elastic Load Balancing that sort of sat in front so that we didn’t favor any one particular node in the grid. And that kind of worked well, but it gave us some problems, which I’ll go into in a bit more detail. But effectively what was happening was we’d be pulling that data from the Elasticsearch clusters in real time and presenting that to the user. But then, obviously, when the test was finished or the grid had been destroyed, we also needed to be able to retrieve that test data. So what we had was this horrible sort of bunch of code that would be serializing data back and forth from Elasticsearch and pushing it up onto S3. And so we had these different storage mechanisms for hot and cold data, and it was a bit of a nightmare to sort of maintain.
Tim Koopmans 00:07:05.608 Some of the issues that we had earlier on were that it was difficult to scale horizontally because we had a bunch of these sort of cluster dependencies that would impact things like grid node startup time. It would take anywhere from say 5 minutes up to 20 minutes to start a 30-node grid, which was pretty much unacceptable for customers if they were trying to launch big tests, and they had this huge sort of delay in getting started. All of our code was based on this old sort of Ajax polling and then pulling information from Elasticsearch, so the data wouldn’t be really visible until some significant delays at some points. We also had this sort of the hot versus cold storage of time series data, some coming from Elasticsearch and then being stored in S3, and I guess the tools were kind of being abused a little bit in so far as their intended use. We were using Elasticsearch, which is pretty much a full-text search engine, but we were using it to store time series data. But that said, there were some pretty cool faceted queries that we could run around descriptive stats on that data. So it served its purpose well, but we were pretty keen to move on from that design.
Tim Koopmans 00:08:26.996 So when we sort of thought about, “Well, how are we going to re-architect the infrastructure,” we came up with some design goals, and those were sort of based around the following sort of things. We thought, “Well, how would you design a botnet,” because that’s effectively what we were. So we thought we needed to basically have a distributed design, and we needed the grid nodes to be loosely coupled. And more importantly, we needed sort of like a shared-nothing architecture to run our load-test platform. And so these are just definitions from Wikipedia. I won’t read them all out, but basically, the distributed system is what we had. We needed to solve the problem of how we would pass messages between grid nodes, and to that end, we ended up using a lot of Amazon services, so SQS and SNS in kind of a pub/sub fashion. So grid nodes could now start up by themselves and just receive messages via a topic and sort of routing rules, and so that sort of solved the problem of how do we tell or instruct grid nodes what to do in terms of starting and stopping tests. Loosely coupled—basically, we wanted to make sure that the grid nodes themselves had little to no knowledge of other grid nodes in their region, so that’s achieved today. So basically, grids really are just a logical collection of nodes, but each node really has no knowledge of any other node that’s running in the same grid. And shared-nothing, of course, is where we want to get rid of all of those dependencies, not so much memory-induced but the sort of clustered dependencies that I was talking about earlier, which impacted everything from how a grid started up and even affected the viability of the grids themselves where they might—the 29th node wouldn’t start after 15 minutes and then the user would have to start again.
Tim Koopmans 00:10:23.789 So yeah, we spiked Influx. I can’t remember now. I think it was about 2014 we were looking at some early versions of Influx, and we were really happy with the throughput that we were getting and the amount of disk space that was used and, obviously, the memory overheads was quite low. So we were pretty surprised about how well Influx performed for it to meet our requirements. And yeah, we kind of followed that through and didn’t really get on board until we kind of needed—I can’t remember the release number for Influx, but there was a point in time where the clustering support became commercial. At that point in time, we didn’t really want to bother kind of maintaining our own Influx cluster or making it HA. So we were really, when the InfluxDB Cloud solution came out—it was just perfect for our requirements, and we’ve been on board ever since then.
Tim Koopmans 00:11:26.763 And so the current design looks like this. So we have a similar design as before. We have application servers sitting in a central location and worker servers, but this time, instead of the workers pulling information from the grid nodes themselves, the grid nodes are actually pushing information up to our little proxy servers, which we call Drain because we like sort of water analogies, being Flood. And yeah, so what that means is we have as many grid nodes as we like operating. We have customers at probably—I think our biggest customer is launching in excess of 900 nodes in any one load test. So we have pretty much all these nodes just finding home, talking to these light services. So Drain is just a little service that we’ve written in Go, and then Drain can basically queue up messages or points to be written via a Redis server. And then we can sort of process that queue asynchronously at our leisure so without delay, basically, and write all the points into InfluxDB Cloud.
Tim Koopmans 00:12:26.593 Yeah, when we were sort of designing or coming up with this design, we were a bit unsure on how we were going to sort of collect and aggregate the results on the nodes themselves. But after talking to the Influx team, they suggested that we use Kapacitor because we didn’t realize Kapacitor could pretty much talk to InfluxData, basically. So that was a really cool solution for us. So every one of our grid nodes runs a local instance of Kapacitor, and that’s basically how we get their performance data from the tools into our system. So the quick wins that we got from that design was that there was obviously no more clustering on the grid nodes. So what that meant is that we could just infinitely scale horizontally, I guess. We saw really good improvements in startup time because, basically, whatever performance we get now from a single node is going to be representative of many nodes in a grid. So yeah, I think average boot time on grid nodes is about three and a half minutes, which is great. It’s not a big wait if you’re starting like 60 or 900 nodes kind of thing. So that’s a pretty good win for us. The other cool thing is that all of the data is kind of pushed up essentially, so it’s just basically available as it arrives, so we can build our dashboards around polling for that information. But we don’t have to get the whole data set. We can just basically put a bunch of rules on how we deliver the data to the browser. And of course, now we have our time series data sitting in a time series database, which kind of makes sense. And it’s good to kind of get rid of Elasticsearch in this particular case.
Tim Koopmans 00:14:10.160 So yeah, some of the design considerations when coming up with our, I guess, schema design and how we’re going to use Influx, early on we were sort of trawling the Internet trying to find good examples on naming conventions and how to design the schema. And I guess a lot of the terminology was kind of foreign to us because we weren’t really familiar with Influx and the way time series databases worked. So we blindly followed some advice out there, but I guess one thing that we first read was that series are cheap, but we quickly found out that was pretty expensive advice to take. And I think really, to be honest, over time, the manual that’s online with Influx is really worth a read, and so we rely heavily on just reading the documentation these days in terms of schema design and issues around cardinality and stuff like that. So I would encourage that for everyone else.
Tim Koopmans 00:15:08.209 So some of the things that we have implemented from that guidance is we really want to thin out all the measurements that we stored. So I think we’ve got around 12 measurements at the moment, so it’s pretty thin. Some of the performance metrics that I was talking about, we indicate the aggregation method in the title of the measurement, and you can sort of see on screen there that we’re collecting stuff around failed, passed, response codes, or response times, throughput traces, that sort of thing. More so for cardinality and I guess impact on memory, we really wanted to thin out the tags. We were probably being overzealous at first, and we didn’t really understand the full impact high cardinality tags had on memory. So after a few iterations, we sort of got it back to the following where we could—these are the sort of things that we want to tag by, so we use the account ID, the Flood ID, the grid ID, a label ID, and the project, and the region. That’s sort of the ways that we want to call up the data for our customers. And we also pushed values which had high cardinality like the actual label. The text-based label is basically just a value, so it’s not impacting our performance.
Tim Koopmans 00:16:29.013 So yeah, look, I don’t know if this is the right way, but one of the things that we’re trying to think about when—if you can imagine every customer has their own load test and their own definition of labels for each transaction, and so if we were to have label as a tag, it would have such huge cardinality because pretty much every test is going to generate a unique label of sorts. So what we ended up doing was storing a sequential ID called label ID, which is a—so basically, as the results are written, one of the first things that we do is write the label to our database or to our relational database, that is, and generate a sequential ID for that label for that test. So that really reduced the amount or the uniqueness of the data that we need to store into label ID. And in general, all of the other ID values except for account are all sequential ID. So what we were trying to do there is we didn’t want to have IDs that went up into the hundreds of thousands because, once again, that would also impact performance.
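The label ID approach described above can be sketched in a few lines of Python. This is a minimal illustration, not Flood's actual code: the `LabelRegistry` class, the in-memory dict standing in for the relational database, the measurement name `response_time_mean`, and the example tag values are all assumptions made for the example. The free-text label is written as a field value while the small sequential ID is written as a tag, which is the split the talk describes.

```python
from collections import defaultdict


class LabelRegistry:
    """Hands out small sequential IDs for free-text labels, per flood.

    A plain dict stands in for the relational database mentioned in the talk.
    """

    def __init__(self):
        self._ids = defaultdict(dict)  # flood_id -> {label text: label_id}

    def label_id(self, flood_id, label):
        labels = self._ids[flood_id]
        if label not in labels:
            labels[label] = len(labels) + 1  # next sequential ID for this flood
        return labels[label]


def escape_tag(value):
    """Escape commas, equals signs, and spaces in a tag key or value."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value):
    """Render a field value in InfluxDB line protocol syntax."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line protocol point; tags are sorted by key."""
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


registry = LabelRegistry()
lid = registry.label_id(flood_id=42, label="individual ranking")
line = to_line(
    "response_time_mean",  # hypothetical measurement name
    {"account": "a1", "flood": 42, "grid": 7, "label_id": lid,
     "project": 3, "region": "us-west-2"},
    {"value": 123.5, "label": "individual ranking"},  # label text is a field
    1500000000000000000,
)
print(line)
```

Because the label text is a field and not a tag, a test with thousands of distinct labels adds series only through `label_id`, and those IDs restart at 1 for each flood.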
Tim Koopmans 00:17:39.825 I guess we don’t always get the design right and, to be honest, when I was preparing this or starting to prepare this presentation, I was checking out—I was interested to see what’s our current cardinality or what’s the current number of unique series that we have, and I was a bit surprised to find that we jumped up to about 5 or 6 million in terms of unique series given that we have only got 12 measurements. And I had to look at what was going on, and then I realized that this one particular customer was doing a really large test that had over 20,000 unique labels. So even though we were doing sort of the label ID hack where we had a sequential ID, I think some of the tests got up to about 50,000. So once I identified that, I had a quick look at our metrics, and you can see here in the top graph the series count went from something like 2.8 million to just under 5 million in a couple of days. So that was a bit of a worry because we also saw our memory utilization go from like 5 or 6 GB up to about 12 GB. So these sort of things can happen, and I guess the people listening to this presentation are probably familiar with that. But look, it happens to everyone, and it’s something that you need to keep an eye on.
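The jump in series count follows from simple multiplication, which a short Python sketch can make concrete. The function names are hypothetical, and the upper bound assumes every tag value combination actually occurs in every measurement, which is a worst case and not something the talk states.

```python
from math import prod


def series_upper_bound(measurements, tag_value_counts):
    """Worst-case series count: measurements times every tag combination."""
    return measurements * prod(tag_value_counts.values())


def count_series(points):
    """Count distinct series in (measurement, tags) pairs.

    A series is identified by its measurement name plus its full tag set.
    """
    return len({(m, tuple(sorted(tags.items()))) for m, tags in points})


# One test with 20,000 labels, each label appearing in all 12 measurements.
one_big_test = series_upper_bound(12, {"label_id": 20_000})
print(one_big_test)
```

Under that assumption, a single test with 20,000 labels can account for up to 240,000 series on its own, so a handful of such tests is enough to move the total by millions.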
Tim Koopmans 00:19:07.975 So I want to talk a bit more also around in terms of the design considerations for Influx about how you would push data back up to our central time series store. So what we are doing is using Kapacitor basically at the Flood nodes. So you can see here this is just one of the, I guess, definitions for one of the measurements that we’re collecting. So we’ve got a response time stream here. And basically, what is happening here is that the load testing tool has a plug-in which is writing via UDP to Kapacitor, and it’s providing a bunch—in that message, it’s using whatever the line protocol format is. It’s passing a whole bunch of things that can help identify what that Flood is. So like I said before, I’ve got account, Flood, project, grid, region, and label, and we’ve also got node because a grid can have more than one node. And so we’re basically doing an aggregation every 15 seconds, and we’re basically flushing that out to Influx. So this is where, I guess, what we thought we would need to do.
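The 15-second aggregation step can be illustrated with a small Python sketch. In the system described, Kapacitor performs this work on each grid node; the code below is not Kapacitor's API and only shows the idea of grouping raw points into fixed 15-second windows per tag set and emitting one mean per window. The function name and the tuple layout of a point are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

NS_PER_SECOND = 1_000_000_000
WINDOW_NS = 15 * NS_PER_SECOND  # 15-second windows, as in the talk


def aggregate(points, window_ns=WINDOW_NS):
    """Group points into fixed windows and return the mean of each group.

    points: iterable of (timestamp_ns, tags_dict, value).
    Returns {(window_start_ns, sorted_tag_tuple): mean_value}.
    """
    buckets = defaultdict(list)
    for timestamp_ns, tags, value in points:
        window_start = timestamp_ns - timestamp_ns % window_ns
        buckets[(window_start, tuple(sorted(tags.items())))].append(value)
    return {key: mean(values) for key, values in buckets.items()}


tags = {"flood": 42, "label": "individual ranking", "node": 1}
raw_points = [
    (1 * NS_PER_SECOND, tags, 100),
    (14 * NS_PER_SECOND, tags, 200),
    (16 * NS_PER_SECOND, tags, 400),  # falls into the next window
]
aggregated = aggregate(raw_points)
```

Each entry in `aggregated` corresponds to one point that would be flushed upstream, so the write rate depends on the number of windows and tag sets, not on the number of raw samples.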
Tim Koopmans 00:20:18.476 So at first we thought, “That’s cool. We’re just going to get the results to write into Kapacitor, and then 900 Kapacitor nodes,” if you like, “are going to write straight to InfluxDB Cloud.” But what we found is because we were introducing additional tags at the node level, like especially the node ID itself, we weren’t really interested in, I guess, storing that tag centrally in our main installation. And so, in other words, we basically had this sort of excess data in the payload that we want to filter out, and we also wanted to sort of avoid all of these 900 different IPs kind of writing back to one location or—and to be honest, we didn’t really try that out because the sort of data manipulation requirement came up before we were thinking about how’s that going to perform. So what we ended up doing is writing that thing that I call Drain, which is basically a little ghost service that just looks and sounds like Influx server, but basically all it’s doing is receiving the distributed writes from the Kapacitor nodes and responding with the header that it expects. I think, yeah, it’s a 204 with a specific string. And then, yeah, Drain is basically writing those data points that it’s just received from the Kapacitor node into Redis. And then, like I said, we can process all of those writes off the Redis queue via our asynchronous workers, and we can scale those workers although we don’t need to go beyond three nodes at the moment. But basically, we can auto-scale those nodes out if the queue depth starts to increase.
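A service in the spirit of Drain can be sketched with Python's standard library. This is only an illustration under stated assumptions: the real Drain is written in Go and queues to Redis, whereas here an in-process `queue.Queue` stands in for Redis, and the names `DrainHandler`, `start_drain`, and the database name `flood` are hypothetical. The sketch accepts a POST to `/write`, as InfluxDB's 1.x HTTP write endpoint does, queues each line, and answers 204. The specific response header string mentioned in the talk is not reproduced because the talk does not say what it is.

```python
import http.client
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

write_queue = queue.Queue()  # stand-in for the Redis queue


class DrainHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if not self.path.startswith("/write"):
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        for line in body.splitlines():
            if line.strip():
                write_queue.put(line)  # workers would drain this later
        self.send_response(204)  # same status InfluxDB returns on a successful write
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_drain(port=0):
    """Start the fake write endpoint on localhost; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), DrainHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Simulate one node flushing a point to the fake endpoint.
server = start_drain()
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request(
    "POST",
    "/write?db=flood",
    body=b"response_time_mean,flood=42 value=123.5 1500000000000000000\n",
)
response = conn.getresponse()
status = response.status
response.read()
conn.close()
server.shutdown()
server.server_close()
```

Decoupling the acknowledgement from the real database write is the point of the design: the sender gets its 204 immediately, and any slowness upstream shows up as queue depth that workers can absorb.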
Tim Koopmans 00:22:04.667 And so some of the other things that—the data manipulation that I was talking about—for example, if you’ve got 900 nodes, the chances of nodes writing data points at the same point in time means that only one point can be stored in the Influx database because that’s just the way it works. It really depends on the resolution, but—so what we ended up doing is we were interested in storing all of the nodes data. So concurrency is a good example because if you’ve got 900 nodes, you want to make sure that all 900 are phoning back in the right dialer. So what we would do is we would hack the nanosecond portion of the time stamp to be this machine-independent HMAC of the node and the label. So we’d use that to basically overwrite the time stamp and then therefore sort of guaranteeing that every point is going to get written back, essentially. And we can do other things like filter out stuff that we don’t want to write or—it’s a bunch of little sort of similar sort of data hacks that we do to the data as it’s coming inbound. And I don’t think I’ve shown it here, but that’s where we start to filter out what tags we want to write. It’s also where we do the stuff like the label ID conversion—sorry, I should say the conversion from the label into an actual label ID, which is a sequential ID that we store in Influx. So that’s basically the way we’re processing all the numerical data.
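The timestamp trick can be sketched as follows. In InfluxDB, two points in the same series with the same timestamp are treated as one point, so once the node tag is dropped, points from different nodes for the same second would overwrite each other. The sketch below replaces the sub-second part of the timestamp with an offset derived from an HMAC of the node and label. The key, the choice of SHA-256, and the function name are assumptions; the talk only says that an HMAC of the node and label is used.

```python
import hashlib
import hmac

NS_PER_SECOND = 1_000_000_000
SECRET = b"hypothetical-shared-key"  # assumption: any key shared by all workers


def disambiguate(timestamp_ns, node, label, key=SECRET):
    """Replace the sub-second part of a timestamp with an HMAC-derived offset.

    The result depends only on the inputs, so any worker on any machine
    computes the same timestamp for the same node and label.
    """
    message = f"{node}:{label}".encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    offset = int.from_bytes(digest[:8], "big") % NS_PER_SECOND
    whole_second = timestamp_ns - timestamp_ns % NS_PER_SECOND
    return whole_second + offset


base = 1_500_000_000 * NS_PER_SECOND
stamps = {disambiguate(base, node, "individual ranking") for node in range(50)}
```

Collisions are still possible in principle, since the offset has only a billion possible values, but they become very unlikely for a few hundred nodes. The cost is that the stored timestamps are only accurate to the second.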
Tim Koopmans 00:23:33.434 We also process strings. I showed you before the traces that we store like the request and response headers and the response body for errors, so we’re actually doing all of that via Kapacitor as well. And basically, yeah, we’re using the alert function in Kapacitor, which is pretty cool. We set it to always true, and it’s just executing a local binary on the grid nodes themselves. And then, basically, what happens from there is we can sort of collect that information, push it up to S3, and write a new point which points to where it lives on S3 with a unique ID. So yeah, we have a bunch of little sort of hacky things like that which work really well. So we’re probably not going to change some of this stuff for some time because it just seems to work really well these days. So how am I going for time, Chris?
Chris Churilo 00:24:29.556 You’re perfect.
Tim Koopmans 00:24:30.708 Cool. All right. So with that sort of design, this is what we’re processing. On a day-to-day basis, we do around 100 requests per second, maintain probably less than 20 milliseconds latency on t2.mediums which are pretty small. That’s the Drain ghost server that I was talking about. I think InfluxDB Cloud is processing around 600 write points per second. Peak, it’s probably less than that, and it’s really dependent on, obviously, how many grid nodes are running at any point in time. Series cardinality is less than 2 million, and that’s basically using about 25% of the available RAM at the moment. So we’re keen to sort of try and keep the cardinality low, and any delay in processing is basically absorbed via our workers in the Redis queue. So we can scale out of things if the queue depth increases or whatever. So we’ve got a bit of that built into there as well.
Tim Koopmans 00:25:29.018 I think what could be done better is if I was talking about the earlier example, if I hadn’t been looking at the memory charts and Influx fell over, it might have been too late and that would have been a bit of a disaster for us because, obviously, customers depend on the time series data, and it would also probably leave us no head room to actually do things like computationally intensive stuff like dropping series and stuff like that. So I think the ability to sort of be able to alarm and maybe monitor the metrics ourselves or at least callout to an escalation procedure is needed in the cloud solution. We really want to explore some different aggregation methods and especially with finer resolution for different customers. So 15 seconds is pretty good for the norm, but we do have customers that need higher precision in terms of aggregation. On that note, it’s really unclear for us at the moment on how we’re going to structure and scale our schema and queries to sort of support multiple aggregation methods, percentiles being a good example because, obviously, there’s some bad practice there if you’re trying to do percentiles on averages, that kind of thing. So we need to look at the way other customers are doing that, in particular sort of descriptive statistics. And I guess another thing that we’re not currently doing is enforcing data retention, although at 25% utilization, we’re not in a rush to, but definitely, that’s something that we’ll need to consider in the future if we start to expand the amount of measurements and the amount of series that we’re sort of collecting. And I think that’s all I’ve got on my presentation. So I’m happy to take questions. I’ll quickly check the Events Center.
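The warning about percentiles on averages can be shown with a small worked example in Python. The sample response times and the nearest-rank percentile helper are made up for illustration. Each inner list represents the raw response times inside one aggregation window, and the sketch compares the 95th percentile of the raw data with the 95th percentile of the per-window means.

```python
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


# Raw response times (ms) in four aggregation windows; one request is very slow.
windows = [
    [100, 110, 120, 130, 2000],
    [100, 105, 110, 115, 120],
    [95, 100, 105, 110, 115],
    [100, 100, 100, 100, 100],
]

raw = [value for window in windows for value in window]
true_p95 = percentile(raw, 95)                               # 130
p95_of_means = percentile([mean(w) for w in windows], 95)    # 492
```

The two answers, 130 ms and 492 ms, describe different things: the first is a property of individual requests, the second of window averages. Once only the averages are stored, the first can no longer be recovered, which is why supporting percentiles affects what is collected and not only how it is queried.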
Chris Churilo 00:27:25.472 Okay. So I’m just going to read the questions out loud for you so that everyone can hear them, hear the question, and also your answer. So how did you first discover InfluxData and—?
Tim Koopmans 00:27:37.968 Yeah, a good question. I think it was Hacker News [laughter]. So like all startups finding information, we were interested in—I don’t even think we were Googling around time series as a concept. We were probably looking at sort of NoSQL kind of databases maybe as an alternative. But yeah, I’m pretty sure it was Hacker News that we first sort of came across Influx as a solution. And then from there, we were pretty quick to try it out.
Chris Churilo 00:28:07.920 And then the second part to the question is, what would have happened if you maintained your old architecture and not adopted InfluxData?
Tim Koopmans 00:28:17.165 Yeah. We just wouldn’t be able to—we wouldn’t have been able to sort of scale to the capacity or the requirements of our customers that we have today. So running 900 nodes across little 30-node grid clusters with Elasticsearch would have been extremely painful. We probably just would have lost business, I guess. I think as we brought on new members to the team and people writing code, the old code base became sort of untenable because it was so complex. It had a lot of working parts. We still have a lot of working parts, but I guess the intention is a lot easier to communicate now with new developers as they come onto the team.
Chris Churilo 00:29:02.597 So as a SaaS vendor, how has this product helped you as far as giving you a differentiated competitive edge?
Tim Koopmans 00:29:16.459 Yeah. A lot of our competitors are building their latest platforms on traditional sort of master-slave type where you have a controller and you have a bunch of slaves, and that really limits the amount that you can scale out. And so I don’t want to name competitors, but we’ve had plenty of customers come across and just be really impressed with the sort of scale that we can go out to because—at early days, it kind of seemed like premature optimization to think about these things, but we always sort of had—we never wanted the grid nodes to be a bottleneck themselves because, obviously, when you’re doing load testing, you want to trust the infrastructure that you’re testing on. So we needed to find a design that worked well and also components that kind of work well together. And so yeah, Influx and Kapacitor for us was just a really good mix.
Chris Churilo 00:30:14.717 So just listening to your presentation, it seems like you were able to get it up and running really quickly. Maybe you can articulate a little bit more about how long it actually took you to get Influx built into your system.
Tim Koopmans 00:30:31.899 Yeah. I honestly can’t remember how long we spiked it for because we weren’t really religious on our dev cycles. We were a really small team. At that point, we were three people. But I do remember sort of having Influx floating in the background. So I’d set up Influx. When I said I spiked it, I had actually set it up, and for a while there, we were just sort of actually doing two things. I think the hard part for—that the slow part for us was figuring out what we had to change in the plug-ins to the open-source load testing tools and, obviously, discovering Kapacitor as a solution because, at first, we thought we were going to have to do all this sort of—I can’t remember the design, but we were pretty convinced that we were just going to have to write [inaudible] into InfluxData. And we thought we would have, I guess, a design where we would have sort of edge service, if you like, and then sort of relaying back to a central kind of thing. And it wasn’t until I met the team in San Francisco for a different reason that they suggested, “Hey, you should just use Kapacitor.” And then, from that point, it just really sped up. I’d say in less than a month we were also rolling it out into production.
Chris Churilo 00:31:58.357 I mean, do you think it’s a flaw in our documentation or—I mean, if you hadn’t met with the team, you wouldn’t have thought about using Kapacitor. What could we do better to make clear the capabilities of Kapacitor?
Tim Koopmans 00:32:18.691 Yeah. I’m not really sure what the answer to that is. As a startup, you’re kind of swamped by options and you try not to just pick the next tool in the stack. So sometimes you do need a little bit of expert advice to nudge you in the right direction. Yeah, look, I remember I just sort of dismissed everything else in the TICK stack because, I guess, from the glossies, I was thinking, “Oh, yeah, this is just like a monitoring and alerting thing,” and I didn’t really get into the detail. But it wasn’t until I sort of explained my requirements and said, “Hey, look, we’re trying to do this weird thing where we have all of these distributed writers, and they’re going to write back to a central location.” And then once I sort of could articulate the requirements like that, the Influx team were like, “Oh, yeah, you should just use Kapacitor for that. That already supports basic and write-out to Influx.” So yeah, I don’t know, sometimes you do need that expert advice, but you need to have—I don’t know. It can be really hard, I guess, because you don’t—it’s hard to Google this stuff. You put a query and say, how do I do distributed writes with Influx, and you’re pretty much not going to get the answer that you’re looking for, that kind of thing. So yeah, it can be difficult.
Chris Churilo 00:33:41.704 Yeah. I’ve been trying to put a little bit more emphasis on Kapacitor recently, trying to go into a lot more advanced topics, also trying to articulate different ways that people are using it because I do feel like it is not getting the attention it probably deserves.
Tim Koopmans 00:34:00.588 Yeah, yeah. And I think the DSL and the language, the stuff that you can actually do with it, I think the more examples, the more use cases you have, the more I think people can sort of see and go, “Oh, that would be really useful in my situation.” So more examples is always good, I think.
Chris Churilo 00:34:21.636 Well, that’s why we’re doing these webinars, trying to get more examples out there.
Tim Koopmans 00:34:26.355 Yeah, exactly.
Chris Churilo 00:34:27.328 So what advice would you give to a SaaS vendor out there that hasn’t even implemented InfluxData or any kind of monitoring outside of some of the old kind of standard monitoring systems?
Tim Koopmans 00:34:44.737 I mean, if you’re SaaS, you’ve got the luxury of just being able to choose your own way. Oh, sorry, I’m suggesting that all SaaS vendors are small startups. But yeah, if you’ve got the flexibility to try out tools and your dev life cycle sort of supports it, then it’s pretty easy to sort of spike stuff these days. Docker makes a lot of that easy. There’s lots of pre-built sort of images or pre-built containers that you can run. So yeah, it’s just a matter of trying. But yeah, I’m not too sure. It’s still a bit of information overload about which tool you can use. And I guess the more clearly you can sort of articulate your requirements, and discuss it with your peers, and go to meetups, and all that sort of thing, the more likely you are to find the right solution, but yeah.
Chris Churilo 00:35:39.700 Okay. Cool. Oh, we have another question that came in. Let’s see. So did InfluxData help focus on application building rather than on adjusting infrastructure?
Tim Koopmans 00:35:52.161 Yeah, absolutely. I mean, this is our current UI, and it was basically all rebuilt using mJS. And so it was just great. By the time we were sort of redesigning our UI, we had a really clear API into Influx. There was no work at all in terms of how we would consume the data. There’s plenty of options to sort of get that data efficiently in real time. So yeah, that stuff really improves the sort of life cycle in terms of development. And I guess, yeah, Influx where it is now is kind of feature-complete in my view. It keeps improving. Every release, things get faster, and there’s some stuff around like high cardinality series changes that I guess are coming, which would be beneficial. Well, in our experience, Influx is production-ready. We’ve been using it for quite some time now. We’ve had no issues in terms of outages. So yeah, it’s a tool that you can use straight away.
Chris Churilo 00:37:01.819 Yeah, I think we can attribute that to the community support that we have behind it. And I think it’s just another reason why open-source tools are just so important because I think by the time we become what we consider as a company production-ready, we’ve actually had so many people using it, and so much feedback, and so much input that it’s miles ahead of any kind of closed-source offering.
Tim Koopmans 00:37:28.333 Yeah, exactly right. And often, when you get sort of more familiar with the product and because it’s got the open-source component, you can actually go to GitHub and start tracking the issues yourself and seeing what other people are saying. So yeah, it’s a pretty good source of information, and you get an inkling into what’s happening in the dev cycle for the product as well, like what’s coming and what people are thinking about. That’s just not present in other platforms that are more closed source, I guess.
Chris Churilo 00:38:00.442 All right. So I’m going to leave the lines open for questions from some of the other people just for a few more minutes. And in the meantime, we’ll just keep chatting here. Like I said, this session is recorded, so you can take another listen to it. And don’t feel shy. Please ask your questions. And then Tim, sometimes the questions will come in a lot later on. I’ll get an email and I’ll make sure that I forward that to you so that I can at least connect you with some of the people that are—
Tim Koopmans 00:38:30.493 Yeah, not a problem. Really happy to share our experiences in an honest manner, so.
Chris Churilo 00:38:37.562 Yeah. We’ve been seeing quite a number of SaaS vendors that are using InfluxData with their service, and one of the customers recently that we just had a webinar with was NewVoiceMedia. And one of the things that kind of surprised me, and maybe it’s just me, is that they actually use InfluxData for resource planning for the next release, and what they are able to do is they can actually start to track to see, okay, which of these calls are being made, or how often are these calls being made to start to determine what is actually being used within their service.
Tim Koopmans 00:39:23.352 Right, yeah. So more like forecasting.
Chris Churilo 00:39:25.966 Yeah. Are there any other things that you guys are doing with InfluxData besides what you just articulated that might be a little—?
Tim Koopmans 00:39:35.230 Yeah, definitely. We’re really interested in trending and forecasting because now, as you can imagine, we have this huge store of information across all the customers’ tests. I guess, traditionally, the way we designed it was everything was like a per-Flood view of the world, which is fine because a lot of customers are just interested in one test’s worth of results. But as you can see, this is a demo project, but it’s got like 100 10- or 5-minute tests. So we’ve got this really big timeline of data, continuous or non-continuous, but you know what I mean. With that big timeline of data, we’re really interested in sort of getting people out of just a per-Flood or per-test view of the world and to start looking at, hey, what’s happening to my application performance over time, and can we use the different methods that are available in Influx, at least, to start doing regression analysis and finding points like this and then implementing callbacks or notifications ourselves. So yeah, really interesting what we can do with that data over the longer term, and that’s, I guess, the key reason why we haven’t really deleted data to date because, at this stage, we’re interested in sort of doing more with it.
Chris Churilo 00:41:09.867 Excellent. Okay. Well, I don’t have any more questions and, inevitably, every time I say that questions are over, someone’s going to come in with another question [laughter]. But like I mentioned, I will make sure that I forward that on to you. And I’ll also start posting some of those questions that I think would be pertinent on our community site so people can take a peek into this recording. As always, everybody, these are recorded, and I will be posting them before the end of the day so you can take another listen to them. And Tim had shared his contact information at the beginning, so if you do want to reach out to him directly, feel free to do so and ask him questions about his use of InfluxData in his infrastructure. Tim, I really do want to thank you for your time. I know you’re super jet-lagged. I know you’re completely across the other side of the world from home [laughter] and away from the kids, and I really do appreciate this webinar and our chat previous to this webinar too.
Tim Koopmans 00:42:13.975 No worries. Thanks everyone for your time. And yeah, it’s great to be able to talk about the product in a different light. So I really appreciate it.
Chris Churilo 00:42:20.870 Thank you so much. And with that, I hope everyone has a great day, and thank you very much.
Tim Koopmans 00:42:27.447 Catch you later.