PipelineFX offers a software product, “Qube!”, that helps organizations of all sizes better manage rendering for digital media applications and programs. It helps their customers achieve maximum efficiency and optimize their existing and future infrastructure. In this webinar, John Burk from PipelineFX reviews how they use InfluxCloud to gather metrics and events to help optimize their customers’ infrastructure. In addition, they use InfluxCloud to give their customers a real-time view of their current usage and billing charges. This allows their customers to maximize their investment in Qube! without having to commit to perpetual licenses that may go unused.
Watch the webinar “How PipelineFX Uses InfluxCloud to Differentiate Their Service” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “How PipelineFX Uses InfluxCloud to Differentiate Their Service”. This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• John Burk: Senior Software Developer, PipelineFX
Chris Churilo 00:00.633 Good morning everybody. Thank you for joining us in our customer use case webinar. Today, I’m excited to share with you John Burk from PipelineFX. And he’s going to go over the two different use cases that he has with using InfluxDB in his environment. The first one is something that I hear a little more often from our customers, and that is basically using InfluxDB to help drive metered pricing with their SaaS solution. And I was hearing this from a couple of customers. I thought this would be a really exciting thing for you guys to hear about and learn about how he has implemented it. And additionally, just like a lot of our customers, he’s also using InfluxDB for gathering standard system metrics. So with that, I’d like you to sit back and listen. And make sure if you have any questions, go ahead and put them in the chat panel or in the Q&A panel. We’ll make sure that we get those answered during the webinar today. And I’m going to now hand it over to John.
John Burk 01:00.361 Thank you. Hi. So yeah, I’m from PipelineFX. We make a render queuing product called Qube!. And the lion’s share of our customers are involved in special effects production and CGI for the feature film industry. So we have security considerations for them because a lot of them do movies where they’re working on shots that can’t be leaked in advance, etc., etc. And their needs are also very bursty. So they may need 10,000 cores in one night and only a couple of hundred the next night. Because with a lot of the CG, the scale is enormous. A lot of times, it’s 24 frames a second. There might be 10 or 20 layers for an individual shot. Each one of those is at 24 frames a second. Pretty soon, you’re talking hundreds of thousands of hours, or millions of hours, that need to get done by a certain time. Quite often, they have deliveries that are, say, on Friday. They realize on Monday they’re not going to make it. So they need to scale up from, say, 500 machines to 1,500. So they may just spin up a whole bunch of cloud nodes.
John Burk 02:09.397 And so in order for them to avoid having to own that many perpetual licenses all the time when they don’t often use them, we allow them to over-subscribe licenses. You can see in the graphic here that’s a good example of what happens—is we’re really busy some nights. And other days, we go days, and days, and days without needing all the licenses that we currently own. So we don’t want to buy—they don’t want to buy licenses or PQ’s. It just doesn’t make financial sense. So we allow them to burst like this. And this is just even over a one-week period. You can see here that they have a fairly bursty use. And then they may actually have periods of three weeks, or a month, or more when they don’t do anything in between shows.
John Burk 02:57.272 What we also do is we have different pricing models. We have a perpetual license product that we sell them, and then we allow them to rent licenses for anywhere from one day to one year. The sweet spot tends to be about one month. So they’ll rent for a month. But the one-month rental makes sense if they know they’re going to have use of those licenses for pretty close to at least half the minutes in a month. If they have something that’s more bursty than that, then we allow them to meter. We track the usage every minute. And then we transmit it in batches of 15 minutes. The collectors run on premise at the customer site. And if they lose network connectivity to the Internet, the collector just accumulates the batch. And once they get connectivity back again, it will just transmit the entire batch. You’d see batches of like 8 to 12 hours’ worth of data, not just 15 minutes. So occasionally, in InfluxDB, we get some late-arriving data. We can tolerate that as well.
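As an illustration of the batching behavior John describes here, the following is a minimal Python sketch. It is not PipelineFX's collector: the class name `UsageBatcher`, the `send` callable, and the tuple layout are all hypothetical. It only shows the idea of sampling once a minute, sending every 15 samples, and holding on to everything while the network is down.

```python
from collections import deque


class UsageBatcher:
    """Buffer per-minute license samples and transmit them in batches.

    Hypothetical sketch: `send` is any callable that takes a list of
    (timestamp, licenses_in_use) tuples and raises ConnectionError
    when the network is unavailable.
    """

    def __init__(self, send, batch_size=15):
        self.send = send
        self.batch_size = batch_size  # 15 one-minute samples per normal batch
        self.pending = deque()

    def record(self, timestamp, licenses_in_use):
        """Store one sample and try to transmit once a full batch is waiting."""
        self.pending.append((timestamp, licenses_in_use))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send everything that is pending; keep it all if the send fails."""
        if not self.pending:
            return False
        batch = list(self.pending)
        try:
            self.send(batch)
        except ConnectionError:
            # Offline: keep accumulating and retry on the next sample.
            return False
        self.pending.clear()
        return True
```

With this shape, an outage of several hours simply produces one large batch when connectivity returns, which is why the receiving side has to tolerate late-arriving data.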
John Burk 04:04.521 We store the license usage data at full precision forever in a MySQL database. We use that for reporting and for billing, and we need full precision. And we also store it forever because occasionally, the database is accessible to our customers through a REST API, so they can pull all of their own records, usage records, if they wanted to do an audit to compare their bills against their recorded usage. So they have the data forever. And the other part of the reason I keep it on a MySQL database is that the usage data is part of a highly relational schema, having to do with the—usage data is keyed off of a—the dispatch manager is called a supervisor, and it’s keyed off of its primary MAC address. And so our records are keyed off of that MAC address, and a lot of other things in the database such as license keys and version entitlement are keyed through that MAC address as well. And that key ties back into billing accounts, users, etc. We have a lot of highly relational schema as well as just tracking some metrics.
Chris Churilo 05:18.957 So, John, I’m just going to ask you a question about this graph that we’re looking at. So what’s on the y-axis, from 0 to 500?
John Burk 05:26.243 0 to 500, that’s a license count. That’s how many licenses they used at any one point in time. The solid bar across shows that they basically own 200 licenses, perpetual or a sum of perpetual and rentals. And then the green is the active in use. And so essentially, the area under the green and above the gold is the sum total of how much their bill will be for that day or for that time period. So they use up to about 450 licenses. They tend to cap out at that many because that’s how many machines they have on premise at the time, and they’re not actually leasing or spinning up any more cloud nodes at that time. And they don’t own as many licenses as they have machines because of the infrequent use. That excess capacity is probably them using some desktop machines in the evenings. So they probably own 200 machines that they have in a rack that are available 24/7 that they can use all the time, and then they expand out to their desktops at night for burst.
Chris Churilo 06:31.872 So then it looks like in pretty much real time, they can see how much their bill’s actually going to be based on the data that you guys are pulling for them?
John Burk 06:40.139 Yes. And one of the widgets in the dashboard (I don’t have a screenshot of it) shows them the current billing for the month up to the last 15 minutes. Because the data is reported every 15 minutes, they basically get the sum total of the usage up to the last reported record. The amount showing in the client’s dashboard is, at most, 15 minutes out of date. So that’s why it’s always stored at full precision as well, because then they can actually roll backwards and look at their data even like 18 months ago and find out how much they used over a certain period if they want to report on it themselves.
Chris Churilo 07:25.479 Oh, that’s pretty interesting. Thank you.
John Burk 07:27.251 Yep. Thank you. I want to do charting of this as well, but I don’t want to chart out of the full-precision-forever data because it’s 1,440 records per day per customer. And I can’t chart 18 months times 1,440 a day. That’s too many data points. I need to downsample. So I basically push it into InfluxDB. So I have an extract, transform, load (ETL) script which runs every minute and pulls from MySQL into InfluxDB. And I do the downsampling through continuous queries and retention policies in InfluxDB because that’s what it’s good at. MySQL is not intended for time series data. Having used it that way in the past, I can testify it is pounding nails with a screwdriver. It’s definitely not what you want to do in any relational database.
John Burk 08:25.219 Here’s an example of a customer’s dashboard. The usage data that we track is on the bottom half here. And you can see there a gold line. It varies between three and four hundred oftentimes because, if they anticipate they’re going to have a high-usage period, this customer will rent an additional 100 licenses or so for 30 days or a week, however long they think they’re going to need it. And then they just basically track it. And you can see in the middle of the chart here how they were actually using a fair bit of metered, and then they eventually went and subscribed to some more licenses about the last quarter of the chart to cut down on their metered usage cost, because the per-minute rate on the rentals is cheaper than the per-minute rate on the metered.
John Burk 09:23.128 The chart above is the per-day charge. Part of the ETL script will sum up the totals, the total metered minutes, the worker minutes per day, and then convert that into the billing amount per day. And I store that in InfluxDB as one data point per day per client. So even on the 18-month retention period, I’m only doing 18 times 30. It’s about 540 data points for an 18-month per-day charge. And because it’s only one point per day, I don’t have to downsample the charges, so I get good precision on that. Whereas the usage charts, those are definitely downsampled. The target is no more than 2K, and I think some of these charts get up around 1,800 or 1,900 data points. But that’s just to keep the whole web interface performing, because I can have many clients all banging away at the portal at the same time, and each one is always trying to update their graphs. And everybody loves to lean on the refresh button. Yeah. As I said here, the per-day charges are stored one per day. I never downsample them, whereas the usage data is downsampled.
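The per-day roll-up described above can be sketched roughly as follows. This is an illustrative Python example, not the actual ETL script: the function name, the constant `owned` license count, and the `rate_per_minute` value are assumptions made for the example (in the webinar the owned count changes as rentals come and go, and no rates are given). It treats every license-minute above the owned count as a metered minute, which is the "area under the green and above the gold" from the earlier chart.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_charges(samples, owned, rate_per_minute):
    """Collapse per-minute usage into one charge per day.

    samples: iterable of (unix_timestamp, licenses_in_use), one per minute.
    owned: licenses already paid for (perpetual plus rentals); assumed
           constant here for simplicity.
    rate_per_minute: hypothetical price of one metered license-minute.
    Returns a dict mapping ISO date (UTC) to that day's charge.
    """
    metered_minutes = defaultdict(int)
    for ts, in_use in samples:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()
        # Only usage above the owned count is billed as metered.
        metered_minutes[day] += max(0, in_use - owned)
    return {
        day: minutes * rate_per_minute
        for day, minutes in sorted(metered_minutes.items())
    }
```

Because the result is a single value per day, an 18-month view needs only about 18 × 30 = 540 points per client, well under the 2,000-point charting limit, so no downsampling is required for this series.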
John Burk 10:56.408 And I can come back to the license usage system use case in a bit. That’s kind of the custom one. Our customers’ dispatch manager is a supervisor. And it helps us troubleshoot sometimes if they have performance issues, if we can get system metrics on their supervisor. And a lot of our customers don’t have their own system metrics reporting infrastructure like Nagios, or Ganglia, or any of those others. They either don’t have the IT expertise, or they’re not motivated to do it. While the majority of our seats are sold into customers that are between 500 and 2,000 nodes, a lot of our customers themselves are in between the 20- and 50-node compute farm size. And they tend to have a single-person IT department who just doesn’t have the time to scale something like that out.
John Burk 11:57.612 So we came out with the system metrics solution as well for those clients so that they get some visibility into their own supervisor performance. And it also aids us in remote troubleshooting. They phone up, and they’ll say, “My supervisor crashed today,” and we can say, “I can see by your charts that you ran out of disk space at about 10:40 in the morning. Perhaps you just need to clean up some logs or clean up some disk space on your supervisor and restart it, and you’ll probably need to repair your MySQL database as well.” The metrics we’re gathering, for the most part, are pretty typical and vanilla. I’d say about half of them are very typical: system load, CPU utilization, memory usage, etc. And then we also have a lot of custom collectors. We track a lot of Qube!-specific metrics: job counts, how many jobs are running at any one point in time, how many jobs are in queue in the system. And when our product is very busy, it uses MySQL in a very write-intensive fashion. It turns into a high-IO database server. So we track a lot of custom supervisor metrics as well.
John Burk 13:15.127 As proof of concept, I wrote all the collectors in Python virtual environments, and I’m finding them to be problematic to package up to install in the customer. So I’ve got the Linux installs fine now. But I’m having trouble properly packaging up all the collectors for Windows because of the way that the virtual environments work. So I am actually going to investigate refactoring all my collectors in Telegraf and basically shipping Telegraf collectors that customers can install instead. And I’m hoping the packaging and installation will go a lot smoother with Telegraf than it will with Python virtual environments and custom collectors.
John Burk 14:03.781 The collectors are run on the customer’s hosts. They sample every 15 seconds, and they report through a data relay. None of the collectors on either the system metrics or the metered licensing have to talk to the Internet directly. We run a data relay agent that customers can install, in the simplest case, right on the supervisor host itself, or on another host on the network. Most of our clients are fairly security-conscious, so they run the data relay agent on a DMZ host. And so the DMZ host only needs to punch through the firewall to access a single port. On the supervisor, for the metered license usage and for the system metrics, the collectors just need to reach the data relay agent. And so they can basically configure a single outbound port on the firewalls. And then the data relay agent really only needs to talk to a single IP address on one port. Actually, they talk to an IP range because they’re all behind elastic load balancers in Amazon. But once again, it’s all a known set of hosts, and all data is encrypted with an X.509 certificate, so it’s all SSL/TLS. And all our certificates are signed by a CA, and we don’t use any self-signed certificates, because a lot of our clients have security concerns, so we had to withstand a security audit through all this.
John Burk 15:38.881 And the CQs, the continuous queries, and the retention policies for all of these are once again defined so that I limit any single chart to 2K, 2,000 data points. Most of the metrics charts are a single series, so they’re 500. So they chart nice and quickly. A couple of the charts have four series in them. And so that brings me to the magic 2K data limit. And I’ve got a thumbnail here. When they go to the dashboards, there are 15 different charts for it to draw every time they refresh or click on a new retention period. So because I limited the result set to 2K, and most of them are only returning 500 or 1K data points, all 15 charts draw in usually under half a second. And it’s got the nice HTML animation where they all bounce up. So the customers tend to like it quite a bit. And there’s also drilldown on these charts: you click on any chart, and you get a modal which takes up the whole screen and shows you the details on an individual chart, so you can track it more precisely.
John Burk 16:55.518 So the systems themselves: we have the metered license portal, which is kind of our software as a service. It’s not really SaaS, but it’s the customer’s interface to the dashboard. They’ll be in the website host. They’ll access meter.pipelinefx.com, which points to an Amazon ELB, elastic load balancer, and there’s one or more hosts behind that. I generally run at least two for high availability, and under periods of high load around four or five. I haven’t really needed to scale up much beyond that yet. All interaction is over REST APIs. The license usage collectors talk directly to the back-end hosts over REST again, and the back-end hosts number between two and, well, I’ve scaled out to about 10 just as a proof of concept. It turns out I really only need two back-end hosts right now for the load that I’ve got at the moment, but I can basically just spin up as many as I need in about two and a half minutes.
John Burk 18:05.769 And then all MySQL data is kept in Amazon RDS, multi-availability zone. It’s a two-node MySQL cluster. Through notification, I know I’ve lost a frontend host from time to time, and basically, I spin another one up. I’ve had a back-end host go down once, and I just went and gave it a reboot. And I’ve lost a MySQL RDS instance, but Amazon itself, AWS, basically managed that and spun that one back up to get me back to an HA state. So it’s pretty well-covered. At some point, I may spin up a third frontend host and a third back-end host behind the load balancers if I maybe want to go and take a longer vacation and have a week’s worth of high availability. So that’s the license usage data.
John Burk 19:05.014 And then I figured that once we needed charting, well, I’ve got a lot of experience with clustered products going back to about the mid-’90s, and I started charting data in the early 2000s, and I’ve got a lot of experience with RRDtool, the round-robin database, and then going on with some other time series databases and graphing that stuff. So I’m familiar with time series data, and I knew I didn’t want to keep it all in MySQL. And I went and looked around. I looked at RRDtool. It’s nice, but it’s very old, and graphing that stuff has always been a roll-your-own. I looked at Grafana, but the multi-tenant Grafana didn’t really suit my needs because I could have several hundred clients, and trying to maintain all the different snapshots from the clients on a multi-tenant dashboard just wasn’t suitable. So I went with a “roll your own” with an InfluxDB cluster, and I did a proof of concept originally using open-source InfluxDB. And I ran that for about a month and decided it was definitely what I wanted to go with. And yeah, I was very happy with using that. The documentation was good. And I could have good control over the downsampling.
John Burk 20:31.381 Now, I needed to get data from the MySQL into the InfluxDB, so I added an ETL host here in between. And the ETL script runs once a minute, basically pulls new records from the MySQL data, transforms them, and loads them into the InfluxDB. So every minute, I’m getting the latest data out of my MySQL. The ETL basically just tracks the highest record ID in the license usage database. And every iteration of the ETL script, I just basically get all new records starting from that ID. And that last-record-pulled ID, I also store as a single value. It’s a single-value series in InfluxDB. So I verify network connectivity to the InfluxDB cluster every time I run the ETL script, because the first thing I do in the ETL script is get the ID that I need to start pulling my MySQL data from. If I can’t get that, the whole ETL doesn’t run. So I don’t even bother hitting my MySQL unless I can contact the InfluxDB first.
John Burk 21:51.295 And then I have the system metrics, which is my other use case for InfluxDB. It’s much more standard. I have the system metrics collectors, which run on all my customer supervisors. They all report to another ELB. I think I call it metrics.pipelinefx.com, and that DNS CNAME record points to the AWS ELB in the metrics server solution here. And the only hosts that actually communicate directly with the InfluxDB cluster are the servers in the metrics server cluster that I have down here in the bottom right and the back-end hosts in the metered license portal. So my end users never actually contact my InfluxDB cluster directly. I’ve not really released that. It’s not made publicly available. And once again, a lot of it is just to allay security concerns for my customers because even though it’s just system metrics data, some of them are like, “We can’t have any data exposed.” So they tend to be very, very security-conscious. A lot of our biggest customers all tend to work on the same Marvel movies these days. It’ll be like five of our customers will be doing shots for the latest X-Men, or the next Thor movie, or something of the sort. A lot of times, they can’t even tell us which shows they’re working on, but we know it’s a Marvel show because of the security requirements. And now with Disney and Pixar, a lot of the shots for that get farmed out to third-party CG studios among our customers, and so they tend to have to be very security-conscious. This is what the whole system looks like as well, so you get the overview. And that’s about that.
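The high-water-mark pattern John describes for the ETL script can be sketched in Python as below. This is an assumption-laden illustration rather than his code: `sqlite3` stands in for MySQL so the example is self-contained, `FakeSeriesStore` is a made-up stand-in for the InfluxDB cluster, and the table and column names (`license_usage`, `mac`, `in_use`) are hypothetical. The point it shows is the ordering: fetch the last-pulled ID from the time series side first, and skip the relational query entirely if that fails.

```python
import sqlite3


class FakeSeriesStore:
    """Hypothetical stand-in for the InfluxDB cluster."""

    def __init__(self):
        self.reachable = True
        self.last_id = 0   # the single-value "last record pulled" series
        self.points = []

    def get_last_id(self):
        if not self.reachable:
            raise ConnectionError("time series store unreachable")
        return self.last_id

    def write(self, points, last_id):
        self.points.extend(points)
        self.last_id = last_id


def run_etl_once(db, store):
    """One ETL iteration; returns the number of rows loaded."""
    try:
        # Doubles as the connectivity check described in the talk.
        last_id = store.get_last_id()
    except ConnectionError:
        return 0  # do not touch the relational database at all

    rows = db.execute(
        "SELECT id, ts, mac, in_use FROM license_usage "
        "WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    if not rows:
        return 0

    # Transform: key each point by the supervisor's MAC address.
    points = [
        {"time": ts, "tags": {"mac": mac}, "fields": {"in_use": in_use}}
        for _id, ts, mac, in_use in rows
    ]
    store.write(points, last_id=rows[-1][0])
    return len(rows)
```

Run once a minute, each iteration picks up only the rows added since the previous successful run, and a run that cannot reach the time series store leaves both databases untouched.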
Chris Churilo 24:00.400 So, John, let’s talk a little bit about your InfluxDB cluster. So you are currently an InfluxCloud customer. Right?
John Burk 24:06.116 Yes. I started my proof of concept on InfluxDB open source, and then I decided this is working very well. This is what I want to do. And I learned a lot about InfluxDB in a very short period of time. But with an InfluxDB cluster, I wanted high availability and I wanted scalability, because, I mean, if, god forbid, we’re successful with this, we’re going to have 1,000 or 1,500 customers all reporting metrics at one time or another, and I can’t ever have the license portal appear to have downtime. So I needed the HA. I needed the high availability as well. And the InfluxDB cluster itself is a sophisticated product. It’s got a lot of moving parts. There’s a lot of tuning involved. There’s a lot of deployment. There’s a lot of just ongoing monitoring for a cluster product, more than for a single open-source instance. I didn’t want to have to become an expert in something I’m only going to do once. I don’t want to become the guy, and I don’t want to have to set up the monitoring for the InfluxDB cluster. My employer, we’re not a large enough company that we want to devote man-hours just to monitor an InfluxDB cluster.
John Burk 25:31.247 I just want to pay somebody to have it monitored, have it maintained, and make sure it’s always up to date, make sure the hardware is always current. I basically just wanted a managed solution. At the monthly rates for the InfluxDB cluster, it’s not a lot of money compared to a burn rate of somebody who would actually have to spend the time to learn how to manage it and learn how to install it, configure it. It’s not going to be just a, “I can start it up and let it run, and then look at it once a minute every three months.” They’re sophisticated, and there’s a moderate level of complexity there, and there’s a lot of tuning involved. And as your needs grow and as your usage profile changes, you’re probably going to have to change tuning parameters. And a lot of that involves a lot of expertise, and I don’t want to be an expert in that. I have enough on my plate already that I don’t need to become—I don’t need to become an expert on something that’s not our core business. So yeah, that’s why I went with the InfluxDB cluster, and it’s been basically, a fire and forget. I pay a pretty moderate amount every month, and I get a bunch of guys—a bunch of experts who know the product inside and out basically, babysitting it, and upgrading it, and keeping an eye on it.
John Burk 27:04.534 I mean, I’m not receiving any consideration for shilling the product. It just works really well. And it’s pretty well-priced from what I’ve seen compared to a lot of other products. So yeah, that answers that. And doing the migration, I was able to do a dump and restore. I basically did an InfluxDB dump out of my open-source instance, and I just loaded it onto my cluster. And it was pretty straightforward. It was also straightforward because I didn’t have hundreds of millions of records in my open-source instance. I basically made the cutover at the end of the proof-of-concept stage. So I had 100,000 data points or so, but not too many. So it really didn’t take very long to load. Of course, the memory usage profile wasn’t very happy at the time when I did the dump. It was a very busy cluster for about half an hour as it sorted everything out and moved it through the retention policies. But it was fine. And I did the InfluxDB cluster on the 30-day evaluation as well, just in case for some reason I wasn’t happy with it and I had to go back to the open source. I could have done that as well, but I didn’t need to. And at the end of the 30 days, I just signed up and basically kept going and paid the bill.
Chris Churilo 28:46.488 So you and I talked about this earlier, but how did you learn about InfluxDB?
John Burk 28:50.877 Okay. I have a colleague who’s a very experienced developer, so basically, word of mouth and recommendation. I’ve got a lot of experience in time series data but not necessarily with modern products. And I looked around, and there were a couple of contenders out there. But essentially, this other person had designed a product to track metrics for entire compute farms, a couple thousand nodes and a couple hundred metrics per node. His solution was scaling out to over a million records per second on ingest, and he said that InfluxDB did the trick for him. It worked really well. And I was like, “Well, if he picked it, it’s got to be good enough for me.” So I did some research, but then I just went with somebody whose opinion I respected. And so I can’t tell you all the research that he did, but he basically came up with InfluxDB as well. So I kind of piggybacked off of him [laughter].
Chris Churilo 30:05.688 Cool. Awesome.
John Burk 30:06.666 Yeah. But from what I’ve seen, it’s been the right solution.
Chris Churilo 30:13.302 So if you could give somebody advice, let’s say that they’re starting out, similar to the situation you were in when you started with this, what advice would you give to them?
John Burk 30:24.235 Oh, for InfluxDB or building out a whole thing?
Chris Churilo 30:28.204 I think for both.
John Burk 30:29.647 Okay. I’d say for building out a whole thing, I always started out by building a load balancer and putting one node behind it for everything. So I’m already set to scale out, and I know that it all works through a load balancer already. If you’ve got a REST API host, it’s one thing to have your stuff connect directly to that, and it’s another to work through a load balancer, because there’s a lot of moving parts going on in that. So I basically built out an HA infrastructure with no high availability, but it had all the moving parts I needed. So I’d have each one of these components. I had a frontend that had one host behind an ELB, and I built it that way. The one thing that I might do differently, I mean if I were to do this over from scratch, is I would start right away with my collectors being Telegraf. I started out building them in Python with Python virtual environments. I thought that would be easy. But I’m finding the packaging is problematic. So I would just go with Telegraf to begin with. I would also really look at how much data your customers need. Your customers will always want to look at everything at full precision. That’s not really feasible.
John Burk 31:58.300 So I would say it’s not what they want; it’s what do they need. I figured my customers would need, for some stuff, 48-hour, one-week, 3-month, and 18-month views for retention policies. And then I figured out, how many data points make a recognizable chart that’s still small enough that it charts quickly? I did a lot of just seat-of-the-pants benchmarking, and it seemed to be about 2,000 data points on my website, which is AngularJS and Bootstrap using ChartJS. And after about 2,000 data points, they started to chart a lot slower. It seemed to be an inflection point. So I basically picked 2K as the upper limit on anything I ever wanted to chart at any one point in time. And then I worked backwards from those values, figuring out what I wanted for my retention policies and continuous queries. And then I wrote all the CQs so that each time I would chart on a retention policy or retention period, I was only getting 2,000 or less. So yeah, I kind of worked backwards from that. Yeah.
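To make the "work backwards from 2,000 points" idea concrete, here is a small Python sketch. It is an illustration under assumptions, not John's actual configuration: the function names, the rounding to whole minutes, the database, measurement, and retention policy names, and the choice of `mean()` as the aggregate are all hypothetical. The generated statement follows the InfluxDB 1.x InfluxQL `CREATE CONTINUOUS QUERY ... BEGIN ... END` form.

```python
import math

MAX_POINTS = 2000  # upper limit for any single chart, per the talk


def bucket_seconds(window_seconds, series_per_chart=1,
                   sample_seconds=60, max_points=MAX_POINTS):
    """Smallest GROUP BY interval that keeps a chart within max_points.

    The interval is rounded up to a whole multiple of the raw sample
    interval so that every bucket covers complete samples.
    """
    per_series = max_points // series_per_chart
    raw = math.ceil(window_seconds / per_series)
    return max(sample_seconds, math.ceil(raw / sample_seconds) * sample_seconds)


def continuous_query(name, database, measurement, field, target_rp, seconds):
    """Build an InfluxQL continuous query that downsamples one field."""
    return (
        f'CREATE CONTINUOUS QUERY "{name}" ON "{database}" BEGIN '
        f'SELECT mean("{field}") AS "{field}" '
        f'INTO "{database}"."{target_rp}"."{measurement}" '
        f'FROM "{measurement}" GROUP BY time({seconds}s), * END'
    )


if __name__ == "__main__":
    day = 24 * 3600
    views = {"48h": 2 * day, "1w": 7 * day, "3mo": 90 * day, "18mo": 540 * day}
    for label, window in views.items():
        step = bucket_seconds(window)
        print(label, step, "s per point,", window // step, "points")
    print(continuous_query("cq_usage_48h", "qube_metrics",
                           "license_usage", "in_use", "rp_48h",
                           bucket_seconds(2 * day)))
```

With one-minute raw samples this gives a 120-second bucket for the 48-hour view (1,440 points) and larger buckets for the longer views, each landing just under the 2,000-point ceiling.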
Chris Churilo 33:28.749 That makes total sense. I mean, I think you’re absolutely right. As human beings, we get really enamored with charts, and we think, “Yeah. We want as much data as possible.” But I think you’re right. There’s definitely a limit where we don’t need to have all that precision. We don’t need to have every single point on there.
John Burk 33:44.306 Yeah. Especially going back to the thumbnails right here. A lot of times, if I’m looking at 30-day metrics, I don’t need to look at per-minute precision. I just need to look at overall trends on the dashboards and such. The other thing is, it’s tempting to try and track too much data, and you could find that a lot of times, you track stuff that is of no use whatsoever. I’ve done that in the past. If you’ve got a chart but it doesn’t really tell you much, and it doesn’t correspond with any loads, and it doesn’t answer a question that a customer wants answered, you shouldn’t be showing it to begin with, because sometimes too much data is worse than not enough data. People just don’t know how to interpret it. Watch out how much you chart. And a lot of times, if you have to explain a graph, it’s too complicated. It’s like jokes: the ones you have to explain, not so much. If you have to explain a graph, you’ve failed [laughter].
Chris Churilo 34:56.283 I think you’re so right in that. I think we’ve all done that just because we can grab the metrics. Yeah. Our first inclination is let’s just chart it. But if it’s not really going to do anything, we’re wasting our time.
John Burk 35:07.389 Yeah. If you can’t show it to somebody—if you can’t show somebody a graph, and they can look at the graph and know what question it answers, the graph shows you. It even shows you which question it’s answering, and then they can make sense of it at a glance. That’s a successful graph, and you’re tracking the right amount of data. And a lot of times, you don’t need all the precision. You don’t need a lot of high precision. You just need a recognizable curve because—the detailed chart here is a 24-hour chart. But I’ve sampled however many minutes, 1,440. There’s 1,440, but because I’ve sampled it down to 500 data points, and there’s four series there in that single-detailed chart, that’s basically trying to chart 2,000 data points across four series. But there’s only 500 data points in any one series, but it makes a recognizable curve. So people will say they want all the precision, but they don’t really need it. So yeah, it’s a trap you can easily fall into, too much precision, too much data, too many charts, too many graphs.
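The 1,440-to-roughly-500 reduction John mentions can be illustrated with a simple bucket average. This Python sketch is only an example of the general technique; in his setup the downsampling is done by InfluxDB continuous queries, and the function name and the use of a plain mean are assumptions.

```python
import math


def downsample(values, max_points=500):
    """Average consecutive samples so that at most max_points remain."""
    if len(values) <= max_points:
        return list(values)
    size = math.ceil(len(values) / max_points)  # samples per bucket
    buckets = (values[i:i + size] for i in range(0, len(values), size))
    return [sum(bucket) / len(bucket) for bucket in buckets]
```

For a day of per-minute data, 1,440 samples are averaged in groups of three, leaving 480 points per series; four such series on one chart come to 1,920 points, which stays inside the 2,000-point limit while still drawing a recognizable curve.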
Chris Churilo 36:28.289 Yep. We see it often [laughter]. Like I said, we just get kind of enamored with these things. And it definitely takes experience to kind of hold ourselves back.
John Burk 36:38.641 Even these 15 charts I’ve got here, I started out with 14, and I had one tile, one chip, as a leftover, because I knew I was going to want one. I didn’t know which one I—and then I added in one other metric, which it turned out wasn’t actually very useful, and it didn’t correspond—a lot of this, we use to look for—we’re trying to troubleshoot system performance when things tank, when things get overloaded. And the metric I was tracking, I thought would correspond with some sort of period of overload, but it never did. And then while we were looking at it, the question kept coming up, “Well, how many Qube! worker licenses am I using at the time?” And I was always having to flip over to my usage metric chart, which was the other one with just the gold and the green chart. And that one wasn’t even in the right precision because the highest precision I had was 48 hours instead of a 2-hour chart. So I realized I also needed the Qube! worker licenses, which I have down here in the bottom left, second from the bottom. I also duplicate that series or that data chart over here as well but in a different precision. Because it turned out, there was a question people always wanted to answer in a different precision that I was showing somewhere else.
John Burk 37:58.869 So that turns out that I’m actually showing the data there twice because it’s so useful in different contexts. So occasionally, that happens as well. But a lot of it is—a lot of it is going into this knowing you’re not going to get it right the first time. You’re going to bring it up, and you’re going to bring up a metrics chart like this or metrics set of chips, and then you’re going to find that over a—you’ll think you’ll use it a certain way, and your customers will always do something different. And then once you talk to enough customers, you realize that 9 out of 10 customers are still—they’re asking the same question that’s not answered. And so you’ll say, “Okay. So I need to add that into the general use case.” And there will always be one guy that will always have all these exceptions. And you just can’t make everybody happy [laughter] unless he’s your absolute biggest customer by a factor of two to everybody else. Sometimes you’ve got to accommodate that guy. But often, you try and answer the questions that most of the people ask in a clear fashion.
Chris Churilo 39:10.453 Well, I mean, it’s clear that you listen to your customers. I think these charts are very nice. And I think even for me, I’m not your customer, but I can tell just looking at this what I’m looking at and what I’m trying to determine by looking at this. So I think very nice job here.
John Burk 39:27.235 So those ones at—this chart on the bottom left, the install chart, they actually use that one a fair bit as well to look at historical usage. It’s like, “Had I bought enough licenses? Would I have—” because they second-guess themselves all the time. They’re like, “Did I buy enough licenses for the period in May and early June?” And they’re like, “Maybe next time, I should think that—maybe I should have bought 50 more licenses even though I was often peaking over 200 more than I use.” So they refer to that a lot, too. And the per day charges, it doesn’t answer things so much. But they can actually see peak because it always comes down to money. How much did I spend? When did I spend a lot of money? Then they can look back and say, “Okay. I spent a lot of money middle of June.” And they say, “Well, that was because we failed in our scheduling. And we had an emergency—we had an emergency to meet our deadline. So we had to spend some more money.” Because a lot of our customers, they have penalties if they’re late, per-day penalties. So sometimes, they actually can save money by spending money. What a concept. Every now and then, they just—and then a lot of times, it just—people like charts. It’s like blinking lights on computers. I’m just going to say it. Sometimes it just makes them happy [laughter]. But you can’t show too much. But having all these charts, it gives customers visibility into their history and why they’re spending money with us. And it helps make them comfortable and feel that they are getting good value for what they’re spending.
Chris Churilo 41:14.195 Right. Right. So I have one quick question, one last question. So what are some of the things that you have on the roadmap for the next iteration of this?
John Burk 41:29.085 Nothing at the moment. I mean, I’ve got so much else. This is the thing. Getting to this point and then move on to my other stuff that has been lagging behind while I did this. But one thing that I do anticipate doing very soon is converting my System Metrics Collectors to Telegraf. Because right now, we’re having a real challenge getting the system metrics adopted. Our larger customers that all run Linux supervisors have all adopted it fairly readily. And so we’re getting visibility into their sites. But the midrange customers from the 20 to 50 or 20 to 100, a good chunk of those are running Windows supervisors because they tend to be Windows shops. And so I’m having issues with the packaging there. And I think going with Telegraf would help me drive adoption of the system metrics, which is a value-add to our customers. It would help them and help us as well. So there’s that.
John Burk 42:35.469 Right now, I think the metered licensing usage is—we have some stuff on the roadmap. But it doesn’t involve necessarily the InfluxDB. It involves some automated notifications based—our customers are able to set dollar limits, and, “I only want to spend $500 this month on metered usage.” And basically, the system will cut it. When they set that limit themselves, the system will cut them off. And it will basically stop enabling metered license usage on their supervisors. So right now, I have a notification when it cuts them off. But I just need to go in and add in some notification based on 75 and 90 percent warnings as opposed to the system has disabled metered licensing due to your hitting a limit.
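The planned 75 and 90 percent warnings could look something like the following Python sketch. The function name `spend_status`, the status strings, and the integer-percent comparison are all hypothetical; the transcript only states that a customer-set dollar limit disables metered licensing and that earlier warnings are wanted.

```python
# Warning levels mentioned in the talk, as whole percentages of the limit.
WARNING_PERCENTS = (75, 90)


def spend_status(spent, limit):
    """Classify month-to-date metered spend against a customer-set limit.

    Hypothetical helper; returns "ok", "warning_75", "warning_90",
    or "disabled" once the limit is reached.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if spent >= limit:
        return "disabled"
    # Check the highest warning level first so only one status is returned.
    for percent in sorted(WARNING_PERCENTS, reverse=True):
        if spent * 100 >= percent * limit:
            return f"warning_{percent}"
    return "ok"
```

For the $500 limit used as the example in the talk, a spend of $375 would return `warning_75` and $450 would return `warning_90`.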
Chris Churilo 43:22.422 That makes sense. That makes sense.
John Burk 43:24.087 Yeah. But right now, my InfluxDB stuff, all the charting, is pretty much, I’m going to call it a 1.1. I’m at 1.1 right now. And the 2.0, we haven’t actually started working on yet because this is still very new to us. But there’s never been any showstoppers with the time series data that have required a rethink, because quite often, when you start to do a new product or a new feature, you’ll go one way. And then you’ll hit a showstopper of some sort. And it’s like, “Okay. We’ve got to take two steps back and rethink this.” That’s never happened with the time series data stuff. It’s like, “Okay. How do we do this?” Well, read the docs. “Oh okay. That seems to work.” And then I tweaked a lot of my continuous queries early on in the proof of concept. But what I did was I always had all my retention policies and continuous queries created from a setup script. So I had that automated every time. So basically, I would make iterations in the setup script, and I would just rerun it. So if I ever have to rebuild it, I can just rerun the script and rebuild it from scratch. That was when I was running on a single host or the open source because I was always worried about losing that host. So I wanted to be able to deploy another InfluxDB setup in a couple of minutes. So I had that.
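A setup script of the kind described above might generate its InfluxQL statements along these lines. This is only a sketch assuming InfluxDB 1.x InfluxQL syntax; the database, retention policy, continuous query, and measurement names are invented for illustration, and sending the statements to a server is deliberately left out.

```python
def retention_policy(db, name, duration):
    """Build a CREATE RETENTION POLICY statement (InfluxQL, 1.x style)."""
    return (
        f'CREATE RETENTION POLICY "{name}" ON "{db}" '
        f"DURATION {duration} REPLICATION 1"
    )


def continuous_query(db, name, measurement, source_rp, target_rp, interval):
    """Build a CREATE CONTINUOUS QUERY statement that stores interval means."""
    return (
        f'CREATE CONTINUOUS QUERY "{name}" ON "{db}" BEGIN '
        f'SELECT mean("value") AS "value" '
        f'INTO "{db}"."{target_rp}"."{measurement}" '
        f'FROM "{db}"."{source_rp}"."{measurement}" '
        f"GROUP BY time({interval}) END"
    )


def setup_statements(db="example_metrics"):
    """Return every statement needed to rebuild the (hypothetical) schema."""
    return [
        f'CREATE DATABASE "{db}"',
        retention_policy(db, "two_days", "2d"),
        retention_policy(db, "thirty_days", "30d"),
        continuous_query(db, "cq_usage_1h", "usage",
                         "two_days", "thirty_days", "1h"),
    ]


if __name__ == "__main__":
    # Print the statements; a real script would submit them to InfluxDB.
    for statement in setup_statements():
        print(statement)
```

Keeping all definitions in one place like this is what makes the rerun-and-rebuild workflow practical: changing a continuous query means editing one function call and running the script again.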
Chris Churilo 44:56.791 Cool. Very, very cool. Well, I appreciate you sharing your use cases with us. And I’m sure our customers appreciate it, too. And I also appreciate the kind words about the ease of use. And with that, I think we will conclude our webinar and open it up for questions.