How a SaaS Leader Uses Monitoring to Lower CapEx, OpEx & Provide Near 100% Uptime
In this webinar, Sanket Naik, VP of Cloud Operations and Security, and Hans Gustavson, Director of Site Reliability Engineering at Coupa, will be sharing how they use InfluxData as a key component to derive operational metrics of their spend management platform. In particular, they share their team’s best practices with using InfluxData that helped them achieve a consistent track record of delivering close to 100% uptime SLA across 13 major product releases and 5 major product module offerings.
Here is an unedited transcript of the webinar “How an experienced SaaS leader uses monitoring to lower CapEx, OpEx and provide a near 100% uptime.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Sanket Naik: Vice President of Cloud Operations and Security, Coupa Software
• Hans Gustavson: Senior Director of Site Reliability Engineering, Coupa Software
Sanket Naik 00:00.452 All right, thanks, Chris. And good morning everyone if you’re in the US or good evening wherever you are in the world. We’re happy that Chris has invited us to talk to you today. My name is Sanket Naik. I’m Vice President of Cloud Operations and Security at Coupa Software. Along with me, I have Hans Gustavson, who is a Senior Director of Site Reliability Engineering. So our mission here at Coupa is to enable organizations to spend smarter and save money by delivering the most innovative, easiest to use, fastest-to-implement and cost-effective technology in the world. So whether it’s an engineer in a technology company who needs to purchase a mouse, a manager in a retail store who needs to order supplies for the store, a worker at a construction site who needs to order manufacturing equipment, or a nurse in a hospital who needs to order gauze, or a marketing person who needs to get contract services or a market campaign, what they can do is we help all of our customers order their goods, receive goods, manage the inventory, receive invoices, and reconcile payments to suppliers. All with proper management approval, with full visibility to how that impacts the budget, and within the guidelines that they’re getting from their finance or procurement department.
Sanket Naik 01:31.797 And so how this helps our customers is that once they get this end-to-end spend within their system, with a system that’s very easy to use, I mean, think of us as the amazon.com for businesses. It’s that easy to use. And once end users are using the system, they have all this transactional spend within the system, now the finance organizations and our customers can do some very interesting things. They can analyze spend against their own data, they can benchmark their spend against others in the community, they can find additional savings through contract negotiations with suppliers, or new payment discounts with suppliers. They can check the feedback on suppliers and find quality suppliers, and then they can do group orders to find discounts. So we are primarily based out of San Mateo in California in Silicon Valley, and we have global support and delivery centers in Dublin in Ireland, Pune in India, and Reno in Nevada. Our customers span across the globe. And we process a high volume of transactions on a daily basis. And we’re constantly looking to provide a near 100% uptime.
Sanket Naik 02:49.732 So when we looked at our landscape from monitoring and technology in 2016, we realized that we needed to do something. We needed to refactor our approach in order to meet the success metrics we had in mind and to deliver on our core value number one, which is ensuring customer success. And we saw certain limitations. We saw limitations where we had multiple point solutions. Some for system monitoring, for application performance monitoring, for business process monitoring. And these disparate systems had some limitations. They had limited capability to do visualizations across aggregate transactions. We had restrictions as how much data we could retain to do trend analysis. We had our data locked into data storage with certain commercial vendors, and we were not able to extract the data to do analysis. And then we wanted to collect a lot of additional metrics. I mean, ideally, we want to collect as many metrics as possible. We want to empower as many teams as possible in Coupa. Not just operations, but also support, but also engineering, to collect as many metrics as possible and quickly and rapidly create new monitors. And this was very time-consuming with our existing landscape. And then one of the things that we wanted to do, that some of us had done in the past at other companies, was doing much more advanced capacity planning and forecasting. But doing it with the tools that we had was complex and manual.
Sanket Naik 04:31.588 And so we did some research. We talked to our peers and we discussed internally, and we came up with what we call our monitoring maturity model. And we see this as an evolution. So it starts with making sure we’re collecting all the data that we’re able to correlate and [inaudible]. Then we are able to look at historical events and do trend analysis, and then use those trends to do forecasting, so that we are better prepared, more proactively, for future events. And then as we get more mature, being able to do faster, and in some cases automatic, root cause identification, and then enhance our capability to do auto remediation. We essentially built our platform as in [inaudible]. We use configuration management like Chef, and we want to do more auto remediation so that we’re able to fix issues faster, in some cases without having to alert people. And then going beyond that, being able to do more predictive [inaudible] analysis around many scenarios, so that we can plan our future growth better, whether that’s planning infrastructure growth or planning certain business models.
Sanket Naik 05:52.183 And so the next step, obviously, was to look at, “What are the available solutions within this space?” And there were certain key selection criteria that we narrowed down to. One was that we wanted to make sure that there’s a lot of [inaudible] whether it’s operations, or developers, or support, or business. We want people to be able to get to the data and be able to do debugging if they can. And we wanted to enable people, and empower them and get out of their way so that they can be successful. We wanted a near real-time pipeline, capture a lot of events, retain metrics for a long time, be able to do a lot of powerful search and visualization. And then obviously since we are trying to deliver a high level of uptime, we also wanted to make sure that the system was scalable and highly available. And so we looked at about nine different solutions, both open-source and commercial. We narrowed it down to three finalists, and did a proof of concept with them. Hans will talk about how we rapidly prototyped these solutions. But essentially we had set a goal that we wanted to complete the evaluation in two months. But with InfluxData, what we realized, as we got into the weeds of prototyping it, was that we were able to implement it so quickly that we were able to finish a prototype in four weeks with some tangible results that we were able to use right away. And then we were able to do a full rollout in about two months.
Sanket Naik 07:41.290 And so, this was a great experience. And we are at a point now where it is materially helping us in some key areas. It has improved our visibility across many areas that we previously couldn’t see. We are able to do a lot of proactive anomaly detection. So here you see a heat map that we are doing across certain key transactions across all of our customers. And so whenever we see something that’s a hot spot in here, and then we’re able to dive into that with our developers, and proactively resolve an issue, and we have changed the [inaudible] in certain areas where, instead of our customers finding and reporting certain issues, we’re able to proactively identify and fix issues even before customers find them. So I think this is a great result that we have seen out of this. And so now Hans is going to now talk about how we went about implementing it.
Hans Gustavson 08:48.793 Thank you, Sanket. Thank you, everybody, for allowing us the opportunity again to come and speak to you and share our journey and experience with InfluxDB. As Sanket said, we looked at a lot of open-source and commercial options, and what I want to discuss is our prototyping of the InfluxDB TICK stack, and then how we iterated and where we are today. And then share some of the things that we’re a little long term, forward looking—areas that we want to continue to develop upon. When we started the POC, our objective was to very quickly identify if it could satisfy a number of core use cases. And so to that end, we wanted to keep it very simple, and iterate as we went through that process. And in the diagram, which you see is a very simple representation of some of the system, platform, and application-level metrics and some of the tools that we use. But we narrowed down the POC to simply—we wanted to roll out the Telegraf agent and evaluate the various plugins that come natively with the agent. We wanted to collect that data and store it inside the InfluxDB Time Series Database. And then we wanted to be able to visualize that data.
Hans Gustavson 10:10.232 At the time of our POC, Chronograf was not generally available. It was bundled with the Enterprise package, so we went with Grafana. And if any of you have been evaluating various solutions out there, Grafana is well adopted, provides very rich visualization and graphing capabilities. And then additionally, we wanted to be able to begin to evaluate monitoring capabilities. So be able to write monitors and then test their alerting. So we implemented Kapacitor as well, and started to create some very simple TICKscripts that were using batch and stream processing capabilities. You can go to the next slide.
Hans Gustavson 11:01.613 So our initial architecture, again keeping it simple, our services are hosted on AWS. We are operating in about seven different regions within AWS. So initially we wanted to be able to test collecting metrics in each region. Again it was limited to just the system and platform metrics. We did not introduce any kind of high availability or fault tolerance. We were not concerned about data retention. Again this was just more of a functional test to satisfy some very basic use cases. And originally we had thought that we would be deploying an Influx database in each region, and having a Grafana localized in that region. And later I’ll talk a little bit more about what we did in the final solution. And then the deployment was predominantly manual. So we hand-built the server side, so that includes Influx, any other relays that we used, as well as the Kapacitor and Grafana systems. For the Telegraf agents, we did leverage our Chef configuration management to deploy that and set that up. And then once we deployed the architecture, one of the key points for us was to make sure we had a small set of individuals who’d be testing and validating the system. So that included participation from the cloud operations team, as well as development.
Hans Gustavson 12:42.763 And the results from that initial test: it literally took us, as Sanket said, about a couple of weeks to roll it out, collect metrics, and start to get the type of visualizations that you see in front of you. This is just a simple screenshot. It’s very powerful, from our perspective. One of the things that we do within Coupa is we have a concept of a deployment, which is just a logical grouping of various servers that perform different roles, and being able to see how those systems are behaving together. And so what you’re seeing here is a collection of those systems, and just a simple metric of a five-minute load average. But this is something we did not have in our legacy solution. In our legacy solution, it was using a more traditional MRTG or RRD type of chart that, as the data aged, would normalize it, roll it up, and summarize it. So I could not go back, for instance, a week, and actually look at the granular results over, let’s say, a 15- or 30-minute period. With Influx with Grafana on top of it, it was very powerful and allowed me to do that easily. In addition, Telegraf just natively came with a huge number of metrics that it’s collecting. So it was much more rich, in terms of the metrics, than, again, our previous solution. So while we don’t use all of them, we do collect them, and we’ve instrumented our dashboards for those. And then again, it’s very easy to create use case-specific dashboards and share them. Again, this was something that was not available to us in the past. If somebody created a view in our prior tool, they were unable to share that. And one of the biggest impacts was our ability to do capacity planning now, in a way that made sense for us. In the past, we had to collect metrics from individual systems, take the data into separate spreadsheets, and run through different scenarios. With this, we could build a contextualized view, and immediately be able to go through our environment and identify areas where we maybe have to scale up, or where we have scale-down opportunities.
Hans Gustavson 14:59.549 But it’s not all perfect, right? This was a learning opportunity for us. So one of the things we learned through that first phase was, really, we had to be cautious about queries. Sometimes, we would create queries, let’s say, queries that show me a month’s worth of data across a large data set, and the system would try and fetch millions of data points. And that literally would crush our little single-node Influx database server. And so that was really an issue in terms of how we architected the environment. We also found that the idea of having a Grafana end point in each region did not make sense. There were too many end points to look at. We felt that we wanted to have as few end points as possible. We also found that users were creating a lot of dashboards, and we quickly had a problem where we were unsure which dashboard to use, and of the data behind it. So for the dashboards, there needed to be some sort of control around which dashboards were published, and areas where people can kind of create and play, and not worry about impacting the ones that we want to share with a broader organization. Excuse me. And then last, TICK. We were, at that point, just getting our heads around the TICKscript language. And so that took some time for the engineers to start to get the basics of how to use. But I have to say, we very quickly realized that Influx and the TICK stack was the most promising of the solutions that we evaluated. We already knew that we wanted to stay with this long-term. By the end of our POC, we were already discussing how we would reimplement it as a long-term solution.
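To make the query problem above concrete, here is a minimal Python sketch of the arithmetic behind it. The series count, sample interval, and point budget are illustrative assumptions, not figures from the talk, and the helper names are hypothetical. The sketch estimates how many raw points a wide query would fetch, then picks a bucket width that keeps each series under a budget.

```python
import math
from datetime import timedelta


def estimated_points(series_count: int, time_range: timedelta,
                     sample_interval: timedelta) -> int:
    """Rough count of raw points a query would fetch across all series."""
    return series_count * int(time_range / sample_interval)


def group_by_interval(time_range: timedelta,
                      max_points_per_series: int) -> timedelta:
    """Smallest whole-second bucket width that stays under the point budget."""
    seconds = time_range.total_seconds() / max_points_per_series
    return timedelta(seconds=max(1, math.ceil(seconds)))


# Assumed workload: 200 series sampled every 10 seconds, queried over 30 days.
month = timedelta(days=30)
raw = estimated_points(200, month, timedelta(seconds=10))
bucket = group_by_interval(month, max_points_per_series=1000)
print(raw, bucket)
```

With these assumed numbers the raw query touches tens of millions of points, while grouping into buckets of roughly 43 minutes would return at most 1,000 points per series.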
Hans Gustavson 16:56.565 And so what I’m showing you now is the second generation. This is what we actually have deployed out into our various environments. This is representative of our development, staging, and production environments. So, one thing we wanted to make sure of was high availability. To accomplish that, we did procure the Enterprise entitlement from Influx, and we’re taking advantage of their HA capabilities, and I’ll share a little bit more about that in the next slide. We also made a decision to ship all of the metrics from the remote regions into a centralized region. So what you see is all of the regions in the EU, Asia Pacific, and the West Coast of the US are being forwarded back to the US East region. And by doing that, we have two Grafana end points. So we have one that is specific for the non-production environments, and then, we have one that is for production. And in addition, to help with some of the performance of our queries and to protect us from some people writing queries that might impact the infrastructure, we implemented retention policies that essentially age the data into different databases.
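The talk does not say how data is rolled up as it ages between databases, so the following is only a plain-Python illustration of the general idea, not InfluxDB’s API. It collapses raw `(timestamp, value)` points into fixed-width buckets and keeps the mean of each bucket. The function name and the 5-minute bucket are assumptions.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Collapse (unix_ts, value) points into (bucket_start, mean_value) pairs."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Four raw samples rolled up into 5-minute (300 s) buckets.
raw = [(0, 1.0), (60, 3.0), (300, 10.0), (359, 20.0)]
print(downsample(raw, 300))
```

Older data stored this way is cheaper to query over long ranges, at the cost of the fine-grained detail that the raw tier keeps.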
Hans Gustavson 18:19.076 We also used this time to automate how we actually deploy the infrastructure, so we used Chef to create the automation to deploy the Telegraf agent onto the servers, and we also have it depend upon a server’s role. So we use a tag from EC2 to identify a server’s role. And based upon that, we then enable the specific plugins for that server to begin emitting. So, for instance, if we’re building, let’s say, something we call a utility server, it’s running things like Sphinx, Memcached, Redis. Those plugins will be automatically enabled to start shipping metrics. Additionally, for the server side configuration, for Kapacitor, for Influx, Grafana, and we introduced Chronograf as well, we have automated the build-out and deployment of those servers. We also made a decision on the Telegraf configuration, to create a standard base profile. I talked about that just a second ago. Depending upon the server’s role, it will always deploy a standard set of checks. But then, there are times where a developer may want to have additional metrics instrumented on top of that server. So we have a method now, a solution where again we ensure that we have the basic metrics always being sent from that server. And then any additional ones, the developer can write and add that into the extended profile.
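The base-plus-extended profile described above can be sketched in a few lines of Python. This is an illustration only: Coupa’s implementation is in Chef and is not shown in the talk, and the role names, plugin names, and the `plugins_for` helper here are hypothetical placeholders rather than a verified list of Telegraf plugins.

```python
# Hypothetical profiles: every host gets the base set, roles add their own.
BASE_PLUGINS = ["cpu", "mem", "disk", "net"]
ROLE_PLUGINS = {
    "utility": ["memcached", "redis", "sphinx"],
    "web": ["nginx"],
}


def plugins_for(role, extended=()):
    """Base profile, then role plugins, then developer-added extras."""
    ordered = BASE_PLUGINS + ROLE_PLUGINS.get(role, []) + list(extended)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(ordered))


# A utility server whose developer asked for one extra, custom check.
print(plugins_for("utility", extended=["redis", "custom_app"]))
```

Because the base profile is always placed first, an unknown role still yields the standard checks, which matches the guarantee described in the talk.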
Hans Gustavson 20:06.346 We also integrated it with our single sign-on solution, and we enabled SSL across all of the connections between the clients and the various endpoints in the environment. And we used this opportunity to start to bring in additional data sources. So we started to go beyond just simply the system and platform metrics. We use New Relic for our application performance monitoring. And we started to collect application performance metrics from New Relic and store that within Influx, as well as from CloudWatch. For some of the services in CloudWatch that we use, like SES and elastic load balancing, we pulled metrics for those as well into the system. And then we implemented Chronograf. Last year, Influx open-sourced that and made that generally available to everybody to use, so we were starting to play with that. And then we began to put a plan in place to port our legacy monitors into Kapacitor. So we’ve been working on our first set of monitors that we want to pull out of the system.
Hans Gustavson 21:14.270 In terms of the server architecture, this is just simply a visual diagram showing—in our US East region, we have a cluster of three nodes. The entitlement’s based upon the size of each one of those nodes, how many CPUs you have. And then we’ve also set up a dedicated cluster for the meta nodes and have placed all of this in front of an elastic load balancer. An elastic load balancer also is sitting in front of our Grafana and Chronograf endpoints. So, as metrics are shipped from a region, in that region they’re sent through an Influx relay, and then they’re forwarded into the US East region through the elastic load balancer, and then stored on the Influx database server. From an end user perspective, we have users coming in through a reverse proxy and accessing the Grafana and Chronograf instances. And we have multiple instances, so we have high availability on there as well.
Chris Churilo 22:19.357 Hans, would you mind just making that statement again? For some reason, some of the words just got lost in the internet.
Hans Gustavson 22:26.629 Which one? The end user or the server?
Chris Churilo 22:29.688 Just the server side.
Hans Gustavson 22:32.136 So, again, on the server side, what I’m showing here is how we laid out our Influx database nodes. So we have three nodes in our cluster, and it’s dedicated for the Influx database data. And then we have another cluster that is dedicated for the meta nodes. And we’ve been very happy; I’ll talk about the benefits of this. But all of those are sitting behind an elastic load balancer, so any of the traffic that is coming from the remote regions, the metrics data coming from the Telegraf agent in a remote region, would route through an Influx relay in that region, and then that relay will forward it to the elastic load balancer end point, and then into the Influx database cluster.
Hans Gustavson 23:33.175 So some of the benefits that we’ve seen again, from this last iteration, the biggest one is just a huge performance improvement. Going from a single node, while very performant, to a three-node cluster that is scaled properly, has been immense. The time it takes for the queries to run and produce results is significantly faster than before. For an end user, it provides a very satisfying experience of not waiting for the data to be rendered. We also now have the ability to overlay our application, platform, and system metrics, again in ways that we have not done in the past. So, in the past we had application metrics locked up in New Relic and system metrics locked up in a legacy tool. Now that both of those are collocated, we’re able to look at the application metrics, so things like how long page view iterations are taking, or the number of pages, and then how that correlates to system and platform resource utilization.
Hans Gustavson 24:49.332 Development now, after having used the tool, is very excited. And they have added to their road map a number of projects where they’re going to be leveraging the TICK stack. So, for instance, in our CI/CD pipeline, we use Jenkins. For those build jobs, we’re going to be emitting metrics on the success and the states of those jobs. We use a SaaS-based project for application errors. We’re looking to have those now injected into the stack. Security events, right? So, the application security team is now looking to inject metrics in there. And lastly, the application teams, now we’re instrumenting in the stack application metrics themselves. We’re a Ruby shop, so there’s a Ruby library that’ll emit statsd metrics and then forward them into TICK for us to collect and measure. So we’re right in the midst of that, and I’ll talk some more about some of the things that we’re doing there.
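For readers unfamiliar with statsd-style metrics, the wire format is a short text line of the form `name:value|type`, usually sent over UDP. The talk refers to a Ruby library; the sketch below uses Python purely for illustration. The metric names are invented, and the host and port are assumptions (8125 is the conventional statsd port).

```python
import socket

VALID_TYPES = {"c", "g", "ms"}  # counter, gauge, timer


def statsd_line(name, value, metric_type):
    """Format one metric in the plain statsd text protocol."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported statsd type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metrics: a Jenkins build counter and a page render timer.
print(statsd_line("jenkins.build.success", 1, "c"))
print(statsd_line("app.page.render", 12.5, "ms"))
```

Because the transport is UDP, an application that emits metrics this way does not block or fail when the collector is down, which is one reason the format is popular for in-application instrumentation.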
Hans Gustavson 26:10.142 And then what we’re seeing also, I mean, is, now that we have solved the basic use cases, a number of teams are working on more advanced use cases, and using a lot of the native algorithms that come out of Kapacitor as well as building our own models. So I think Sanket showed a chart, a graph. One of those was built by our cloud engineering developers, basically pulling a number of metrics, evaluating historical data with current data, and providing a view that shows us where, potentially, we have customer issues. So it’s been very, very valuable. We’re actually more proactive now [laughter] than we have been in the past, with this capability. And we’ve extended this to more teams. So, besides just cloud operations and development, we have teams in integrations and throughout support that are using the tool.
Hans Gustavson 27:13.118 So what are some of the challenges we still have? As we are allowing developers to use the tool and write monitors, for instance, we’re realizing we need to provide a developer guide with some standards and best practices. And this is going to ensure that teams are very basic. Let’s say if they’re writing some sort of alert standard, we want them to use a certain format for those, so we have consistency when it goes into our alerting tool. We’re also wrestling with statsd and some of the application metrics, in terms of how we configure those metrics to be shipped out. So we’re having to constantly iterate very closely with the team on the format, how we’re retaining the data. In fact, we’ve kind of overwhelmed the application metrics database. But we are working with Influx on this very closely, and we’ll optimize this. So again, this is an area that we’re learning and working through. We’re continuing to learn and expand our knowledge around Kapacitor and the TICKscript language. And lastly, for the monitors, one of the issues we have is, as I mentioned earlier, we were porting a lot of our legacy monitors into Kapacitor. And at the scale we’re operating, the use of static threshold-based monitors really just doesn’t work. We get far too many alerts. And so, this is where we’re looking at some of these advanced features of Kapacitor to build more intelligent algorithms and models for generating those alerts. So, we’re very hopeful in what we’re seeing thus far and how it can solve that going forward.
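The difference between a static threshold and a baseline-aware check can be shown with a small Python sketch. This is not Kapacitor’s API and not the model Coupa built; it is a generic illustration in which an alert fires only when a value strays more than a few standard deviations from the host’s own recent history. The sample values and the 3-sigma default are assumptions.

```python
from statistics import mean, stdev


def static_alert(value, threshold):
    """Classic monitor: fire whenever the value crosses a fixed line."""
    return value > threshold


def baseline_alert(history, value, sigmas=3.0):
    """Fire only when the value is unusual relative to recent history."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return value != mu
    return abs(value - mu) > sigmas * sd


# A host that normally runs hot: CPU around 80%, fixed threshold at 75%.
busy_host = [80, 82, 79, 81, 80, 83]
print(static_alert(82, 75))            # fires on every normal sample
print(baseline_alert(busy_host, 82))   # quiet: 82 is normal for this host
print(baseline_alert(busy_host, 97))   # fires: well outside the usual band
```

The static check pages on a host that is merely behaving as it always does, which is the "far too many alerts" problem described above, while the baseline check stays quiet until behavior actually changes.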
Hans Gustavson 29:09.635 So, looking forward, we really want to get into predictive analysis. And what we mean by that is, since we’re sending all this data, all these metrics into a single area, we’d like to be able to go through that and observe different patterns and be able to predict, to forecast out, that there’s going to be an issue with a customer or environment next week. In fact, we’ve worked with a startup at this point to look at using neural networks to do some machine learning on that data, so we’re exploring that still. We want to trigger automation based upon the metrics, so in our environment we auto scale. So, depending upon a variety of metrics that we evaluate, we may increase and scale out the platform and then scale it back. And so, that’s something that we’re working on actively right now, to leverage Kapacitor to monitor that and then trigger our automation to perform those actions. And then in addition, perform auto remediation. So if there’s a certain type of event it sees, again it would trigger some external automation to go and address that issue. We are evaluating Chronograf. In fact, I’ve spoken to some of the folks at Influx and we are excited about the road map and vision they have with Chronograf. It would be great if we could go into Chronograf, the UI, and be able to write a query, create a visualization, set alert thresholds, and then point it at a notification channel, as opposed to writing raw TICKscripts ourselves. This would allow us to be more self-service and include a wider audience to be able to come into the tool and do this, so that’s something that we’re looking at. Additionally, we’re looking to enrich our metrics and dashboards through annotations and markers. So, we are updating our services on a daily basis. As I’m sure many of you are aware, when you introduce change, there’s always some sort of effect. We want to be able to see those different types of changes in our metrics, in our dashboards, and be able to see the effect or change it may have had on that, and then again trigger notifications or take some action.
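As a simple stand-in for the forecasting idea described above, here is a Python sketch that fits a straight line to recent samples and estimates when a metric would cross a limit. The talk mentions exploring neural networks; ordinary least squares is used here only because it is the smallest model that shows the principle. The disk-usage numbers and function names are invented for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def time_until(threshold, xs, ys):
    """Units of x from the last sample until the trend hits the threshold."""
    slope, intercept = linear_fit(xs, ys)
    if slope <= 0:
        return None  # flat or falling trend never reaches the threshold
    return (threshold - intercept) / slope - xs[-1]


# Hypothetical disk usage (%) on five consecutive days.
days = [0, 1, 2, 3, 4]
usage = [50, 52, 54, 56, 58]
print(time_until(80, days, usage))
```

A forecast like this could feed the same automation hooks discussed in the talk, for example scaling out before the limit is reached rather than after an alert fires.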
Hans Gustavson 31:41.614 One of the other areas that we are actively working on as well is conditional alert routing based on host state. So, right now our Kapacitor monitors run against a host regardless of whether it’s maybe in maintenance mode, or it’s just being provisioned but it’s not truly operational. And so our objective is to introduce use of tags for the state of a host. And so, when the monitor evaluates some metric, it will look at that state and then determine where to send that. Does it send it to an on-call SRE engineer, or does it just send it to a database where somebody who might be doing a maintenance would look and see it, but not wake somebody up because it’s not really actionable? And then traceability. One of the things that we use some of our other tools for is traceability. If I see a certain event happen in the system, being able to follow that through the various other platforms and services. We do have microservices within Coupa. Being able to do that would be something that we would see as very valuable. Not quite sure how we’d solve it with the metrics, but it’s something again that we’re going to look at and see what’s out there.
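The routing rule described above amounts to a small decision function keyed on a host-state tag. The Python sketch below is an assumption-laden illustration, not Coupa’s implementation: the tag name `state`, the state values, and the destination names are all hypothetical.

```python
# Hypothetical host states that should never page a person.
NON_OPERATIONAL = {"maintenance", "provisioning"}


def route_alert(tags):
    """Pick a destination for an alert from the host's tags."""
    state = tags.get("state", "in_service")  # assume untagged hosts are live
    if state in NON_OPERATIONAL:
        return "log_only"      # stored for whoever is doing the maintenance
    return "page_oncall"       # actionable: notify the on-call SRE


print(route_alert({"host": "web-01", "state": "maintenance"}))
print(route_alert({"host": "web-02", "state": "in_service"}))
```

Defaulting untagged hosts to the paging path is a deliberate choice in this sketch: a missing tag then errs toward a noisy alert rather than a silent outage.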
Hans Gustavson 33:03.990 And then lastly, we want to be able to use these metrics to enhance our status dashboards and other views that we provide to customers. Currently all the metrics that we’re capturing and leveraging are for Coupa specific, Coupa internal employees. But eventually we’re looking to leverage this and share it with our customers.