Monitoring a High Scale Bidder-As-A-Service (BAAS) on the Cloud
There is a tremendous amount of money being poured into AdTech by marketers around the world. So it is becoming increasingly important for marketers to be able to see their spend in real time. In this webinar, Ram Kumar Rengaswamy, Co-founder and CTO of Beeswax, talks about how they use InfluxData to collect metrics and events to support the 1 million queries/second performance of their AdTech platform.
Watch the Webinar
Watch the webinar “Monitoring a high scale Bidder-As-A-Service (BAAS) on the cloud” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Monitoring a high scale Bidder-As-A-Service (BAAS) on the cloud.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Ram Kumar Rengaswamy: Co-founder and CTO, Beeswax
Chris Churilo 00:00:00.754 All right. Thanks again for joining us. And if you do have any questions for Ram at any point in his presentation today, please be sure to go ahead and put your question in the Q&A or the chat panel and we’ll get those answered definitely before the end of the presentation today. And the presentation is being recorded so you can take another listen to that and we’ll upload it to our website at influxdata.com/_resources. And yeah, I think we’re good. So Ram, what I’ll do is go ahead and let you take it away.
Ram Kumar Rengaswamy 00:00:39.102 Sure, thank you very much. So today, I’m going to be talking about Beeswax and how we’ve been using InfluxData and InfluxDB as it was previously known in our systems for quite some time now. So before I get started, just a quick intro about us. We’re a small company that started about three years ago, founded by myself and two of my colleagues from Google. We’re primarily based out of New York and also happen to have an office in London, and are looking to open an office in San Francisco soon, venture-funded with a Series A last year. And we now have a few big customers that are using our system.
Ram Kumar Rengaswamy 00:01:28.338 So very quickly, not sure how familiar the audience is with real-time bidding, but that’s basically what we do. And so I thought I’d spend a couple of minutes providing a quick overview of real-time bidding. So, what happens today is that most of the advertising is transacted on this really large marketplace that’s called exchanges. Exchanges bring together buyers and sellers. The sellers are the publishers, and the buyers are the advertisers. So whenever someone visits a website or uses a mobile app, the site or the app will send an ad request to the exchange, which will contain some information about the context where the user is, including their user ID. And the exchange will collect this information and broadcast that over the internet in the form of what’s called a BidRequest.
Ram Kumar Rengaswamy 00:02:22.704 And there are entities called bidders that are listening for these BidRequests over the internet. And a bidder is basically a piece of software that’s deployed on behalf of an advertiser in order to optimize their campaigns. So the bidder receives the BidRequest, comes up with a bid price, and the actual creative of the ad that needs to be shown and returns that information back to the exchange. The exchange waits until it receives the BidResponse from all the bidders that it sent the BidRequest to, runs a second price auction, picks a winner, and the winner’s ad is shown to the user. So this whole process takes about roughly 200 milliseconds, and the industry term for the process that I just described is real-time bidding. And as you can imagine, a system like this where the bidder is listening to the BidRequest from many, many exchanges is pretty much listening to the web-browsing traffic of the internet.
Ram Kumar Rengaswamy 00:03:25.627 And so it’s a very high-scale, high-performance system. High-scale because the QPS that’s coming into the system, queries per second, is of the order of millions. And high-performance because this whole process takes only 200 milliseconds. And in reality, the timeout that the exchange sets for the bidder is close to a hundred milliseconds, which includes the round trip over the internet. And so really, the bidder is looking to process most of its requests within 20 milliseconds. So it’s a very high-performance, very high-scale system. Also, it’s something that’s deployed not just in one region in the world, but it’s a globally deployed system because you can have people browsing and wanting to see ads in Europe, and Asia Pacific, North America, everywhere where you can imagine. So, that’s a global system, high performance, high scale.
Ram Kumar Rengaswamy 00:04:23.622 And so it turns out that today, if you’re an advertiser, you have a couple of options if you’re trying to use a bidder in your advertising strategy. You could use what’s called a traditional DSP. And companies like MediaMath, Google, AppNexus, Trade Desk offer traditional DSPs, where it’s basically a UI that you get access to where you can set up campaigns, etc. and then start sort of buying ads on those platforms. But the issues with them are that they are opaque, because you don’t get full transparency into what’s going on, where you’re actually buying your ads. It’s kind of restrictive because these platforms are designed to support a large variety of marketers, and so there are very limited opportunities for customization. And so if you feel that these are too limiting, then the only other option that you have is to build your own bidder. [silence] And the concern is, even after you spend all this money and effort to build a bidder, whether it’s actually going to work, because there’s a lot of technology and expertise that is required to build these systems, and some of the marketers might not have that. And so basically, the options that were available to our customer were basically these two. You could use a traditional out-of-the-box DSP, or you could build your own bidder.
Ram Kumar Rengaswamy 00:06:04.041 With Beeswax now, we have a third option where the bidder is provided as a fully managed cloud-based service built and deployed on top of AWS. We have done the hard part of pre-building all the supply integrations and all the other data integrations that are necessary for a bidder to be successful. And what a customer gets is a fully functional platform on Day Zero on their own UI. But unlike a traditional DSP, there are opportunities for infinite customization. We have a couple of real-time APIs that let a customer customize the algorithm that they use, for example, in order to compute the price of an ad, and furthermore, we also provide opportunities for our customers to be able to augment the incoming request with their own data. So with these two real-time features being a core part of the Beeswax platform, all of a sudden, there’s a third alternative as far as the advertisers are concerned, where they could basically roll out their own DSP, so to speak, on top of the Beeswax platform.
Ram Kumar Rengaswamy 00:07:21.558 So this is sort of an example of some of the things that our customer might get when they sign on to the Beeswax platform. There’s a full-featured UI that comes with all the campaign management that’s necessary for an advertiser to basically use the platform, along with the graphs and charts that sort of give you an insight into how your campaigns are performing. And all of this is controllable via a REST API as well, so it’s not necessary for our customer to use our UI. They could actually use the REST API on top of this. So this is sort of an introduction to what Beeswax does, and that it operates in this space where we’re basically, from an engineering perspective, building very high-scale, high-performance distributed systems that are globally deployed. So with this in mind, one of the key problems that we solve for our customers is the fact that we actually operate the systems for them on their behalf. And this is where InfluxDB plays a critical role in the overall system architecture of Beeswax. And I’m going to now sort of touch upon that next.
Ram Kumar Rengaswamy 00:08:53.182 So the global deployment of InfluxDB and the Beeswax stack roughly looks like this, as the diagram shown here. So as I mentioned, we are a global service, so we’re fully deployed in multi—and we’re fully deployed on AWS. So, that’s another thing that I want to make very clear: that our entire stack is built and optimized for AWS. So we are deployed, for example, in three different regions: US East, which is in Virginia, US West in Oregon, and Europe, which is in Ireland. And in each one of these three AWS regions, we have a deployment of InfluxDB Enterprise. We chose InfluxDB Enterprise mainly because of the high availability features that it provides. Because the metrics that are stored in InfluxDB are pretty critical for us in order to be able to ensure that our system is working correctly. So for that reason, we decided early on that high availability is something that we do not want to compromise on. And therefore, we have Enterprise versions of the InfluxDB cluster deployed in these three different regions.
Ram Kumar Rengaswamy 00:10:24.597 Now one of the interesting aspects I think about our design is that since we are so paranoid about not losing any of the metrics data, [inaudible] Kapacitor, which is also provided by Influx, that copies metrics as they are streamed into different regions into a centralized InfluxDB deployment in one of the regions, so that we have not just replication within the region, but also cross-region replication. And in effect, we end up having basically two copies of our metrics data in two different regions in the world. Furthermore, when it comes to all our alerting, etc., all of that is also driven on top of the data that’s stored in InfluxDB. So we have a deployment of Kapacitor that takes all the metrics that are present in a region, and we have a bunch of TICKscripts that have been written to alert on various conditions. And then that ties into PagerDuty, and that alerts the person on call to know that there’s something wrong with our production systems.
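The cross-region copy Ram describes can be expressed as a Kapacitor stream task that forwards every point it sees to a second InfluxDB connection. The sketch below is a hypothetical illustration of that idea, not Beeswax’s actual task: the `central` cluster name, `telegraf` database, and retention policy are invented, and it assumes Kapacitor’s config defines a correspondingly named `[[influxdb]]` connection pointing at the central cluster.

```python
# Hypothetical sketch of the kind of TICKscript a cross-region relay task
# might use. All names ('central', 'telegraf', 'autogen') are invented;
# Kapacitor would need a matching named [[influxdb]] connection in its config.

RELAY_TEMPLATE = """\
stream
    |from()
        .database('{database}')
        .retentionPolicy('{rp}')
    |influxDBOut()
        .cluster('{remote_cluster}')
        .database('{database}')
        .retentionPolicy('{rp}')
"""

def render_relay_task(database: str, rp: str = "autogen",
                      remote_cluster: str = "central") -> str:
    """Render a task that copies every point written to `database`
    into the same database on the remote (central) connection."""
    return RELAY_TEMPLATE.format(database=database, rp=rp,
                                 remote_cluster=remote_cluster)

if __name__ == "__main__":
    print(render_relay_task("telegraf"))
```

Because such a task only reads from the local stream and writes to the remote connection, it is stateless, which is what makes the load-balanced high-availability option discussed later in the Q&A plausible.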
Ram Kumar Rengaswamy 00:11:51.775 And for the visualization of all the data that’s present in our system, we rely on Grafana. So Grafana is sort of like this common frontend that’s used to view all the charts and the graphs. And for performance reasons, we have deployments of Grafana in every region where we have Influx so that Grafana to Influx is a low latency link. And thereby, we have now eyes into sort of like the entire global infrastructure. So at this point, if there are any questions about the nature of our deployment or why we choose this architecture, please feel free to interrupt me and then ask me questions.
Ram Kumar Rengaswamy 00:12:44.216 All right. So this is some of the high level—how this system is deployed globally. So now moving on to—in addition to sort of performance monitoring of our systems, the other use case that we have for InfluxDB is that we would like our customers, since we are a platform ourselves, we’d like our customers to be able to have access to the exact same metrics that we’re looking at. And so we actually show to our customers the health of our system in terms of the QPS coming in, the queries per second that are coming into the system, broken down by different dimensions, like exchange. As I mentioned previously, in our platform, we allow our customers to set up their own software for doing bid price calculations, etc. And on our side, we monitor to make sure that our customers’ systems are up and healthy and performing well. And so we have charts and graphs and metrics that capture the health of not just our systems, but also our customers’ systems. And we would like our customers to be able to access all the data that’s relevant to them in our UI.
Ram Kumar Rengaswamy 00:14:15.015 And so one of the things that we offer as a part of our product is a monitoring dashboard that’s built into our UI, which is really sort of using Grafana under the hood and the metrics which come from InfluxDB. And this is a screenshot of what that thing would look like for a typical customer where they could see and query and look at the different aspects of how their installation of the bidder is performing on our platform. And so the point that I wanted to emphasize here was that not only are we using the Influx metrics to ensure that our systems are working correctly, we also offer a feature where the customers can look at how their software is performing on top of our system. And so we also in a sense are collecting metrics on behalf of our customers and exposing that to them through our UI. So with that sort of overview of the feature, now I want to just get into some of the things that we had to do in order to implement that on top of Influx and Grafana.
Ram Kumar Rengaswamy 00:15:36.083 So the first thing that—obviously, one of the first things that got us attracted to Influx in the first place was that it had this data model that was built around databases and that there was some sort of access control for each database. And so we leveraged that: each Beeswax customer is given a separate database within Influx. We provision a new database user for each customer. And that user is only given read and write access to the database that we’ve set up for that customer. Now as data comes into our system, we have set up continuous queries that are able to filter out metrics based on the customer and then write that into the customer-specific database. And then with the combination of these three aspects of Influx, what we’re able to now do is set up a data source in Grafana for each customer.
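A minimal sketch of that provisioning flow might look like the following InfluxQL generator. The naming scheme, the `customer` tag, the 1-minute downsampling interval, and the `metrics` source database are all invented for illustration; the real continuous queries would filter on whatever tag identifies the customer in Beeswax’s schema.

```python
# Hypothetical per-customer provisioning: a database, a user restricted to
# that database, and a continuous query that filters the shared metrics
# stream by a `customer` tag. All names here are invented examples.
def provision_statements(customer: str, password: str,
                         source_db: str = "metrics") -> list:
    db = f"{customer}_db"
    user = f"{customer}_user"
    return [
        f'CREATE DATABASE "{db}"',
        f"CREATE USER \"{user}\" WITH PASSWORD '{password}'",
        # Read and write access to this customer's database only.
        f'GRANT ALL ON "{db}" TO "{user}"',
        # CQs require an aggregate and GROUP BY time(); INTO may target
        # another database, which is what routes data per customer.
        (f'CREATE CONTINUOUS QUERY "cq_{customer}" ON "{source_db}" BEGIN '
         f'SELECT mean(*) INTO "{db}"."autogen".:MEASUREMENT '
         f'FROM "{source_db}"."autogen"./.*/ '
         f"WHERE \"customer\" = '{customer}' "
         f'GROUP BY time(1m), * END'),
    ]

if __name__ == "__main__":
    for stmt in provision_statements("acme", "s3cret"):
        print(stmt)
```

Each statement would be run once against the cluster when a customer is onboarded; from then on the CQ keeps the customer’s database populated automatically.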
Ram Kumar Rengaswamy 00:16:55.075 And once we have that, then it’s possible for us to now rely on the Auth and ACL model that exists in Grafana where once again, every customer in Grafana is sort of working through their own organization that only has access to their own data source. And with the combination of all of these, it’s now possible for us to ensure in a very secure manner that a customer has only access to their own data, all the way from the metrics that the site on this, on Influx, to the view that they get on the dashboard. And honestly, there’s a lot of engineering effort that’s required to build a system like this. And by leveraging some of these writes or the features in both of these products, we were able to just pull this off without having to do the subsets. Okay, so this sort of hopefully gave folks a sense of how we are relying on some of the features, the auth and the ACL features of Influx and Grafana.
Ram Kumar Rengaswamy 00:18:07.001 And finally, I wanted to spend some time talking about how we have optimized the InfluxDB deployment for AWS. So one of the first things we did was that we deployed InfluxDB on EC2 autoscale groups. And the main reason for using an autoscale group is that it allows us to flexibly scale out the service in case we’re running short on resources like memory or disk space, etc. One point that I want to highlight is that even though this is an autoscale group, we don’t necessarily autoscale yet, because we do want a human in the loop as far as adding nodes to the cluster is concerned, but there is honestly no reason why this cannot be fully automated. It’s just that we feel a little paranoid about this part of the stack scaling automatically, but it’s possible. The reads and the writes into this autoscale group on which InfluxDB runs are fronted by an Amazon load balancer.
Ram Kumar Rengaswamy 00:19:26.550 In addition to providing the load balancing of all the read and write requests, the additional feature that this has is that the load balancer sort of acts almost like a meta-monitoring system for Influx, because you kind of need something that monitors Influx and makes sure that that’s up. And so the ELB health checks are triggered off in case something is wrong with Influx; if Influx is down, then the health checks fail, and they’ll raise an alert, and someone who’s on call will then dive in to investigate as to what’s going on with the health of the underlying InfluxDB cluster. So, that’s sort of how the system is actually deployed on EC2. Now by optimizing for AWS, we were also able to build a few tools that allow our Beeswax employees to access InfluxDB without having to [inaudible] into individual machines.
Ram Kumar Rengaswamy 00:20:28.386 So Amazon provides this service called AWS SSM, which is sort of like an agent that runs on the same machines as the InfluxDB machines. And we’ve written some sort of logic on top of SSM that allows an engineer to issue commands to InfluxDB from their work station from their environment, which are securely transferred using the AWS APIs and then securely executed on the actual instances. And the primary reason for doing this was that we wanted to centralize all the access controls in our system using this Amazon service called IAM, which is for Identity and I think Authentication Management. And by centralizing the access to Influx, also we add IAM. It means that if someone were to leave the company or in case we want to revoke someone’s access or add someone’s access, there’s a centralized place where we could go ahead and enable and disable these features.
Ram Kumar Rengaswamy 00:21:44.643 So by basically leveraging SSM, we were able to enforce access control using IAM, which I think is pretty cool. And then Kapacitor itself is also apparently deployed on an autoscale group. The main reason again was that we felt that our Kapacitor instance could end up eating a lot of CPU and so we wanted to therefore run it natively on a VM, but there’s no reason why it cannot run in a container for example. And then as you guys may be aware, that in order to set up alerts on Kapacitor, you have to write these TICKscripts. TICK is the main specific language invented by InfluxDB. It’s a very powerful language in which you can describe the way you’re selecting rules and conditions. It turns out that engineers don’t like learning new languages, and so we ended up writing a tool in Python that helps generate these tick scripts.
Ram Kumar Rengaswamy 00:22:54.262 So we have a few templatized sort of TICKscripts that we deploy [silence] that have actually been auto generated, then they’re actually installed on the Kapacitor instance through this AWS code deploy mechanism. And again, we rely on AWS for the delivery of all our code and commands, for the same reasons that we want to be able to make sure that all the access control is centralized. We have the AWS IAM system. So in a sense, I guess what I wanted to get across was that we’ve done a bunch of things in addition to just using Influx on the native viewers. We’ve written a few tools that really ensure that it’s possible to operate Influx in a scalable manner and also be able to run it in a very secure manner on top of AWS. And I guess that was my last slide. And at this point, I’d be happy to answer questions.
Chris Churilo 00:24:11.611 Cool. Thanks. I think it was important that you gave us that overview of your company because I definitely appreciate that you really do need to make sure that you keep that system always available with all the bidders, etc. And also, I really appreciate the detail that you went into on your InfluxDB implementation. So we do have a question from Manoj and he asks: “How do you test TICK generated from your Python tool before deploying?”
Ram Kumar Rengaswamy 00:24:46.197 That’s a very, very, very good question. So like I said, the TICKscripts themselves are templatized. Which means that, for example, one of the templates is that a common [inaudible] template [silence] replaced. So in our source code, we have scripts that can replay a data file into a local Kapacitor instance that can then validate whether the condition on which this alert should trigger is indeed happening. And then what we do is that instead of calling out to PagerDuty, we call out to a local HTTP server that runs as a part of the test suite. And then we make sure in our test framework that we assert that this endpoint was actually triggered. So, that’s sort of our black-box way of making sure that this alert actually will work when we deploy it to production. So it’s a combination of using Kapacitor replays and then having a mock HTTP server that receives the notifications whenever the alert fires.
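The mock-endpoint half of that harness can be sketched in a few lines of standard-library Python: a throwaway HTTP server stands in for PagerDuty, and the test asserts that a notification actually arrived. In the real harness the POST would come from `kapacitor replay` driving an alert’s HTTP handler; here a direct POST stands in for it, and the payload shape is invented.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical black-box harness: a mock endpoint records every alert
# notification it receives so the test can assert the alert fired.
received = []

class MockAlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep test output quiet
        pass

def start_mock_server() -> HTTPServer:
    """Start the mock endpoint on an OS-assigned port, in a daemon thread."""
    server = HTTPServer(("127.0.0.1", 0), MockAlertHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def post_alert(port: int, payload: dict) -> None:
    """Stand-in for the POST that a replayed Kapacitor alert would make."""
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/alert",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req).read()

if __name__ == "__main__":
    srv = start_mock_server()
    post_alert(srv.server_port, {"id": "qps_low", "level": "CRITICAL"})
    assert received and received[0]["id"] == "qps_low"
    srv.shutdown()
```

Swapping the PagerDuty handler for this local URL in the rendered TICKscript is what turns the production alert into a deterministically testable one.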
Chris Churilo 00:26:16.724 Manoj, did that answer your question? Hopefully so. Okay. We have another question from Satorius: “So how do you handle data loss between the Kapacitors?”
Ram Kumar Rengaswamy 00:26:33.407 So at this point, it’s not that one Kapacitor is relaying to another Kapacitor. Our architecture is that we copy data from one InfluxDB installation in one region to another—a centralized location in another region. It is true that you could lose some data in case Kapacitor is down. So to be honest, at this point in time, it’s not something that we have investigated too deeply, because the whole replication exists in order to increase the redundancy of the metrics data. But I guess one of the ways it could be done—we’ve been thinking about this—is you have, for example, a high availability Kapacitor deployment sitting behind, again, maybe some kind of a load balancer, because the Kapacitor nodes that we have are essentially acting as relays, really just reading the data from one InfluxDB location to another. So they’re stateless in that sense. So, that’s one way that we’ve been thinking about. And maybe there are other things in the Kapacitor [inaudible] somehow shard Kapacitors the way InfluxDB itself gets sharded. So at this point in time, we’re not doing anything to ensure reliable transfer of data across regions, but the simplest way to do that, I think, would just be to load balance. That’s sort of like something that we’ve been thinking about.
Chris Churilo 00:28:28.444 Well, cool. Hopefully, that answered your question. I have another question from Manoj and he asked, “How do you handle InfluxDB backups?”
Ram Kumar Rengaswamy 00:28:41.028 So at this point in time, we rely on basically having many, many copies of the data geographically replicated. But again, one of the things that we sort of plan around is that with the Enterprise version, you can get this tool for generating backups. And so a simple thing that we have been thinking about doing, which we have not done yet, is to generate these backups periodically, maybe on a daily schedule, and then just throw them into S3 and then expire them off over time.
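That daily-backup idea could be sketched roughly as follows. The bucket layout and retention approach are invented, and note the hedge: `influxd backup -portable` is the open-source tool, while an Enterprise cluster would use `influxd-ctl backup` instead.

```python
import datetime

# Hypothetical sketch of a daily backup-to-S3 job: build the backup command,
# then compute a dated S3 key so old backups can be expired by an S3
# lifecycle rule. Bucket and path names are invented.

def backup_command(database: str, dest: str) -> list:
    """Command line for a portable backup of one database (OSS tooling;
    Enterprise clusters would use `influxd-ctl backup`)."""
    return ["influxd", "backup", "-portable", "-database", database, dest]

def s3_key(database: str, day: datetime.date) -> str:
    """Dated key, e.g. influx-backups/metrics/2018-05-01.tar.gz."""
    return f"influx-backups/{database}/{day.isoformat()}.tar.gz"

if __name__ == "__main__":
    today = datetime.date.today()
    print(" ".join(backup_command("metrics", "/tmp/backup")))
    print(s3_key("metrics", today))
```

A cron entry (or a scheduled SSM command, to stay inside the IAM-centralized model described earlier) would run the backup, tar the output directory, and upload it under the dated key.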
Chris Churilo 00:29:20.692 Okay. We have a question from Sunil. Let’s see. Okay. So he asks, “My InfluxDB is throwing timeout exceptions after every 10 to 15 days and we need to delete data for measurements to resolve it. It’s happening for the last three to four months. What can the issue be? And do I need to reduce the number of fields or tags?”
Ram Kumar Rengaswamy 00:29:42.637 All right. Without actually looking at the deployment or the version and the number of measurements, it’s hard to answer that question. Before, I think, 1.3, there was a problem wherein the amount of memory that the Influx server would consume would grow with high cardinality for tags. So imagine if you have these measurements where one of the tag dimensions is the IP address or the container ID or something like that, and if there’s a lot of churn in the IP addresses or the container IDs, then those measurements would end up consuming a lot of memory, which would be one of the reasons why you might have to delete data, I’m guessing.
Ram Kumar Rengaswamy 00:30:38.559 So since 1.3, that has been resolved, because the index is now persisted to disk. And so as a result of that, there is far more room to grow and add these high cardinality tag dimensions to your measurements. So I guess the first thing is—I’m not sure if the timeout is occurring because the Influx node is crashing or is it some other reason, because we have not seen a performance degradation as we had more data. The only thing that we observed was that as we added more and more data into the cluster, it started eating up a lot of memory, and at some point, it just ran out of memory and then crashed. But all that has been resolved since 1.3. So again, it depends on why the—in some cases, it might not even be Influx, it could just be the network.
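To make the cardinality point concrete: the worst-case series count for a measurement is the product of the number of distinct values of each tag, so a single churning tag like a container ID multiplies everything else. A toy illustration (tag names and values invented):

```python
# Toy illustration of why high-cardinality tags hurt: worst-case series
# count is the product of the distinct values of each tag.

def worst_case_series(tag_values: dict) -> int:
    """Upper bound on the number of series a measurement can create."""
    n = 1
    for values in tag_values.values():
        n *= len(set(values))
    return n

# Stable tags: a handful of series.
stable = worst_case_series({"region": ["us-east", "us-west", "eu"],
                            "service": ["bidder", "ui"]})

# Add a churning container-ID tag: every ID ever seen multiplies the count.
churny = worst_case_series({"region": ["us-east", "us-west", "eu"],
                            "service": ["bidder", "ui"],
                            "container_id": [f"c{i}" for i in range(1000)]})
```

With the in-memory index, every one of those series occupied RAM for the life of the shard, which is why churn in IP or container-ID tags could exhaust memory before the disk-based (TSI) index arrived.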
Chris Churilo 00:31:39.833 Well, that’s a first. Do you want a job with InfluxData? We’ve never had a customer that answered one of our support questions. But—
Ram Kumar Rengaswamy 00:31:49.960 I’m sorry, all—
Chris Churilo 00:31:50.926 No, it’s—
Ram Kumar Rengaswamy 00:31:51.174 —these things that we go through ourselves, so that’s why I just wanted to—
Chris Churilo 00:31:54.471 No, it’s brilliant. It’s absolutely brilliant. So yes Sunil, I would give you the same answer: yeah, it could be, but we really want to—if it isn’t 1.3 that resolves it, then we probably need to dig in a little bit more into what your schema looks like. So we have got another question, we actually have a couple more questions. Satorius asks: “What is the type of the EC2 instance and what is the throughput?”
Ram Kumar Rengaswamy 00:32:21.153 Very good. As I was describing previously, we are memory bound, and so we use the R4 category of machines, because on AWS, that instance type is the cheapest per gigabyte in terms of memory. And I believe we are using R4 8XL nodes, simply because we are memory bound and we just throw a lot of measurements into the cluster—we monitor every aspect of the system honestly—tons of metrics that land in the cluster. And so it’s R4 8XL, and many of those in every region. And what was the second part, as to what is the throughput?
Chris Churilo 00:33:12.877 Yep, throughput.
Ram Kumar Rengaswamy 00:33:14.980 So again, I guess I should have described another part of the systems. So one of the things that we do is that we have our own sort of [silence] into the server, it’s probably around 10k, but not more than that. That’s around roughly where we are in terms of writes. And then reads are primarily driven by customers and users trying to visualize the metrics in Grafana. I don’t think it’s that high.
Chris Churilo 00:34:07.061 We have one more question. Manoj, can you retype it in there? For some reason, it got cleared out. I think Manoj here was the last question that I had. Okay, here we go. All right, here it is. Oh yeah. So Manoj asked, ”Are you using Kapacitor topic handlers—using that functionality of the topic handlers?”
Ram Kumar Rengaswamy 00:34:30.877 I could definitely just sit here and nod, but I don’t know what that feature is. So I’ll say confidently that we’re not. So I mean, this is something that I will have to defer to you on, because I don’t even know what the feature is.
Chris Churilo 00:34:46.396 It’s basically a more efficient way of setting up the alerts. But what we’ll do is I’ll point you to the description of how to use it in our documentation.
Ram Kumar Rengaswamy 00:34:59.276 Yeah.
Chris Churilo 00:35:01.146 All right. Wow, nice set of questions. We’ll stick around just a couple more minutes if you guys have any other questions. I finally figured out why I was having a hard time with getting my links to you guys. So hopefully, that’s going to work there. It looks like I was only sending it to Ram for some reason. So now hopefully, everybody gets that link to the community site. And if you have any other questions, please go ahead and put them into the chat or the Q&A. And we will be posting this recording later on so you can take another listen to it, and you can find it at influxdata.com/_resources. And like I said, we’ll just keep the lines open a little bit. If you do have a question that you come up with later on, just shoot me an email and I’ll be sure to forward it to Ram for him to answer.
Ram Kumar Rengaswamy 00:36:00.973 Absolutely.
Chris Churilo 00:36:05.381 My one question is, did you—I think you and I talked about this—but how did you actually start with your monitoring before InfluxDB?
Ram Kumar Rengaswamy 00:36:17.249 Oh my God. This is going to go back in history to 2014. So as I said, I used to work at Google before this, and I kind of realized the value of having really robust monitoring systems. And so when we started, we—I mean, the obvious choice for us to begin with was Graphite, but we always found the data model in Graphite to be very limiting. And we started looking around, and honestly, I just came across it because this was another company that was initially built out of New York, I believe, and so we just sort of accidentally stumbled on it. And then we started poking around and looking at the different features that it offered, especially the fact that you can query this data and you have this notion of databases and such. And it was so much better in terms of what existed out there and [silence] it’s far better than anything else that’s out there, so we just decided to—so, that’s basically how we started. And then we’ve sort of been growing with Influx. I mean, I think at the point when Influx announced the Enterprise version, we were super-excited and jumped on it, because it was around the time we as a company found the need to also have a certain sense of reliability on the Influx collection and storage part of our system.
Chris Churilo 00:37:53.521 So one question just sneaked in. So Manoj asked: “Have you used any UDFs for Kapacitor?”
Ram Kumar Rengaswamy 00:38:01.579 No, not at this point, no. We’ve been basically just writing the TICKscripts.
Chris Churilo 00:38:13.641 All right. So I think that’s the end of the questions. Fantastic overview. I think everybody on the call appreciates the details that you went into. And I really appreciate you spending time on this. I will post the recording. And as I mentioned, if you have any other questions, I will pass them on to Ram. And with that, I think we can conclude—And Manoj says, ”Thank you very much Ram, that was very helpful.”
Ram Kumar Rengaswamy 00:38:41.417 Absolutely, love to help.
Chris Churilo 00:38:43.429 Cool. Thank you so much everybody. And Satorius also says thanks. Well, it looks like a lot of people are giving you thanks and kudos. So thanks again Ram.
Ram Kumar Rengaswamy 00:38:54.582 Yeah, absolutely. Thank you very much.
Chris Churilo 00:38:56.349 All right, goodbye everybody.
Ram Kumar Rengaswamy 00:38:57.901 Bye bye.