How Houghton Mifflin Harcourt Gets Real-time Views into their AWS Spend with InfluxData
In this webinar, Robert Allen, Director of Systems and Software Architect at Houghton Mifflin Harcourt, will provide an overview of how they use InfluxData. Their use cases span from standard DevOps monitoring, to gathering and tracking KPIs to measure their online educational business, to gaining real-time visibility into their AWS spend that covers several accounts across multiple business units and many developers.
Watch the Webinar
Watch the webinar “How Houghton Mifflin Harcourt gets real-time views into their AWS spend with InfluxData” by clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “How Houghton Mifflin Harcourt gets real-time views into their AWS spend with InfluxData.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers: Chris Churilo: Director Product Marketing, InfluxData Robert Allen: Director of Systems and Software Architect, Houghton Mifflin Harcourt
Chris Churilo 00:00:00.000 And it is three minutes after the hour, and so there will still be a few more people that are going to be joining us, but we’ll go ahead and get started. So, once again, welcome everybody to our Tuesday webinar. And we’re really excited today to have one of our wonderful customers, Houghton Mifflin Harcourt, Robert Allen, in particular, presenting their experience with InfluxData. And I want to make sure that everyone knows, please put your questions in the chat or the Q&A panel, and I’m probably repeating myself here, but I do want to make sure that you guys get all of your questions answered. And also know that we are recording this session. So we’ll post this on our website, so you can take another listen to this at your convenience. So with that, Robert, I will turn the ball over to you.
Robert Allen 00:00:50.123 Good. Thank you, Chris. Hi, welcome. My name is Robert Allen, I am a Director of Engineering for Houghton Mifflin Harcourt. So some of you may be wondering, who’s Houghton Mifflin Harcourt? We are a nearly 200-year-old publishing company. We primarily focus on K-12 education, both learning materials, as well as assessment materials, and we publish a lot of books. Some of the books that you may be familiar with are Lord of the Rings, Tolkien’s works, Life of Pi, etc. So in our agenda today, we’ll talk about the reasons we chose InfluxDB over many of the other Time Series Databases and other forms of storage. And also, talk a little bit about what InfluxDB has brought to our engineering efforts at Houghton Mifflin Harcourt, as well as how we use the TICK Stack. And then we’ll get into how we ingest, monitor, and alert on AWS programmatic billing access. The title says real time; the reality is it’s more near-real-time. It’s only as real time as the estimated billings published each day, which is roughly, I believe, it’s midnight, yeah, midnight UTC every day.
Robert Allen 00:02:30.472 So why InfluxDB? Well, we tried a lot of different Time Series Databases to start with. Our first attempt was sort of reactive. We started building this project that I’m a director of under a lot of pressure, and we needed some way of collecting and storing metrics. We needed it quick. We had started with Graphite, and then we’d tried Prometheus. And while each one was an improvement over the last, it didn’t quite fit our long-term needs. And ultimately, what we were looking for was the ability to persist the data for long periods of time. We wanted to have more control over our downsampling. We didn’t want to sacrifice any performance for the actual downsample resolution of the data. And we wanted an easy way for developers to be able to start using that data. And more importantly, because our team is very small and will likely be that way for a long time, we wanted minimal operational overhead. We didn’t want to have to deal with a lot of operational manual tasks and activities. And we really wanted as much detail in the dimensions of the data, using tagging, or labels if you’re from the Prometheus world, to actually give us a broad but also very narrow insight into our data. And on all of these things, Influx has really fit the bill.
Robert Allen 00:04:13.336 I mentioned the SQL-like language, which is love it or hate it. SQL has been around for a very long time. It’s ubiquitous pretty much across all engineering efforts, which makes it an easy step-in for engineers to adopt and learn, and feel comfortable with. Operationally, it’s been very, very nice to work with. And now that we’re running Enterprise, there are some features in that that also make it very nice to work with for applications, etc. But even prior to Enterprise, it was still very friendly from the operational standpoint. Tagging is robust, and with the current version 1.3, the new storage engine really has improved a lot of the performance issues around cardinality and sort of expanded how much cardinality you can actually work with in the database. And then the other tools, everything related to capturing that data (Telegraf, Chronograf, Kapacitor, and Influx itself), all work very well together as a suite of tools that solve all of these problems very well.
Robert Allen 00:05:45.481 So, what we do with the TICK Stack is in most ways not very different from anybody else’s work with any form of metrics. I mean, we collect metrics using Telegraf. Historically, we’d used Collectd. We’d used StatsD in some places. But what we’ve found is the Telegraf solution is very clean, very nice to work with, very easy to configure, and you have a lot of control around the automation that goes into it. It’s been a boon for us, really, because making the switch from what we were doing to Telegraf was generally a very painless process. And we don’t change things very lightly. We change things a lot, but we don’t generally change them on a whim, and there has to be a pretty compelling reason to do it. Telegraf was able to give us plenty of reasons that it was compelling to make the move. We’re able to ingest what the services send, and in the future, we’ll be able to fan out a lot of this using Kafka to ship metrics, as well as consume them into Influx or into other services.
Robert Allen 00:07:18.726 And then finally, the one-to-many model is a nice solution for us because we run a lot of containers. Everything from the application side of things, not the operational side, is run in containers. And in some cases, each of those containers, or some of those containers, may run their own Telegraf instance to monitor itself, and that way the engineers are able to maintain their own custom configuration for the various metrics. If it’s not InfluxDB line format, say, it’s the old Graphite format, you can easily write filters that convert that data very nicely into Influx format so that you don’t lose any resolution in your various metrics.
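As a sketch of the kind of filter described here, a Graphite plaintext line can be mapped to InfluxDB line protocol without losing resolution. The dotted-path layout (app.host.metric) is a hypothetical example, not HMH’s actual naming scheme:

```python
# Convert a Graphite plaintext metric ("dotted.path value timestamp")
# into InfluxDB line protocol. The app.host.metric layout here is an
# illustrative assumption, not a real production convention.

def graphite_to_influx(line: str) -> str:
    path, value, timestamp = line.split()
    app, host, metric = path.split(".", 2)
    # Path segments become tags; the epoch-seconds timestamp is
    # promoted to the nanoseconds line protocol expects by default.
    return f"{metric},app={app},host={host} value={value} {int(timestamp) * 10**9}"

print(graphite_to_influx("checkout.web01.request_latency 0.25 1500000000"))
# request_latency,app=checkout,host=web01 value=0.25 1500000000000000000
```

In Telegraf itself this mapping is normally done declaratively with the Graphite parser’s template strings rather than custom code; the function above just shows what the transformation amounts to.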
Robert Allen 00:08:14.607 Influx. So one of the things that we really found attractive was the fact that we’re able to support individual databases for each of our development teams. We currently have, I think, 24 or 25 of what we call “roles” developed, and each is usually a functional team or group of individuals working to solve a particular product or platform service problem within the company. And each of those teams is able to maintain their own databases and their own retention policies with Influx. The general operational metrics, as well as common metrics across all of these services, are stored in a common database that is then rolled up across the teams. The retention policies, the segregation that we get between short-term data and long-term data, is a very nice change from, say, Prometheus, where the data generally should be accepted as ephemeral, and also Graphite, where Whisper and all of that was just problematic and heavy for us. We also store annotations of events: curl calls from, say, Jenkins for deployments push those points into Influx, and then we’re able to use those for annotations with Grafana, etc.
Robert Allen 00:09:46.661 And then there’s Kapacitor. We have teams that are running Kapacitor either separately as ephemeral tasks in Mesos, or we also have it as a service where teams are able to authenticate and either set custom jobs or use templates for the DRY approach (Don’t Repeat Yourself). The ability to write user-defined functions. All of these things really made it very attractive and helped solve our problems. Finally, what you may be here for: the AWS billing. So we have a large number of accounts at Houghton Mifflin Harcourt. We have 23 accounts, well, actually, it’s around 23 accounts, but not quite 23. We have a lot of accounts. And each one manages their own services within that, and the billing gets quite unwieldy and spread out. Each month, we ingest about 23 million line items from AWS. Each one of these we convert into points, and they’re at one-hour granularity, so every line item represents the billing for one hour.
Robert Allen 00:11:08.240 At AWS, we also capture all the resource IDs. This would be like the i-dash-whatever for the instance IDs. It might be ARNs for non-instance-type services. And what that helps us do, coupled with the custom tagging, is we’re able to use all of those as tags in our Influx data, and then we’re able to break down and really look at all the different ways that either products are consuming AWS services, or the cost changes, deviations from what we expect versus what we know. And these are all quite nice. Before 1.3, we did have cardinality issues, and I can’t tell you what that cardinality is because I don’t know, and because there are some technical limitations with it still. But it’s very much improved with the new TSI. I hope I’m saying that right [laughter]. But that’s performed quite well for us now. The bad part is that I’m not able to list tag values or keys currently, because the cardinality is so high. It appears that the storage engine just does not return that query. But as far as the graphs go, it’s quite performant still.
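A sketch of how one billing line item becomes an InfluxDB point, with resource IDs and custom tags promoted to tags. The column names follow AWS’s detailed billing report “with resources and tags,” and the `aws_billing` measurement name and `user:cost` tag are illustrative assumptions, not the speaker’s actual schema:

```python
# Turn one detailed-billing CSV row into a line-protocol point at
# one-hour granularity. Column names mirror the AWS detailed billing
# report; treat the measurement and tag names as hypothetical.
import csv, io
from datetime import datetime, timezone

def row_to_point(row: dict) -> str:
    ts = datetime.strptime(row["UsageStartDate"], "%Y-%m-%d %H:%M:%S")
    ts = ts.replace(tzinfo=timezone.utc)
    tags = {
        "account": row["LinkedAccountId"],
        "product": row["ProductName"].replace(" ", "\\ "),  # escape spaces for line protocol
        "resource_id": row["ResourceId"],
        "cost_tag": row.get("user:cost", "untagged"),  # custom cost-allocation tag
    }
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"aws_billing,{tag_str} cost={row['UnBlendedCost']} {int(ts.timestamp())}"

sample = io.StringIO(
    "InvoiceID,LinkedAccountId,ProductName,ResourceId,user:cost,UsageStartDate,UnBlendedCost\n"
    "Estimated,123456789012,Amazon EC2,i-0abc123,math-app,2017-09-01 00:00:00,0.12\n"
)
for row in csv.DictReader(sample):
    print(row_to_point(row))
```

Because every dimension (account, product, resource ID, cost tag) lands in the tag set, the data can be sliced broadly or narrowly, which is exactly where the cardinality pressure he mentions comes from.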
Robert Allen 00:12:39.128 How do we do this? So we have two retention policies. We have a five-week retention policy, and we have an unlimited retention policy. Currently, the five-week retention policy is for the non-invoice data. That’s all the estimated data that gets published out daily by Amazon, and so it has no invoice ID attached to it; each record is just marked estimated. And so we don’t keep that long-term, we keep that short-term. We let it just expire off after five weeks. We use that for day-to-day monitoring of where expenses go up or change. If we see a sharp spike in certain services, we may alert on that. We also use it for investigative purposes. And one example of that: when we first started doing this, we were able to actually go in and investigate each product line. And we noticed that there was roughly a $3-an-hour expense related to CloudTrail. And as we started investigating, it was our understanding that CloudTrail should be free. And it turned out that somewhere, somebody had enabled a second CloudTrail stream alongside the first, and while the first one is free, any subsequent ones after that are not.
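The routing rule described here is simple: estimated records carry no real invoice ID, so they go to the short-term policy; invoiced records are kept forever. A minimal sketch, with the retention policy names (`five_weeks`, `forever`) as illustrative stand-ins:

```python
# Route a billing record to a retention policy based on its invoice ID.
# The RP names are hypothetical stand-ins for the real policy names.

def retention_policy(invoice_id: str) -> str:
    # Daily estimated data carries no real invoice ID; AWS marks it
    # "Estimated". Keep it five weeks for day-to-day monitoring,
    # then let it expire off.
    if not invoice_id or invoice_id == "Estimated":
        return "five_weeks"
    # Finalized, invoiced data is kept indefinitely.
    return "forever"

print(retention_policy("Estimated"))   # five_weeks
print(retention_policy("123456789"))   # forever
```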
Robert Allen 00:14:16.864 And as it turned out, having multiple streams had ended up costing us $3 an hour, which is about $25,000 a year. We were able to identify that very quickly. We also use it to monitor our reservations. Because of this data, we were actually able to discover how reservations are handled in multiple accounts. So if you have a top-level account, your billing account, and then you have downstream accounts, currently, AWS doesn’t really handle that all very well. We may buy reservations for our account, but because they’re purchased with a top-level account, it’s sort of first-come, first-served. So, just because you went out and bought these reservations for your account, that doesn’t necessarily mean that you’re going to get credit for those. And so we see a lot of deviations in what we expect to see for reserved instances versus non-reserved instances. And so we’ve used a lot of this to actually open a dialogue with AWS and try to get that improved over time. Well, it’s still ongoing. But it’s been very helpful for us to know where we’re actually at versus where we think we are.
Robert Allen 00:15:46.860 We use the custom tags for cost allocation of products. So we have all these different apps. We use a tag called cost. In this particular screenshot, there are 80, but it’s actually more than that; that was as data was being loaded in. But each one of these is associated with a particular product. Some of these in that list you might recognize. But what we do with that is it gives us a quick way, now near real-time, within 24 hours, to monitor the expenses or the costs of infrastructure for each individual product. This has been very helpful in creating awareness throughout engineering. It used to be, without having this level of insight into where your expenses were at, it was sort of just a black box. So engineers were really just incentivized to get it working, with no hard regard as to what the cost was. And what we’re trying to do now is get this in front of the developers and engineers so that, day to day, they actually have that closed feedback loop on what their decisions are costing.
Robert Allen 00:17:26.508 It also provides an interesting way to correlate the actual application workload, how many requests it’s dealing with from day to day, and then actually see how that does or doesn’t affect the cost. Back to school is our busy time of the year, and we’re just really coming towards the end of it. Where everybody’s going back to school, they’re rostering up. We’ll roster millions of students. In August and July, that all starts ramping up, and we’ll need more batch workers. We’ll need more resources. So we can see all this rostering coming in, and we can also see our cost going up. And then as the rostering goes down, we know that we should be expecting to see infrastructure also going down. And so all of these things come together, bringing the performance as well as the financial aspects of running your infrastructure together. But anyways, I’d hoped to do a demo. I’ll do my best here. One second.
Robert Allen 00:19:07.846 So this is a limited subset of one account. This was actually an engineering account. We break it down by product. We also break it down by hourly cost because that’s really like run rates; we’re always interested in what our run rate is. And then we also break down by certain products. So for EC2, we’re able to look here, and we would actually expect to see the non-reserved instances being much lower than that, because we’re very close to probably 85, 90 percent of everything being reserved instances. However, because of AWS accounting, we don’t always see what we expect to see there. And so it really helps us to work with accounting, work with AWS, and, ongoing, try to get that all sorted out and straightened out to where we can get more accurate billing at the account level. You’ll see these green lines here. These are the annotations. So we actually use the CloudTrail logs. As those events go into either Elasticsearch or get pushed into InfluxDB, we’re able to use them as annotations for our graph. So when we spin up, like, our dev cluster, which may only run for three or four hours to do our testing, we can actually see when new instances are brought up. Usually, we should be able to correlate that with cost increases, etc. That’s one thing. We also use it for security.
Robert Allen 00:21:13.404 So the CloudTrail logs for various security events, we’re able to graph those: logging in, account creation, things like that. These are all things that we’re able to use with this. Additionally, we have Elasticsearch. These all pretty much look the same, and also suffer from the very same limitation of not having accurate accounting. One thing you’ll notice here is that it’s roughly, it looks like it’s probably 02:00 UTC, when the billing actually gets processed. As for how we actually process this billing: the billing is shipped to an S3 bucket automatically. We have an S3 bucket event, so when files are uploaded to that bucket, it triggers an SNS event to an SQS queue, which we are then watching with a script. When it sees that event, we then check to see if it’s the file that we need to process. That file is the resource-IDs-and-tags file. And then we pull down that file from S3, and then we load it in. You don’t necessarily have to worry about going back and deleting previous data. Because if you write the same points, and they all have the same timestamps, you’re overwriting the same point, not actually duplicating that data, which is a nice feature. So with that, any questions?
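The last point, that re-loading a file overwrites rather than duplicates, follows from InfluxDB keying each point by measurement, tag set, and timestamp. A toy model of that write semantic, not the actual storage engine:

```python
# Toy model of InfluxDB's write semantics: a point is identified by
# (measurement, tag set, timestamp), so re-writing the same identity
# replaces the field values instead of duplicating the point.

class ToyTSDB:
    def __init__(self):
        self.points = {}

    def write(self, measurement, tags, fields, timestamp):
        key = (measurement, frozenset(tags.items()), timestamp)
        self.points[key] = fields  # same key: overwrite, not append

db = ToyTSDB()
tags = {"resource_id": "i-0abc123", "product": "ec2"}
db.write("aws_billing", tags, {"cost": 0.12}, 1504224000)
db.write("aws_billing", tags, {"cost": 0.12}, 1504224000)  # re-load the same file
print(len(db.points))  # 1 -- the second write replaced the first
```

This is why the daily re-ingest of the estimated billing file needs no delete step: each hour’s line item lands on the same series and timestamp every time.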
Chris Churilo 00:23:04.463 So let’s go to the Q&A section. And the first question that we have is, “How are you collecting billing metrics?”
Robert Allen 00:23:14.953 Right. So for the billing metrics, we use programmatic billing access from Amazon, and I’ve got some links here that really talk about it. But Amazon, basically, they do consolidated billing, which comes out in a massive CSV file. One month is raw text CSV. It’s about 900 GB. It’s roughly 23 million line items. That’s what we use for our billing. There are potentially other ways that you can capture some billing metrics, but we don’t need absolute real-time. We’re more interested in what happened yesterday, etc. And so for that, we use the CSV file, the Amazon account billing, the programmatic billing CSV file. Now there are several different types of CSV files that they can push. One is very detailed; that’s the one that we’re using. And then there are some that are less detailed, such as the cost-allocation one, which is really rolled up already. It’s pre-rolled up based on different dimensions, from product. It might be a monthly summary. There are just various ways that you can do that. But that’s how we’re doing it.
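Since AWS drops several report variants into the same bucket, the ingest script has to pick out the detailed one. A sketch of that filter, using the filename pattern AWS has used for the detailed billing report “with resources and tags” (verify the pattern against your own bucket before relying on it):

```python
# Decide whether an S3 key is the detailed billing file worth processing.
# The pattern below reflects AWS's detailed-billing-report naming
# convention; treat it as an assumption to check, not a guarantee.
import re

DETAILED_BILLING = re.compile(
    r"^\d{12}-aws-billing-detailed-line-items-with-resources-and-tags-"
    r"\d{4}-\d{2}\.csv\.zip$"
)

def should_process(s3_key: str) -> bool:
    return bool(DETAILED_BILLING.match(s3_key))

print(should_process(
    "123456789012-aws-billing-detailed-line-items-with-resources-and-tags-2017-09.csv.zip"
))  # True
print(should_process("123456789012-aws-billing-csv-2017-09.csv"))  # False
```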
Chris Churilo 00:24:53.278 Cool. So how did you guys come to the realization that a Time Series Database is really how you could get to the bottom of all these billing discrepancies?
Robert Allen 00:25:02.987 Yeah. I mean, that’s a good question. We originally looked at some tools like Druid and other forms of data processing tools. But, ultimately, where we ended up was really just looking at the data and realizing that it is, I mean, it’s very much time series data. The invoice is done on a line-by-line, hourly basis. You have various facets of the data, be it a resource ID, an instance ID, your tagging, your custom tagging, your product, your action. I mean, for me, it was just an easy fit, and that’s what led us to that.
Chris Churilo 00:25:58.298 Now this wasn’t your first experience with InfluxData, right? You use it also for DevOps monitoring?
Robert Allen 00:26:04.509 Yes. So we started using it, again, at the time we were using Prometheus, and we really liked the labeling, as Prometheus calls it, having those dimensions and just the organization of the data. But the problem that we had with that was we wanted a good, solid, reliable form of storage. We found Influx as we were just going through our research process of all this, and what we found was, with the SQL-like query language, we could keep all the things that we liked about this form of time series storage. It wasn’t a hard decision to make. Now, I think it’s important to note that, in many ways, InfluxDB is still sort of new in that it’s developing, it’s improving with every release. There are some things that you sort of have to know going in, that it’s probably not going to be without problems, but that doesn’t mean that it loses any of its value. It only improves over time. And one of the things that comes to mind is the cardinality issues, which are very much improving, and the 1.3 version is a testament to the fact that it’s ongoing and developing all the time.
Chris Churilo 00:27:51.796 Maybe you can talk a little bit about the containers that your engineering team are using. You and I talked a little bit about that when we met at MesosCon.
Robert Allen 00:28:03.697 Sure. So like I said before, currently, everything that runs in our cluster runs in Docker containers, and for us, the reason for that is really about putting more control and self-service ability of the engineering efforts in the engineers’ hands. My team, we’re called the Bedrock Platform Technical Services in Houghton Mifflin Harcourt, and our rule of thumb is that we don’t exist to make Bedrock safe for engineers. I’m sorry. Man, how do you mess that up? It’s not our job to make engineers safe for Bedrock; we make Bedrock safe for the engineers. Now what that means is we have to provide an environment where hundreds of engineers can coexist and develop without adversely impacting the work of their peers. Part of that is that everything runs in containers. That gives them the ability to choose what version of Java they may run, if they’re running Java or the JVM, their environment variables, all of these things. We give them the ability to make those decisions. And because they’re in a container, it exists in an isolated environment. This also gives them the ability to have more control over what they collect as far as metrics, and how they collect those metrics. Some may be happy with collecting the metrics in the generic fashion that we provide out of the box in our infrastructure.
Robert Allen 00:29:54.436 In some cases, however, they may have a very specific way that they want to collect these metrics. So that’s where Telegraf really steps in for them. They’re able to run Telegraf in the container with their process, and it monitors it locally. A good example of this might be a Spring Boot application, where they may want to expose the management port, but they don’t necessarily want that port exposed outside, and so they bind it to localhost within the container. It never is actually exposed outside the container. Telegraf is able to safely, and in an isolated fashion, pull those metrics down and then send them on. That’s just one example.
Robert Allen 00:30:41.660 We’re also looking at using it for telemetry and KPI data. So, as I mentioned, the rostering of millions of students. In the past, we had collected a lot of this KPI information using Elasticsearch, like how many students were rostered for a particular school district. And we did all this in a very mechanical fashion using log parsing, etc. Going forward, we’re actually capturing these things in Influx as events, and as counters, metrics, etc. Because we now have the flexibility of high cardinality, we’re able to actually collect a lot of these metrics now and roll them up using, again, time series data. Because for me, some tools are built for time series data, and some are actually just indexes. For me, Elasticsearch is an index. It’s an index first, database maybe second. That’s where I’m at with that.
Chris Churilo 00:31:54.411 Thank you. So the Q&A is open, same with the chat. So please put your questions in there. We’ll continue chatting. I can always ask lots of questions, but I also want to make sure that everyone else on the line gets a chance to ask Robert any questions about his implementations. You’ve already heard the three different use cases that he’s talked about, so please ask him about any challenges that they’ve had, any good things that came out of this, etc. Now’s your chance to talk to him. While we’re waiting for some of these questions, I also want to give a plug out to Robert. He’s always looking for great engineers to come and join the HMH team, so check out their career site. If you know anybody, especially, probably in the area that you’re working at, Robert, or are the engineers all distributed?
Robert Allen 00:32:47.991 They’re worldwide. Yeah. We have positions open worldwide, and my team is also worldwide. Currently, we have three engineers in Finland, two in Chicago. We’re looking for engineers in Dublin, the States: New York, Boston, Chicago. Yeah. We are pretty open to anything at this point.
Chris Churilo 00:33:16.261 So yeah. Everybody, if you know anyone that’s looking for a job, I’m sure between all of us, we have plenty of friends in all those various locations that Robert just mentioned. And then Robert is also going to be speaking at MesosCon in Prague, right, coming up?
Robert Allen 00:33:30.538 Sure. Yeah.
Chris Churilo 00:33:31.334 So if you happen to be in the area, you happen to be going to that conference, check out his talk. And then there will be another opportunity for you to also ask him questions about his implementation of InfluxData and the various use cases that he talked about. So I think why don’t we talk just a little bit more about Telegraf. So you guys actually created your own plug-in, if I remember correctly.
Robert Allen 00:33:58.295 Yeah, that’s actually one of the nice features of Telegraf that I was attracted to: the actual library, it’s Golang, for developing your own plug-ins is pretty straightforward and easy to work with. We are an Apache Mesos, Apache Aurora shop. That’s our primary mode of delivering applications. We wanted good metrics for that. In the Prometheus world, we had used sort of forwarders to collect the metrics from the API and then transform them into the Prometheus line format. We wanted something similar to that, but I was never quite happy with how the data was organized. With Influx and other forms of time series data, the way you organize the data gives you a lot of room to do it wrong and only a little room to do it right.
Robert Allen 00:35:05.658 And so what we did was we developed a plug-in for Apache Aurora, which didn’t exist at the time, and it gave me the ability to go through and actually organize that data in ways that make sense. Because in reality, the Apache Aurora metrics endpoint is, well, bad. It’s not very well organized. The main structure has evolved over time. Things just sort of got thrown on. It’s just not very well organized. And so by writing this plug-in, I was able to go through and really correct some of the organizational issues with that, as well as pulling other endpoints of metrics that in Aurora aren’t necessarily considered metrics. Things like quota consumption, pending tasks, these are all separate endpoints, but the plug-in will pull all of that together and then organize it in a way that’s much more usable over a longer period of time. That is not merged into Telegraf just yet, but hopefully it will be soon. And one of the other things that we are looking at building soonish: we use Vault for secrets, and we were preliminarily working on writing a Vault plug-in to manage InfluxDB usernames and passwords.
Robert Allen 00:36:40.959 So I don’t know if you’re familiar with that, but the idea is that you log in to Vault with your token or whatever form of auth. You make a request to Vault for a password or username-password combination, which is temporary. That way, no application or individual is actually holding secrets. It makes an account creation API call with whatever rules it has, limiting it to a certain database, admin, etc., for a particular database in Influx. It returns that username and password, and then the user is able to log in. You no longer will need to configure a username and password in your application, or your Telegraf, or anything else. That’s all handled by Vault, then. And so it’s just a matter of renewing those credentials, as well as being able to revoke them. So when a person logs out or stops using it, say, we’ll revoke those credentials.
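The idea being described, dynamic, short-lived, revocable credentials, can be sketched as a small credential broker. This is a conceptual toy model of the pattern, not the Vault plug-in API or InfluxDB’s actual user-management calls:

```python
# Conceptual sketch of Vault-style dynamic credentials: a broker mints
# a short-lived user on request and can revoke it later. All names and
# the lease mechanics here are illustrative, not a real Vault backend.
import secrets, time

class CredentialBroker:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.leases = {}  # username -> (password, expiry)

    def issue(self, database):
        # In the real flow this step would call the database's
        # user-management API with grants scoped to one database.
        user = f"v-{database}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(16)
        self.leases[user] = (password, time.time() + self.ttl)
        return user, password

    def revoke(self, user):
        # Called on logout, or when a lease expires without renewal.
        self.leases.pop(user, None)

broker = CredentialBroker()
user, password = broker.issue("metrics")
broker.revoke(user)
print(user.startswith("v-metrics-"), user in broker.leases)  # True False
```

The payoff is exactly what he describes: no application or person holds a long-lived secret, and revocation is a single broker call rather than a password rotation.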
Chris Churilo 00:37:42.766 Excellent. So I just had a thought. What I think-what I’m going to do is, when I post this video later on, I will start a thread on community.influxdata.com, so we can post our questions for this webinar. And Robert can see them and then answer them directly with anybody who might have some questions then-who is just a little bit too shy right now to ask them. Also, sometimes, I know after the webinar ends, then all of a sudden, your head gets flooded with questions. So this will give you an opportunity to be able to link up with him and find out more of the nitty-gritty details of his implementation. So we’ll keep the lines open for just a few more minutes, maybe somebody will not be so shy. I noticed that it’s either someone is very talkative with Q&A, or people go really shy on me. But, Robert, I do appreciate this presentation. I know when I met you and you started telling me about these use cases, I was pretty impressed with the various use cases. Hopefully, this will serve as inspiration for some of our other customers. That it’s not just for the typical DevOps metrics. It’s a great tool for collecting anything that you want to understand in real time or near real time, such as the KPIs that you’re collecting for your business opportunities, too.
Robert Allen 00:39:09.775 Right. There are different ways to actually grab these metrics from the CSV file. I use a Python script that just sits there and polls the SQS queue for changes. Another way is a Lambda; you could easily do this as a Lambda function. I mean, there are just different ways you can do it. I’m excited; as time goes by, I really feel that there are going to be a lot of different ways that we’re able to use time series data in ways that we haven’t even thought of yet.
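The polling approach described here can be sketched with a plain in-process queue standing in for SQS; the real script would use boto3’s receive/delete calls against the actual queue, but the shape of the loop is the same. The message format and filter are illustrative assumptions:

```python
# Sketch of the SQS-polling loop, with queue.Queue standing in for SQS.
# A real implementation would long-poll SQS via boto3; this just shows
# the receive -> filter -> process shape of the script.
import json, queue

def poll(messages, process, is_wanted):
    processed = 0
    while True:
        try:
            msg = messages.get_nowait()
        except queue.Empty:
            break  # real code would long-poll and loop forever
        key = json.loads(msg)["s3_key"]
        if is_wanted(key):  # only the resources-and-tags file
            process(key)
            processed += 1
        # the message is consumed either way, mirroring an SQS delete
    return processed

q = queue.Queue()
q.put(json.dumps({"s3_key": "billing-with-resources-and-tags-2017-09.csv.zip"}))
q.put(json.dumps({"s3_key": "billing-summary-2017-09.csv"}))
loaded = []
n = poll(q, loaded.append, lambda k: "resources-and-tags" in k)
print(n, loaded)  # 1 ['billing-with-resources-and-tags-2017-09.csv.zip']
```

A Lambda version would replace the loop entirely: the same filter-and-process body runs once per S3 event notification instead of pulling from a queue.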
Chris Churilo 00:39:52.529 Well, awesome. Well, thanks again for joining us today. And if you have any engineers that would like to join the HMH team, reach out to Robert. I’m sure they’ve got a lot of interesting projects. As you mentioned, it’s not the same 200-year-old company as it was 200 years ago [laughter]. They’re doing some pretty innovative stuff. And Ovid says, “Thank you.” So we got at least a thank you from the crowd [laughter].
Robert Allen 00:40:21.703 Thank you.
Chris Churilo 00:40:22.915 And once again, thanks everyone for joining us. And we will be posting this webinar on our website under our Resources section. We’ve been slowly changing the website to make it a little bit easier to find these webinars.
Robert Allen 00:40:38.371 And log in, I hope.
Chris Churilo 00:40:39.641 And log in. Yes, I put that in there, Robert. It’s in there [laughter]. Definitely.
Robert Allen 00:40:48.022 Yeah, so we know.
Chris Churilo 00:40:49.743 Exactly. We will fix these issues. Once again, everybody, thank you so much. And I hope you have a wonderful day.
Robert Allen 00:40:59.757 Thanks y’all.
Chris Churilo 00:41:00.908 Thanks. Bye-bye.