A Benchmark Review of InfluxDB and MongoDB
In this webinar, Robert Winslow compares the performance and features of InfluxDB and MongoDB for common time series workloads, specifically looking at the rates of data ingestion, on-disk data compression, and query performance. Hear about how Robert conducted his tests to determine which time series database would best fit your needs.
Watch the Webinar
Watch the webinar “A benchmark review of InfluxDB and MongoDB” by clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “A benchmark review of InfluxDB and MongoDB.” This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
Speaker: Robert Winslow, Consulting Developer
Robert Winslow 00:01.650 Hello everyone, welcome to another benchmarking report from InfluxDB. Today we’re going to be taking a look at Mongo, and this is going to go over the methodology we used to compare both of these databases. At the end, we’ll get into a little bit of specifics about some of the things that we did to achieve the best performance in Mongo. So we want a question to frame what we’re trying to do with all of these performance experiments. And that question is, “Is InfluxDB faster than Mongo for workloads you actually care about?” I’ll get into more detail about that in a moment, but let me go over the goals of the project. First, are our benchmarks realistic? Can we answer performance questions for real-world users, not just database engineers? We want to compare workloads that are close to the real thing, and we do that through a simulation-based approach that I’m going to dive into in this talk. And that simulation-based approach models a DevOps use case. We’d like our experiments to be rigorous. To us, that means that the query results that come back from each database should be correct and identical. And we’ll get into a lot of detail on that as well. Of course, it should be easy to reproduce our results. And all of our code is open so that the community can offer feedback and suggestions. Finally, we want to be fair. We want to show Mongo in the best light. Part of my role in this as an external contributor is to try to be unbiased, and we learned from MongoDB experts how to get the best performance. Our schema design handles the general time series use case. It works well for all ingestion rates and it supports ad hoc queries using tags. At the end, we’ll get into some of the choices that we made to achieve that.
Robert Winslow 02:04.188 Now let’s get into the methodology. There are five phases and I’m going to go through them one-by-one. Each time we perform a benchmark we go through these phases. The first phase is to generate data. The DevOps data generator creates time series points that are modeled after real-world data from server fleets. This is something that we know a lot about because we also work on Telegraf at Influx, so we took the metrics from that and created a simulation that models that data. In the benchmark suite, that program name is bulk_data_gen. All of our simulated data is generated using a pseudo-random number generator with a configurable seed, which means we can reproduce any of our benchmark runs with just a few command line flags. In our simulation, in the data that we’re generating, we have nine different measurements that we simulate through time. As you can see here, there are a variety of them, but in general, there are two types. One is system-level telemetry, so CPU and disk and memory and network behavior. And then the second is application-level telemetry. Here we have three applications running on each simulated server: Nginx, PostgreSQL, and Redis. In the simulation, every simulated server has 10 tags that stay the same throughout the simulation, and these are what identify it.
Robert Winslow 03:41.300 Here we took them from, again, what Telegraf gives us. So we have the machine architecture, the data center, the host name, the operating system, and so on. And again, we sign these tags once to every simulated server and then they are constant throughout the rest of the simulation. In particular, the values that we assign to these are also fairly realistic, so the operating system is choices amongst like ubuntu 16.04 or 14.04, and where applicable we used AWS labels to try to be as realistic as possible. Finally, each simulated point has values. So far, we’ve looked at the name of the measurement, say CPU, instead of tags that identify this machine, and then now we get into the actual data. So here we have a set of example simulated values, these are all in the CPU measurement. And the way that we simulate most of these values is not just by picking a random number. We actually use what’s called a random walk. And we found that that’s a much more realistic model of what machines actually do. And again, this is part of our attempt to be as realistic as possible and as well as rigorous. And here’s an example of the random walk for CPU usage percentage through time. As you can see the deltas between time points are-they model more what you would see in a real system. This is important practically because if you have uniformly distributed data as opposed to random walk, you’ll get different compression behavior.
Robert Winslow 05:28.760 So InfluxDB has a mechanism to compress data that looks like this, which is realistic. Every simulated data point has a timestamp. Here we see a nanosecond since the beginning of UNIX time. For Influx, nanosecond precision is available. And for Mongo, our schema uses nanoseconds. As the simulation ticks through time emitting these telemetry data, the ticks are in 10 second epochs. So here’s an example to bring the point all together. So this is one data line in the InfluxDB line protocol format, and it has four parts. So we have the measurement name, it’s CPU. Machine specific tags which, again, are static, the host name, the region, the data center, and so on. The fields which are all a subset of what we’re measuring about the CPU, so what’s the user usage in percentage? What’s the system usage in percentage? Then you can see that these values all lie between 0 and 100, and so they are percentages. And finally, we have the timestamp in nanoseconds since the unit’s epoch.
Robert Winslow 06:50.079 The data generator serializes points in different formats. For this talk, we’re concerned with Influx and Mongo, and so the formats we use for both of those are for Influx, we use the HTTP line protocol which you just saw an example up. And for Mongo, we use the bulk write format which encodes in BSON. And we use the standard mgo library for GOLANG. One point here is it’s important to try to generate and serialize data before performing a bulk write benchmark. If we do it synchronously, the times actually encoded into these data formats could pollute our timing data. We try to minimize that. With Influx, we’re able to do all serialization ahead of time. For Mongo, we’re able to reduce some of the serialization. So we store MongoDB data points as a series of flat buffers objects. Flat buffers is a zero allocation, very fast data storage format. And so this is our attempt to reduce serialization overhead when running this benchmark. Concretely, for this benchmark-and there’s going to be a report online that you’re welcome to read-we generated the DevOps-1000 data set, and the way to interpret that is we have a simulation of models that DevOps, in this case, and we picked 1,000 simulated servers as part of that. So the simulation is DevOps data, 1,000 servers.
Robert Winslow 08:27.120 So as a summary of the generation of data phase, datapoints are generated from a simulation of servers. Every 10 seconds, each host emits 100 field values across 9 measurement points, and we save the output in a zstd-compressed file. On the part two of five, we’re now going to take the data we just generated and we’re going to load them into the databases benchmarking results. So we’re going to bulk load data, pull it into from standard input, and measure the write throughput. Each loader program combines input into batches, and then they send them to the databases as fast as possible. Of course, parallelism is important in these programs. We have one bulk loader program per database. This lets us specialize. In this case, it was necessary because the actual libraries that we use to write to Influx and write to Mongo were different. And here, the program names were bulk_load_Mongo, bulk_load_Influx. And I give you those so that it’s easy to go find in the source tree on these executables and read the source code. And they both take configurable parameters. In common, they both have parallelism and batch size.
Robert Winslow 09:43.386 When loading data into Mongo, we use the bulk write API, and again we use the mgo library to achieve this. Just one little caveat, mgo forces us to do the BSON serialization at benchmark time. As I mentioned earlier, we prefer to serialize ahead of time, but we also prefer to be realistic, so we use the off the shelf library, mgo, instead of writing our own client. We believe this represents the use-case of the vast majority of Mongo users. Please offer your feedback on the MongoDB loader. We did our best to make it fast and fair. On the InfluxDB side, the loader is a straightforward HTTP client. We use the Go library, fasthttp, to achieve the best performance that we could, especially compared to the Go standard library. Fasthttp is much faster. So each of these programs, after they are done loading data, give a small summary of what happened. So for the bulk loader for InfluxDB, the example output I have here shows that we loaded a certain number of items in so many seconds with 16 workers, and it gives you the average rate of lines per second. And in this case, it’s the lines per second, the bulk line protocol, that Influx uses. In this case, each line can contain multiple values, so this number is actually much greater than what you can see here.
Robert Winslow 11:25.566 Moving on to part three of five, we’re now going to generate queries that we will use in part four to query the data that we’ve loaded. So we’re going to create these queries again with the simulation-based approach and save them to a file. Like the data generator, the query generator the logic is shared between the databases and it lives in the program bulk query gen. Many benchmarks suites generate and serialize queries at the same time as running benchmarks. As we discussed a moment ago, this can pollute timing data. And we tried to minimize those effects to these benchmarks. Each query comes from a template and each of those templates have variables that we use-that we fill in with values from the simulation. Most of them involve, primarily, a start time and an end-time query as well as server names that are used as filters. Other parameters are the query type, for example, max value, mean value and the grouping interval. For example, one minute or one hour. And just here are some examples of query types that we have had in the past with this benchmark suite, so what’s the maximum CPU usage over four random, simulated hosts during our random hour groups by minute? Or, what’s the mean CPU usage over the whole server fleet during a random day grouped by hour and hosting? And the code base is fairly extensible. We have a number of different queries that we use to run our tests.
Robert Winslow 13:16.846 So getting more concrete, we are going to look at now an example MongoDB query from the query generator. So Mongo queries are serialized in the BSON format. Here is a lightly edited, human readable version of that. So this is page one of two. So this is a Mongo query that looks at a time-series collection and computes a basic aggregation over it. So you can see here that StartNanoseconds, EndNanoseconds, FieldFilter, HostnameFilter, and TimeInterval are all variables. And so these will be filled in. This query structure will be the same, but during the query generation phase, all of these template fields will be filled in with different values. And here’s page two of two of the template ties query. There are no variables here, but I just want to show you this to get a flavor for how it looks to make time-series queries of Mongo. And here is the same query template but filled in with values this time. So you can see that we’ve used nanoseconds for the greater than or equal field and the less than field. We have some tag filters, so here we’re looking at the hostname that matched host underscore one. So in the simulation, that would match exactly once over.
Robert Winslow 14:40.985 And then finally near the bottom, we have what is the time bucket. So this is an hour but in nanoseconds. So it’s 1 billion nanoseconds times 3,600. And again, page two of that query. In contrast, Influx queries are quite a bit shorter so I can show them-the both versions on the same page. So here’s an example with template variables at the top. We are selecting the maximum over the usage user field from the CPU measurement where time is greater than something that we’ll provide and time is less than something we’ll provide. And hosting matches a variable and we specify a time interval to group by. And then the second box here shows that concrete query and it reads pretty much how you would describe it maybe to a human or as you would write pseudocode, which I found very convenient to use. So when we generate queries, we will be saving them to a file. So we save each query as a GOB-encoded value using that protocol, which is the sort of default Go binary and coding for Go types. We can decode about 700,000 requests per second when we’re piping data to the query benchmarker, which we’ll talk about in the next section. This hasn’t been a limit for single-node benchmarking, but we will be benchmarking clusters, and so we have plans on how to speed this up.
Robert Winslow 16:23.900 Like the simulation for data generation, the query generator uses a deterministic, random number generator to fill in the parameters for each query. So, just like the bulk data generator, its output is 100% reproducible. Also, note that the query generator for specifying start and end times-it picks random start times. And the resolution is down to, I believe, a second or millisecond. And so the idea there is that we prevent databases from caching query results. And you can read more about that in the source code. And finally, when you’re using this program, here are some with the example output you’ll see. So if you’re generating Influx queries it will show you the random seed that it used, which of course you could configure manually if you like, and the type of query, and how many points it generated, and identical output for Mongo as well. So here we can see that we created 10,000 max CPU queries that range over one random host in an hour by a minute. And here’s a link to the source code. Again, please provide a feedback and suggestions.
Robert Winslow 17:38.245 On to part four of five, we’re going to actually take those queries that we just made and we’re going to benchmark them by running them against both databases. So we’re going to run them as fast as we can against each database. Just like the other tools, this will be using standard IO, and queries can be made in parallel. Similarly to bulk loading, there are separate query benchmarker programs. The names of them here are Query Benchmarker Mongo, and Query Benchmarker Influx. And they both share parameters, parallelism, and the destination url. And also just like with the bulk loading, we use the mgo client library for our Mongo querying. And so when the queries get run, of course, there’s a lot of output depending on the level of verbosity. But at the end, what you get is a summary of how the queries went and how fast they were. So this is an example from our benchmark run, so I’ll go through it line by line to help you interpret it. The first line says: “Burn-in complete after 100 queries with 4 workers.” Now, burn-in is something that is used to stabilize the behavior of the database. These are queries that are executed and then thrown away. They’re not used for collecting statistics. This is primarily important for two reasons in my mind. One is for applications using something like the JVM, where you would like to give the just-in-time compiler a chance to warm up. And second, to give the database an opportunity to request data from whatever its files are and have them be memory-mapped or otherwise cached in RAM.
Robert Winslow 19:25.685 The second line says: “Run complete after 5,000 queries with 4 workers.” So we ran 5,000 queries, sort of spreading them out uniformly amongst 4 workers that were hitting the database in parallel. The next two lines give summaries of the query response times. So this shows that a particular query for Mongo, which is about asking for the maximum CPU usage for one random host from the server fleet for a random hour group by the minute. It took an average of 3-and-a-quarter milliseconds which translates to about 300 queries per second, and we did 5,000 of them. And then the next line shows just a summary of that. The benchmarker can actually run more than one type of query simultaneously. And then finally, the wall clock time took about 4.3 seconds. For Influx, just like with the Bulk Loader, we used the fasthttp library. And the output for Influx query benchmarking is very similar to what it is for Mongo, and you’re welcome to read it here.
Robert Winslow 20:42.791 So now we’re going to go to the last phase of the methodology explanation. And this phase is definitely my favorite, and it’s where we’re going to validate that the results are correct between the databases. So I mentioned at the beginning of the talk that we have a number of requirements, a number of goals, and primary amongst those is, are we asking the same questions of each database? And so we have developed a way to validate that the results that we received from the query benchmarkers for different databases match our expectations, and are correct and identical. We care a great deal about being correct. The query validation phase answers the following questions; were the parameters in the points simulated from the simulation identical for both databases? Were the points loaded correctly into both databases? Were the parameters that we put into the queries identical for both databases? And were the query results identical? And which would show that the queries that we submitted were semantically identical? So the validation phase checks that the floating point values, the numerical results from the queries to different databases, are identical. And so here’s an example. So we’re going to compare the results for the same query. And the way this works is there’s a special flag for the query benchmarker that will show you the query results and so we are able to compare them side by side. And so for a given query, we have here a slice of the response for Mongo and a slice of the response for InfluxDB. And you can see that the data is the same. So for the Mongo lines, it shows January 1st, 2016 at midnight is 16.64 and January 1st, 2016 at 1:00 AM is 31.869.
Robert Winslow 22:40.980 And for Influx, going down in the second box, you can see it’s exactly the same. It’s January 1st at midnight, 16.64 and January 1st at 1:00 AM, 31.869. And so this is how we kind of show that the whole end-to-end process of benchmarking is doing what we expect. And this level of rigor has allowed us to move fast with confidence-move quickly with confidence when making these benchmarks. Of course, it’s not perfect. As you maybe noticed in the previous slide, the floating point values can become a little off as you go into more precise numbers. We’ve figured out ways to handle that, mainly by just setting a tolerance. Right now, validation is a manual process. We have thoughts on how to automate it. Probably the most interesting would be to actually use the simulation itself to generate the query results. So that would involve running the queries against a kind of a gold database that is very simple in-memory object. And that would give us back the gold query results. And then, we would use the results from the real queries executing as databases against that.
Robert Winslow 24:01.270 So now to conclude, I’d like to ask, did we meet our goals? Well, we were trying to be realistic. Our data and our queries are modeled on the real world DevOps use-case, and our simulation approach creates plausible time series data for a whole server fleet. Were we rigorous? Well, query validation proves correctness end-to-end. We are certainly reproducible using configurable pseudo-random number generator seeds. We have deterministic data and query generation. We’re open. All source code is available on GitHub under a permissive license. And were we fair? We used Mongo in the most favorable way we knew how.
Robert Winslow 24:50.331 So I just talked for a while about methodology and I want to spend a quick moment talking about some of the choices we made with Mongo. I will note that it was a little bit tricky to get the schema from Mongo to meet what we felt were our standards. In particular, all I claimed was that there are no best practices for Mongo that satisfied all our requirements. There are certainly a lot of blog posts and white papers about this, but they tend to make certain assumptions about data rate ingestion and how the data will be queried. In particular, I often saw advice to increase throughput and minimize disk usage by batching points together into fixed-width arrays. So in Mongo, you might have a document collection and there are many ways to model time series in that. And so one way is to have a time series value per document which gives many documents. Or you could have multiple time series values per document, and so you have sort of a user-created, very shallow tree and not that case. And often the best practices that we saw sort of made assumptions about-let’s say that you were expecting so many points per minute and that was about always the same, then you would stick 100 points per minute into a document and then move onto the next document and the next document. And so each document would represent a time slice or a time shard.
Robert Winslow 26:18.563 This was very unsatisfactory to us because it assumes a lot about the nature of the data coming in, and we felt that that was an unrealistic assumption. And so that was one of the issues that we ran into early on. Second, it is possible to increase rate throughput in Mongo by using multiple collections. In our case that would have mapped to one measurement per collection. So for example, CPU would be one collection, Redis would be one collection, Nginx would be one collection. But in the general time series use case, clients of a server or a server cluster can be expected to create arbitrary new measurements. And so what this means is that clients are creating new document-would create new document collections in Mongo. And we actually found that when we had two parallel workers trying to write the first point in a collection each-so let’s say you have two parallel requests coming in to write a CPU value-if those two tried to write at the same time, they actually ran into contention issues with Mongo. And we became worried that we might be shooting ourselves in the foot in terms of correctness, that that would lead to data races, in particular, might lead to bugs that we may not know about. And so we felt that that was not robust enough to represent what a production Mongo solution looked like. And so we moved to a single collection mode.
Robert Winslow 27:56.901 Finally, we looked at interning tag strings so as to use just integers. The notion here was that we’d like to minimize the space used by each document. So in our case, a document is a value. And so we have a measurement, and associated with that, are some tags-for example, the machine’s name or the data center it’s in, and then finally the field name, say CPU usage user and then the value, say 50-we would like to minimize the space used by all those fields. And so it’s possible to do what’s called interning where you keep a collection of tag strings that you’ve already seen and you map them to unique integers and then you use those unique integers when writing the datapoints. So you should save space. We found that due to the lack of transactional support at Mongo that that did not work very well. In particular, the same worry about the data races was there. And so it was very tricky to try to get unique IDs assigned to each tag because writes could come in at the same time and overwrite each other. This is an effect where we want a shared serializable data structure and memory, not a collection with duplicate values. We really want a unique mapping from tag IDs and integers. So we used the actual strings. Of course, Mongo provides snappy compression and so we think that the space overhead of that was somewhat minimized. But certainly, that could result in higher disk usage. So these were some of the few kinds of caveats or issues that we ran into in designing a Mongo schema. Okay, thank you very much.