Let’s Compare: Benchmark Review of InfluxDB and MongoDB 3.6.2
In this webinar, Ivan Kudibal will compare the performance and features of InfluxDB and MongoDB for common time series workloads, specifically looking at the rates of data ingestion, on-disk data compression, and query performance. Come hear how Ivan conducted his tests to determine which time series database would best fit your needs. We will reserve 15 minutes at the end of the talk for you to ask Ivan directly about his test processes and independent viewpoint.
Watch the Webinar
Watch the webinar “Let’s Compare: Benchmark Review of InfluxDB and MongoDB” by filling out the form and clicking on the download button on the right. This will open the recording.
Here is an unedited transcript of the webinar “Let’s Compare: Benchmark Review of InfluxDB and MongoDB.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
• Chris Churilo: Director Product Marketing, InfluxData
• Ivan Kudibal: Co-founder and Engineering Manager, Bonitoo
• Vlasta Hajek: Software Developer, Bonitoo
• Tomáš Klapka: DevOps Engineer, Bonitoo
Chris Churilo 00:00:00.665 So welcome, everybody. My name is Chris Churilo. I work for InfluxData. Today, we have my friends Ivan, Vlasta, and Tomas, who'll be presenting. These guys are from a company called Bonitoo. I wanted to make sure that I had an independent consultant run the benchmarks, so we could be as unbiased as possible. And today they're going to be reviewing the work that they did to conduct the benchmarks, as well as the results from the benchmarks. So with that, I will hand the mic over to the Bonitoo guys.
Ivan Kudibal 00:00:33.390 Thank you very much, Chris. Hi, everybody. My name is Ivan Kudibal, and I lead a small company located in Prague. The company consists of software engineers. Today at this webinar I would like to introduce Vlasta Hajek and Tomas Klapka, who are engineers, and myself. What we have prepared for you today is a comparison of the performance of InfluxDB and MongoDB in the context of a specific DevOps use case, the monitoring use case. So basically, we refreshed the benchmarking efforts that were first conducted in 2016. The framework is available on github.com as an open public repository, so you're encouraged to use it.
Ivan Kudibal 00:01:44.390 MongoDB, just to say in the beginning, is a document-oriented database, classified as a NoSQL database. InfluxDB is a Time Series Database; in general, it is designed to store time series data and measurements. The structure of this webinar consists of four parts. The first one is this introduction to the InfluxDB comparison framework, which is going to be presented by Vlasta. Then we would like to show you a short demo of the framework, so that you can follow us and try the framework for other comparisons, or with a specific database on your end. In part three, the benchmarking and report are going to be presented by Vlasta again. Finally, we will end with conclusions and a Q&A section. So with this, I'm giving the word to Vlasta, who will introduce you to the framework and the basic methodology.
Vlasta Hajek 00:03:08.491 Okay. Thanks, Ivan. So as Ivan told you, we were running the benchmarks as they were run in 2016, now with the latest versions of InfluxDB and MongoDB. The previous measurements were done with InfluxDB 1.0 and MongoDB 3.3.11. Now we have InfluxDB 1.4 and MongoDB 3.6. I will briefly describe the methodology and the framework used to benchmark InfluxDB and MongoDB, and some data. The methodology and framework were originally designed and developed by Robert Winslow, and we've used them without any major modifications. Robert already explained the framework and the testing approach in detail in his webinars, which were held almost a year ago. If you would like to hear more details, I strongly advise you to watch those webinars. You can find them easily on the InfluxData website.
Vlasta Hajek 00:04:44.858 So we could spend hours discussing benchmarks and methodologies. Of course, there are common properties that each approach must hold. Especially, it must be realistic: it must simulate some real-world usage. The benchmark must be fair and unbiased to all databases. And it must be [inaudible]. In our case, we've selected the real-world use case where there is a DevOps engineer maintaining a fleet of tens, hundreds, or maybe even thousands of machines, with a dashboard showing metrics gathered from those servers. I mean hardware metrics such as CPU, kernel, memory, disk, and network, and some application metrics about Nginx, PostgreSQL, and Redis. Those metrics are collected by agents running on the servers, let's say by Telegraf, and they are sent to a central database. If those metrics were gathered every 10 seconds from hundreds of servers, we would see an almost continuous stream of data going into the database. Measurements have from 6 to 20 values, 11.2 values on average, and as there are 9 measurements, that gives 101 values in total per single host. And those values are written during each data collection.
Vlasta Hajek 00:06:57.102 Each measurement also has 10 tags, where mostly system info and data center metadata are placed, such as hostname, region, data center, operating system, and so on. These metrics are then generated using a pseudo-random generator employing a random walk algorithm, to achieve both reproducibility and variance of the data. So datasets of any length can be generated using those tools. For the basic comparison, we've used a dataset with 100 hosts over the course of a day. So what is actually measured in the benchmarks? For the time series use case, and also for other types of data storage, it's definitely important to benchmark the ingestion rate: how quickly the database can load data, which gives a view of how much data the database can handle at a time. We measure it in values per second, and the higher, the better. It's also important to know how the data translates to disk usage, what the final cost is, and how efficiently the database engine uses the provided disk space. Here, the less, the better.
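As an editorial aside: the seeded random walk Vlasta mentions can be sketched in a few lines of Go, the framework's own language. This is only an illustration of the idea, not the repository's actual generator; the function name, parameters, and bounds are invented for the example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomWalk illustrates the idea behind the data generator: a seeded
// PRNG drives a bounded random walk, so runs are reproducible (same
// seed gives the same series) while the values still vary over time.
func randomWalk(seed int64, steps int, start, stepSize, min, max float64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	vals := make([]float64, steps)
	v := start
	for i := 0; i < steps; i++ {
		v += (rng.Float64()*2 - 1) * stepSize // random step in [-stepSize, +stepSize]
		if v < min {
			v = min
		}
		if v > max {
			v = max
		}
		vals[i] = v
	}
	return vals
}

func main() {
	// Two walks with the same seed produce identical "CPU usage" series.
	fmt.Println(randomWalk(42, 6, 50, 5, 0, 100))
	fmt.Println(randomWalk(42, 6, 50, 5, 0, 100))
}
```

Re-running the generator with the same seed therefore reproduces the same dataset, which is what makes the benchmarks repeatable across databases.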
Vlasta Hajek 00:09:02.482 And of course, we would like to perform queries and read the data as fast as possible, to have our dashboards read metrics fast and be responsive. This we measure as the time interval during which a query is processed, and from that we calculate the possible number of queries per second. The higher the query rate, the better. In our case, as we are storing time series data, there are no updates in the recent [inaudible] measurements. About the benchmarking framework: it's written in the Go language, and it consists of a set of tools, each specific to a different phase of benchmarking, and different for each database. In the first phase, we generate the input data in a wire format specific to each database. For InfluxDB, the data are basically plain text. For MongoDB, we store data points as a series of FlatBuffers objects. FlatBuffers is a zero-copy, and therefore very fast, data serialization format, and this is our attempt to reduce the serialization overhead when running the benchmark.
Vlasta Hajek 00:10:49.494 Then we use the pre-generated data and transfer it over the network to the database. To achieve the best performance, for InfluxDB we used the fasthttp library, where data are sent over the HTTP protocol. And for MongoDB we use the bulk write format, which is encoded in BSON; for this we use the standard mgo driver for Go. And similarly for queries: in formats specific to each database, we generate various queries, differing mainly in the [inaudible] condition, and use the data to feed the query benchmarking tool, which sends the queries to the database. It measures the response time and calculates the mean, maximum, and minimum values. To achieve the best performance, the query benchmarking tool doesn't validate the results. But it's possible to use a special debugging option which allows printing the formatted responses and comparing the responses from both databases.
Vlasta Hajek 00:12:24.349 Here's an example of InfluxDB data. InfluxDB uses the so-called line protocol format, which is also nice: human-readable and writable. It consists of four parts: the name of the measurement; a set of tags, which are indexed; then a set of fields, which are typed and are not indexed; and finally a timestamp, by default in nanoseconds. So InfluxDB uses nanosecond precision by default, and this is what we also simulate for Mongo. This complete row defines what is called, in InfluxDB terminology, a point. And several points in the same measurement define a series. And such an [inaudible] point allows writing—as you can see, it can write several values at a time. So it might seem that InfluxDB has a fixed schema. By schema, I mean measurement name, tags, fields, and timestamp. However, it's flexible as regards the number of tags or fields. Any point can add a new tag or field, or tags can be skipped entirely; all that's required is at least one field.
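To make the four parts concrete, a line protocol point looks roughly like the following. The measurement, tag, and field names here are illustrative, not taken from the benchmark's slides:

```
cpu,hostname=host_0,region=eu-west-1 usage_user=58.13,usage_system=2.77 1514764800000000000
```

That is: measurement (`cpu`), comma-separated tags (`hostname`, `region`), space, comma-separated typed fields, space, and a nanosecond timestamp.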
Vlasta Hajek 00:14:10.715 Here we see examples of memory, network, and also PostgreSQL points. A few notes on MongoDB's schema design. As Mongo is a general-purpose NoSQL database, it requires some small upfront decisions about how data will be stored. There are various best practices, tech papers, and blog posts that guide how a Mongo schema should be designed. However, many make assumptions that don't fit the general time series purpose, in our understanding or opinion. So in our design, instead of having multiple collections, we decided to store all values in a single document collection. Each document in the collection stores a single value, which is defined by its name and a group of tags. The timestamp is, as previously said, a 64-bit integer storing a nanoseconds value. And finally, the document database uses Snappy compression.
Vlasta Hajek 00:15:51.897 You should definitely watch Robert Winslow's webinars; he speaks in more detail about MongoDB's schema design decisions and [inaudible]. So here is an example of a MongoDB document. As described before, each value is inserted in a separate document. Note the series name, then the field name, the timestamp in nanoseconds, and the set of tags. Here is just an example of the tags, key-value pairs, to give an idea of what it looks like, and finally the field value. The series name, field name, timestamp, and tags are indexed. So here is an example for InfluxDB. For measuring query performance, we use a query that would be a typical source of data for DevOps and monitoring tools, as we touched on in the general introduction to the methodology. In this case, the queries extract the maximum CPU usage for a given host over the course of a random hour interval, grouped by one-minute intervals. InfluxDB uses the InfluxQL query language, which is very similar to SQL and is quite easily writable and readable.
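The slide Vlasta is walking through is not reproduced in this transcript, but a document in the single-collection design he describes would look roughly like this. The key names are illustrative guesses, not the exact schema from the repository:

```
{
  "measurement": "cpu",
  "field": "usage_user",
  "timestamp_ns": NumberLong("1514764800000000000"),
  "tags": { "hostname": "host_0", "region": "eu-west-1" },
  "value": 58.13
}
```

One value per document, identified by its measurement and field names plus the tag set, with the nanosecond timestamp stored as a 64-bit integer, matches the description above.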
Vlasta Hajek 00:17:55.229 Mongo queries are also serialized in BSON format. Here is just the JSON version of that, on the first page of two. It's a Mongo query that looks at the time series collection and computes a basic aggregation over it. You can see how we use nanosecond precision to look for the greater-than-or-equal field and the less-than field. We have a tags filter where we actually look for a hostname, a host named host_9, and the field value which we want to select. And finally, near the bottom, we have the time bucket. So this is an hour, but in nanoseconds: 1 billion nanoseconds times 3,600. On page two, there is just the grouping part and the sorting part. So as you can see, Influx queries are quite a bit shorter, and also probably easier to write. Now that you hopefully know at least a little about the methodology and the framework, let's see how easy it is to use, and Tomas will show you a demo of using the benchmark tools for Mongo.
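For comparison, the InfluxQL version of the "max CPU for one host over a random hour, grouped by minute" query would look something like the sketch below. The measurement and field names (`cpu`, `usage_user`) and the concrete hour are assumptions for illustration:

```
SELECT max(usage_user) FROM cpu
WHERE hostname = 'host_9'
  AND time >= '2018-01-01T10:00:00Z' AND time < '2018-01-01T11:00:00Z'
GROUP BY time(1m)
```

A few lines of SQL-like text versus two pages of aggregation-pipeline JSON is the brevity difference the speaker is pointing at.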
Tomas Klapka 00:20:08.610 Okay. Thank you Vlasta.
Tomas Klapka 00:20:26.365 In the following couple of minutes, I'm going to show you a simple practical demonstration of the benchmark comparison tools. So today, I'm going to benchmark MongoDB 3.6, which I have already prepared on my localhost. It runs in a local container, as that is the easiest way to get it online in a short time. Firstly, if you don't have the Golang binary already installed on your machine, feel free to download it from the official website. There are also useful guides on how to install it and get it working on most operating systems. So after setting up your Go installation, it's time to get the command line tools from the remote git repository. I'm going to use the go get command, which is the standard way of downloading and installing packages and their dependencies. In this example, you will need four tools from our repository. And after go get is done with all the magic, it will be ready to go.
Tomas Klapka 00:21:48.696 So at first, to generate some data, we need the bulk_data_gen tool. So let's install it. For loading our datasets to the database, you will need the bulk_load_mongo tool. And similarly to the first bulk_data_gen example, we need something for generating queries like this. And finally, the query benchmarker tool for Mongo. Now let's have a look at the data generation and ingestion commands. The bulk_data_gen command right here will generate the dataset according to all those input parameters, and then it will be used as an input stream for the next command, called bulk_load_mongo.
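The exact commands are not legible from the transcript, but the pipeline Tomas describes looks roughly like the following. Flag names follow the influxdb-comparisons tools' conventions at the time; check each tool's `--help`, as exact flags and defaults may differ:

```
go get github.com/influxdata/influxdb-comparisons/cmd/bulk_data_gen
go get github.com/influxdata/influxdb-comparisons/cmd/bulk_load_mongo

bulk_data_gen -use-case devops -scale-var 100 \
  -timestamp-start "2018-01-01T00:00:00Z" -timestamp-end "2018-01-02T00:00:00Z" \
  -format mongo \
  | bulk_load_mongo -url localhost:27017 -workers 2
```

The generator writes the dataset to stdout in Mongo's wire format, and the loader consumes it as a stream, so no intermediate file is needed.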
Tomas Klapka 00:23:45.204 It takes approximately 10 seconds. Okay. Now we get a result of approximately 17 seconds, with two workers, as you can see from the parameter here. And the mean ingestion rate is about 12,000 values per second.
Tomas Klapka 00:24:26.363 Okay. After the main data loading, we can finally benchmark our queries. So I'm going to show you my command.
Tomas Klapka 00:24:49.945 As you can see here, we use the bulk_query_gen tool, which will generate the query set according to those parameters. Then it will be sent through the query benchmarker for Mongo, which has 100 queries as its [inaudible] set and uses my local MongoDB installation. In this case, we use just one worker. Sorry, there is a little mistake: 1,000 queries will be better. Okay, now we have some results. As you can see here, we benchmarked the whole CPU metric, or max CPU, for one host only. It was randomized over one-hour intervals and grouped by one minute, with a mean time of three milliseconds per query. Converted, that gets us a value of almost 319 queries per second. Okay. That's pretty much all, and I hope this showed you how easy it can be, if you want to try it from scratch. And yes, there might be quite high variation in all those parameters, but in the future they can help you adjust all the inputs, data and requests, to fit your needs. So feel free to try it. All those tools are publicly available in our InfluxDB benchmark comparisons GitHub repository. Thank you for watching, and I pass the word back to Vlasta.
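The query side of the demo follows the same pattern. Again, flag names here are a sketch in the style of the influxdb-comparisons tools, not a verbatim copy of Tomas's command; the query-type value in particular is an assumption:

```
go get github.com/influxdata/influxdb-comparisons/cmd/bulk_query_gen
go get github.com/influxdata/influxdb-comparisons/cmd/query_benchmarker_mongo

bulk_query_gen -use-case devops -scale-var 100 \
  -timestamp-start "2018-01-01T00:00:00Z" -timestamp-end "2018-01-02T00:00:00Z" \
  -query-type "1-host-1-hr" -queries 1000 -format mongo \
  | query_benchmarker_mongo -url localhost:27017 -workers 1
```

The benchmarker reports min, mean, and max latencies per query type, from which the queries-per-second figure quoted above is derived.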
Vlasta Hajek 00:27:49.960 Okay, thanks, Tomas. So let's now see how we did the benchmarking, and what the results are. We used both databases in a single-host deployment, both with the default installations, without any tweaking. And there were also no daemons in memory, except for Telegraf, for collecting the system metrics. So at least there was no remarkable [inaudible] affecting the benchmark on the system. For running in the single-host deployment, you can see that both databases were similarly easy to install and run.
Vlasta Hajek 00:28:58.472 In our benchmarking, we've run the benchmarks on two types of hosts: cloud-based virtual hosts, and an on-premises bare metal machine. We also wanted to validate whether someone should be worried about the performance of a cloud-based virtual machine compared to bare metal machines, when following the current trend of going to the cloud. So we had an HP [inaudible] available with an Intel Xeon E5-2640 v3 running at 2.6 GHz, with 16 cores and hyperthreading. The only drawback of this machine was that it has SCSI hard drives. In AWS, we've chosen the c4.4xlarge machine, because it has similar parameters, just with a slightly faster CPU, and EBS SSD drives. The result of this comparison is that AWS gives comparable results to the bare metal ones; it was even a little faster, taking advantage of the SSDs.
Vlasta Hajek 00:30:38.370 So, the results. We are comparing InfluxDB 1.4.2 and Mongo 3.6. We used a dataset that simulated collecting metrics over the course of 24 hours from 100 hosts at 10-second intervals. For ingestion, we used four parallel threads to send the [inaudible] data. Querying was done as shown in the example: maximum CPU usage for a given host over the course of a random hour interval, grouped by one minute. So InfluxDB outperformed MongoDB in this time series use case, we can say, by an order of magnitude in all metrics. InfluxDB has an almost 57 times better ingestion rate, it uses 115 times less disk space, and finally, it responds 11 times faster to queries. Both databases showed quite good vertical scalability, from what we've observed. But just to repeat, we used the default installations and were comparing those. So we can imagine that with some little tweaking, or even with some cluster [inaudible], we could get different results.
Ivan Kudibal 00:32:32.559 Well, this is Ivan. So what can we say in the part titled conclusions? Believe me, MongoDB is the number one database for application development in our company. But actually, I wouldn't choose MongoDB as a database for the time series use case. At least, when we think of time series and the DevOps monitoring of, let's say, hundreds or thousands of VMs posting their data via Telegraf to an InfluxDB, it seems that the TICK Stack is the number one selection. And, well, even though MongoDB is a flexible engine for storing documents, with which you really can achieve a lot of use cases when you follow the recommendations, I wouldn't recommend using a MongoDB application development stack for the time series and monitoring use case.
Chris Churilo 00:36:31.329 Thanks, everybody. If you do have any questions, please feel free to put them in the chat or the Q&A panel. We'll make sure that we get those answered for you. Hopefully, you found this information useful. And if you do want to take a look at a little bit more of the details about our benchmark tool itself, you can find it in the GitHub repo. And, similar to what we did with Cassandra over the last couple of weeks, we'll post the presentation to SlideShare and we'll get these videos up as well. So we'll leave the lines open for a few minutes, and if you have any questions, don't be shy. Please feel free to post them.