Let's Compare: Benchmark Review of InfluxDB and Cassandra 3.11.1
In this webinar, Ivan Kudibal and team will compare the performance and features of InfluxDB and Cassandra 3.11.1 for common time series workloads, specifically looking at the rates of data ingestion, on-disk data compression, and query performance. Come hear about how Ivan conducted his tests to determine which time-series db would best fit your needs. We will reserve 15 minutes at the end of the talk for you to ask Ivan directly about his test processes and independent viewpoint.
Watch the Webinar
Watch the webinar “Let’s Compare: Benchmark Review of InfluxDB and Cassandra 3.11.1” by filling out the form and clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “Let’s Compare: Benchmark Review of InfluxDB and Cassandra 3.11.1.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers:
- Chris Churilo: Director Product Marketing, InfluxData
- Ivan Kudibal: Co-founder and Engineering Manager, Bonitoo
- Vlasta Hajek: Software Developer, Bonitoo
- Tomáš Klapka: DevOps Engineer, Bonitoo
Chris Churilo 00:00:00.210 All right. Once again, welcome everybody, for joining us at our webinar this week. We will be sharing our benchmark results of InfluxDB and Cassandra. My name is Chris Churilo and I’ll be your host today. Just want to remind everyone, put your questions in the Q&A or chat panel. We’ll get them answered at the end of today’s session. Since we do have three speakers that will be talking today, it’ll be a little bit easier to coordinate the Q&A section at the end. We will also be posting a partner tech paper that goes with the webinar so you’ll be able to see the benchmark reports, and everything will be posted on our website later. So with that, I’m going to pass the ball over to Ivan.
Ivan Kudibal 00:00:42.752 Thank you very much, Chris. So good morning, good afternoon everybody. Welcome to the benchmarking InfluxDB versus Cassandra presentation. My name is Ivan Kudibal. I am head of the Bonitoo company. The Bonitoo company is an independent third party of engineers from Prague. There will be three speakers today, myself, and then Vlasta Hajek, an engineer, a senior engineer, and Tomas Klapka, the DevOps engineer. We are going to show you our comparisons that we created in order to provide unbiased comparisons of the benchmarks of InfluxDB and Cassandra. This is the second presentation. You’re welcome to watch the InfluxDB in the last research report that we did prior to this webinar. And this is also-the Cassandra is not in the last webinar that we are going to have.
Ivan Kudibal 00:02:04.167 So what was the mission? We were asked by Chris to perform the benchmarks. We actually refreshed the benchmarking efforts that were conducted back in 2016. So we used the existing testing framework called InfluxDB Comparisons, which is publicly available at github.com. And you are also welcome to download, compile, and use this InfluxDB Comparisons framework for your own purposes. Today we are also going to show you how to use that framework. Well, a few words about Cassandra. It’s a general-purpose NoSQL database management system with a focus on high availability. InfluxDB is a time series database. In general, it is designed to store and query time series based measurements.
Ivan Kudibal 00:03:03.602 The structure of this webinar has four parts. The first one is going to be an introduction to the InfluxDB Comparisons framework, to be presented by Vlasta. The second one is the demo of InfluxDB Comparisons. The third part will be focused on Cassandra and the benchmarking, and finally the report. Having all these three done, we are going to draw some conclusions about both databases and we’ll open the Q&A. So let me pass the ball to Vlasta, who will introduce you to the InfluxDB Comparisons framework and the benchmarking use cases.
Vlasta Hajek 00:03:55.335 Okay. Thanks, Ivan. So let me first share my desktop. Hopefully you all see it. So as Ivan told you, we were running the benchmarks as they were run in 2016, now with the latest versions of InfluxDB and Cassandra. Just to note, the previous measurements were done against InfluxDB 1.0 and Cassandra 3.7. So I will now briefly describe the methodology and the framework used to benchmark InfluxDB and Cassandra. The methodology and framework were originally designed and developed by Robert Winslow, and we’ve used them without any major modifications. Robert already explained the framework and the testing approach in detail in his webinars. If you would like to hear more details, I advise you to watch those webinars. You can easily find them on the InfluxData website. We could spend hours discussing the best benchmarking methodology. Of course, there are some properties that each approach must hold. In particular, it must be realistic, to simulate real-world usage. It must be fair and unbiased to all the databases, and it must also be reproducible.
Vlasta Hajek 00:05:53.305 In our case, we’ve selected a real-world use case where there is a DevOps engineer who maintains a fleet of tens, or hundreds, or maybe even thousands of machines. He’s watching a dashboard which shows the metrics gathered from those servers: hardware metrics such as CPU usage, kernel, memory usage, disk space, disk I/O, network, and also application metrics about NGINX, PostgreSQL, and Redis. Those metrics are collected by agents running on the servers, let’s say, by Telegraf, and they are sent to a central database. If those metrics are gathered every 10 seconds from hundreds of servers, we could see an almost continuous stream of data going into the database.
Vlasta Hajek 00:07:15.416 Measurements have from 6 to 20 values, which means 11.2 values on average. As there are 9 measurements, it gives us 101 values in total which are written during each data collection. Each measurement also has 10 tags where mostly system info and the data center metadata are placed, such as the name, region, data center, operating system, versions, and so on. The metrics are generated by a random generator employing a random-walk algorithm to achieve a realistic variance of data. And data sets can be generated for any length of time. For the basic comparison, we’ve used a data set with 100 hosts over the course of a day.
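As an editorial illustration of the data generation described above, here is a minimal Python sketch of a clamped random walk, plus the data set size arithmetic. It is not the framework's Go code. The function names, the Gaussian step, and the 0 to 100 bounds are assumptions made for this example; the 10-second interval, 9 measurements, 11.2 average values, and 100 hosts come from the talk.

```python
import random


def random_walk(steps, start=50.0, step_std=1.0, low=0.0, high=100.0, seed=None):
    """Return `steps` values; each is the previous value plus Gaussian noise,
    clamped to [low, high] (hypothetical bounds, e.g. a CPU percentage)."""
    rng = random.Random(seed)
    value = start
    series = []
    for _ in range(steps):
        value += rng.gauss(0.0, step_std)
        value = min(high, max(low, value))
        series.append(value)
    return series


def dataset_size(hosts, seconds, interval_s, values_per_collection):
    """Total number of values written for the whole simulated period."""
    collections_per_host = seconds // interval_s
    return hosts * collections_per_host * values_per_collection


# 9 measurements with 11.2 values on average is 100.8, i.e. about 101 values.
VALUES_PER_COLLECTION = round(9 * 11.2)

if __name__ == "__main__":
    print(random_walk(5, seed=1))
    # 100 hosts, one day, one collection every 10 seconds.
    print(dataset_size(100, 24 * 60 * 60, 10, VALUES_PER_COLLECTION))
```

Under these assumptions, a one-day data set for 100 hosts holds roughly 87 million values.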
Vlasta Hajek 00:08:32.383 What we measured. For the time series use case, and also for other types of data storage, it’s definitely important to benchmark the input data rate, or ingestion rate. That is how quickly the database can ingest data written to it, which gives us a view of how much data the database can handle at a time. We measure this in values per second, and the higher the value is, the better. It’s also important to know how the data are transferred to disk, what the final disk usage is, and how efficiently the database engine uses the provided disk storage. Here, the less, the better. And of course we would like to perform queries and read the data as fast as possible, so that our dashboard with our metrics responds fast, shows the appropriate values, and updates as quickly as possible.
Vlasta Hajek 00:10:02.172 So we measure the time interval during which a query is processed and calculate the possible number of queries per second. The higher the query rate is, the better. As our use case is storing time series data, where there are no updates or deletes required, there is no need to do other measurements. About the benchmarking framework: it’s written in the Go language and it consists of a set of tools, each specific to a phase of benchmarking and to each database. In the first phase, we generate the input data in the wire format specific to each database. For example, for InfluxDB, the data are plain text. Cassandra’s ingestion data are also plain text. Preparing the data in advance reduces the further overhead for the [inaudible] tool. Then we use the generated data to transfer it over the network to the database. To achieve the best performance, we employed the fasthttp library for Influx, where the HTTP protocol is used, and used the gocql library for Cassandra, which uses Cassandra’s binary protocol version four. And the database bulk [inaudible] are engaged where possible, and it is similar for queries. In a format specific to each DB, we generate queries that vary mainly in the search condition, and use that data to feed the query benchmarking tool, which sends the queries to the database. Then we measure the response time and calculate the min, max, and mean values. To achieve the best query performance, the benchmarking tool doesn’t validate the results. But it’s possible to use special debugging options which allow you to print pre-formatted responses and compare responses from both databases.
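The response-time statistics and the derived query rate mentioned above can be sketched as follows. This is an illustrative Python example, not the benchmarker's actual code. The function name and the throughput estimate (each worker issues queries one after another, so the rate is workers divided by mean latency) are assumptions.

```python
def summarize(latencies_ms, workers=1):
    """Summarize per-query response times (milliseconds).

    The queries-per-second figure is an estimate that assumes `workers`
    clients each send queries back to back with no idle time.
    """
    if not latencies_ms:
        raise ValueError("no latencies recorded")
    mean = sum(latencies_ms) / len(latencies_ms)
    return {
        "min": min(latencies_ms),
        "mean": mean,
        "max": max(latencies_ms),
        "queries_per_sec": workers * 1000.0 / mean,
    }


if __name__ == "__main__":
    print(summarize([10.0, 20.0, 30.0], workers=2))
```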
Vlasta Hajek 00:13:14.738 InfluxDB uses a so-called line protocol format, which is also nicely human readable and writable, as you can see. It consists of four parts. The first is the name of the measurement. Then there is the set of tags, which are key-value pairs, and in the case of InfluxDB the tags are indexed. Then we have the set of fields, which are also key-value pairs; they are typed and they are not indexed. And finally, we have the timestamp. InfluxDB uses nanosecond precision by default, and we’ve also simulated nanoseconds for Cassandra. This complete row defines a data point. In InfluxDB terminology, several points with the same measurement and tags define a series, and such a data point allows writing several values at a time, as you can see. So maybe it could be understood that InfluxDB has a fixed schema. You have a solid set: measurement name, tags, fields, timestamp. However, it’s flexible as regards the number of tags and fields in a point. An ingested point can add a new tag or a new field, and tags can be totally skipped, so the only required parts are the measurement name and at least one field.
Vlasta Hajek 00:15:24.582 And this, guys, we can see the points for memory measurement, network, and PostgreSQL.
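To make the four-part layout of a point concrete, here is a small Python sketch that renders one point as InfluxDB line protocol. It is an illustration only: the helper names are hypothetical, the sample tag and field values are invented, and the escaping covers only the common cases (commas, spaces, and equals signs in keys and tag values, quotes in string fields).

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` that appears in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _field_value(value):
    """Render a typed field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):  # must be checked before int: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build: measurement[,tag=value...] field=value[,field=value...] timestamp"""
    if not fields:
        raise ValueError("a point needs at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # tags are optional; sorted for a stable output
        head += "," + _escape(key, ",= ") + "=" + _escape(tags[key], ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _field_value(value) for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


if __name__ == "__main__":
    print(to_line_protocol(
        "cpu",
        {"hostname": "host_0", "region": "us-west-1"},
        {"usage_user": 58.13, "usage_system": 2.5},
        1451606400000000000,
    ))
```

A point with no tags is still valid here, which matches the remark that only the measurement name and at least one field are required.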
Vlasta Hajek 00:15:46.094 For Cassandra, as it is a general-purpose NoSQL DB, we are required to design how the data will be stored. So the key characteristics are these: we’ve chosen to use a separate table for each value type. So different field types need to be written to different tables. Some are integers, some are floats or doubles. Some are Booleans, and others are string blobs. However, in the test simulation, we don’t use string blobs or Booleans, but we test 64-bit integers and 64-bit doubles, and we simulate the timestamp with a 64-bit integer. For the primary key, we use the series name and the timestamp. The partition key is the series name. The series name is determined from the measurement name, tags, field name, and time bucket, and it’s done this way so it can be used by the client-side index, which will be described a few slides further. And compact storage was chosen.
Vlasta Hajek 00:17:24.987 So Cassandra’s CQL for DDL is almost identical to what you may know from SQL, for those who don’t know Cassandra’s CQL. As I described before, each value is inserted separately into the table based on its type. So you can see the tables have different names per type. But it means that there is a significantly bigger load on the database compared to InfluxDB, where the whole set of values is inserted in one point. So here you may note that the series name, which is this whole string, is made up of the measurement name, the tag key-value pairs, the field name, and the time bucket.
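The per-value layout described above can be sketched in Python as follows. This is an editorial illustration under stated assumptions, not the framework's schema: the table names (`series_bigint` and so on), the `#` separator in the series name, and the one-day time bucket are all hypothetical choices for this example. What comes from the talk is that each field value becomes its own row, routed to a table by type, keyed by a series name built from measurement, tags, field name, and time bucket.

```python
from datetime import datetime, timezone


def time_bucket(timestamp_ns):
    """Assumed bucketing: one bucket per UTC day."""
    seconds = timestamp_ns // 1_000_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def series_name(measurement, tags, field, timestamp_ns):
    """Compose the partition key from measurement, tags, field name, and bucket."""
    tag_part = ",".join(f"{key}={tags[key]}" for key in sorted(tags))
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head}#{field}#{time_bucket(timestamp_ns)}"


def table_for(value):
    """Pick a (hypothetical) table name by value type."""
    if isinstance(value, bool):  # check bool before int
        return "series_boolean"
    if isinstance(value, int):
        return "series_bigint"
    if isinstance(value, float):
        return "series_double"
    return "series_blob"


def to_rows(measurement, tags, fields, timestamp_ns):
    """Explode one point into one (table, series name, timestamp, value) row per field."""
    return [
        (table_for(value), series_name(measurement, tags, name, timestamp_ns), timestamp_ns, value)
        for name, value in fields.items()
    ]


if __name__ == "__main__":
    for row in to_rows("cpu", {"hostname": "host_0"}, {"usage_user": 58.13, "usage_idle": 40}, 1451606400000000000):
        print(row)
```

A point with 10 fields therefore turns into 10 separate inserts, which is the extra write load the speaker refers to.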
Vlasta Hajek 00:18:59.627 For measuring query performance, we use a query that would be a typical example of a data request from a DevOps monitoring tool. In this case, we select the maximum user CPU usage over a given one-hour time interval, grouped by one minute. So here we have the InfluxDB example, which uses InfluxQL for writing queries, quite similar to SQL. Again, it is readable and writable for a human. However, the Cassandra query client required a fair amount of upfront work to get it to map to the time series use case, so a custom high-level query had to be designed to represent the Cassandra operations we needed to perform. Our Cassandra high-level queries are scatter-gather operations, because Cassandra is a great building block for distributed systems. But it is, of course, fairly low level, and it was necessary to add support for the general time series use case. In particular, this means that we need to support [inaudible] querying based on tags. And we needed to support grouping by time intervals, so we had to develop a piece of software called the smart client, along with designing the client-side index. This is middleware between Cassandra and the user.
Vlasta Hajek 00:21:11.334 So compared to InfluxDB, the Cassandra smart client is more complicated. On startup, it reads the table metadata from Cassandra and builds a client-side index. So at the start, it actually pulls data from Cassandra to populate the client-side data structures. And then for each incoming request, the benchmark client does the following. It parses the high-level query from standard input, which means that it looks at what kind of data it’s actually asking about, what tags it might apply to, and what time window. As mentioned earlier, we have to choose a way to shard or segment the data based on time buckets, and so this is where that logic comes in. It builds a query plan based on that, and it’s in scatter-gather format. But aggregation can happen either on the server or on the client.
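A rough Python sketch of the client-side index idea follows: series names are parsed once at startup, then each request is resolved to the series it needs to read. The class and method names are hypothetical, and the series-name layout (`measurement,tags#field#bucket`) is an assumption made for this illustration rather than the framework's confirmed format.

```python
def parse_series_name(name):
    """Split an assumed 'measurement,tag=value,...#field#bucket' series name."""
    head, field, bucket = name.rsplit("#", 2)
    measurement, _, tag_str = head.partition(",")
    tags = dict(pair.split("=", 1) for pair in tag_str.split(",")) if tag_str else {}
    return measurement, tags, field, bucket


class ClientSideIndex:
    """In-memory lookup from (measurement, field, tags, time bucket) to series names."""

    def __init__(self, series_names):
        # Built once at startup from metadata pulled from the database.
        self._entries = [(name, *parse_series_name(name)) for name in series_names]

    def lookup(self, measurement, field, tag_filter, buckets):
        """Return the series a query must read, one partition per result."""
        matches = []
        for name, m, tags, f, bucket in self._entries:
            if m != measurement or f != field or bucket not in buckets:
                continue
            if all(tags.get(key) == value for key, value in tag_filter.items()):
                matches.append(name)
        return sorted(matches)


if __name__ == "__main__":
    index = ClientSideIndex([
        "cpu,hostname=host_0#usage_user#2016-01-01",
        "cpu,hostname=host_1#usage_user#2016-01-01",
    ])
    print(index.lookup("cpu", "usage_user", {"hostname": "host_0"}, {"2016-01-01"}))
```

Each series returned by `lookup` would then become one sub-query in the scatter phase of the plan.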
Vlasta Hajek 00:22:46.912 To make it concrete, aggregation means, let’s say, the maximum CPU usage over a given time window. So if you’re asking about hour intervals over the course of a day, it will be 24 intervals, and we need to aggregate all the relevant data in each one of those intervals. And that can happen on the server or on the client. After building the query plan, it actually creates the CQL queries it needs to execute and submits them to Cassandra. And when the results finally come back, it does some merging within the time window to aggregate the data requested. But one thing to note is that we generate the CQL queries at benchmarking time, and this is the kind of thing which would be better to do ahead of querying. But the reason why there is this middleware logic is because you would actually have to do it in your application. The so-called high-level query for Cassandra is actually comparable to the query for InfluxDB, which is written in InfluxQL, and which asks for almost the same data. But InfluxDB takes the InfluxQL query and figures out that query for you at query time, whereas with Cassandra we had to build that mechanism to do query planning and result aggregation. So that happens at run time, to be as realistic as possible.
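The client-side aggregation step described above (the gather half of scatter-gather) can be illustrated with a short Python sketch. It groups raw points into fixed time windows and keeps the maximum per window. The function name and the convention that the end of the range is exclusive are assumptions for this example; it is not the smart client's actual code.

```python
def group_max(points, start_ns, end_ns, interval_ns):
    """Aggregate (timestamp_ns, value) pairs into per-window maxima.

    `points` may be the concatenated results of several sub-queries.
    Returns a list of (window_start_ns, max_value or None), one entry per
    window in [start_ns, end_ns), with the end treated as exclusive.
    """
    window_count = -(-(end_ns - start_ns) // interval_ns)  # ceiling division
    maxima = [None] * window_count
    for timestamp, value in points:
        if start_ns <= timestamp < end_ns:
            i = (timestamp - start_ns) // interval_ns
            if maxima[i] is None or value > maxima[i]:
                maxima[i] = value
    return [(start_ns + i * interval_ns, m) for i, m in enumerate(maxima)]


if __name__ == "__main__":
    MINUTE = 60 * 10**9
    sample = [(0, 1.0), (30 * 10**9, 5.0), (MINUTE, 2.0)]
    print(group_max(sample, 0, 3 * MINUTE, MINUTE))
```

For the benchmark query, a one-hour range grouped by one minute yields 60 windows.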
Vlasta Hajek 00:25:10.933 So, a high-level query example for Cassandra. This format is generated by the query generator when choosing the Cassandra format. It’s a fairly simple format, just a set of key-value pairs. So we have the measurement name, say, it’s CPU, and the FieldName, so we are asking about usage_user. The AggregationType is the maximum of these field values. The TimeStart and the TimeEnd define the chosen time interval this [inaudible] applies to, and it’s actually two hours in our case. And the TimeStart and TimeEnd would be a random hour interval inside that interval. And as you can see here, it starts at an arbitrary offset, say, 6 seconds and 42 minutes. It’s done so that the databases are less likely to optimize the query responses. If we used only hour boundaries for these queries, the database could actually start caching the responses. Using this fine-grained random hour interval lets us effectively do cache busting of the databases. And then the second to last field is the group-by duration. Here, we chose one minute. And then the tag sets are just a set of filters that you would apply to the tags before doing aggregations. Here we say we would like to get only data for the host name host_0. So this is the so-called high-level query, and it will be parsed and utilized by the query benchmarker we talked about a slide before. Now you hopefully know at least a little about the methodology and the framework, so let’s see how it is used. Tomas will show you the demo of using the benchmark tools for Cassandra.
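The idea of picking a random, non-aligned one-hour window for each query can be sketched in Python as below. This is an illustration only. The dictionary keys mirror the field names mentioned in the talk, but their exact spelling, the tag-set encoding, and the choice of a one-day span to draw from are assumptions for this example.

```python
import random
from datetime import datetime, timedelta, timezone


def random_window(span_start, window=timedelta(hours=1), span=timedelta(days=1), rng=None):
    """Pick a window of length `window` at a random one-second offset inside the span."""
    rng = rng or random.Random()
    max_offset_s = int((span - window).total_seconds())
    start = span_start + timedelta(seconds=rng.randint(0, max_offset_s))
    return start, start + window


def high_level_query(span_start, hostname, rng=None):
    """Build one high-level query as a plain dictionary of key-value pairs."""
    start, end = random_window(span_start, rng=rng)
    return {
        "MeasurementName": "cpu",
        "FieldName": "usage_user",
        "AggregationType": "max",
        "TimeStart": start,
        "TimeEnd": end,
        "GroupByDuration": timedelta(minutes=1),
        "TagSets": [[f"hostname={hostname}"]],
    }


if __name__ == "__main__":
    day = datetime(2016, 1, 1, tzinfo=timezone.utc)
    print(high_level_query(day, "host_0", rng=random.Random(42)))
```

Because the start offset is drawn at one-second granularity, two generated queries rarely cover the same window, which is what defeats response caching.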
Tomas Klapka 00:28:13.339 Okay. Thank you, Vlasta. Let me share my desktop. Okay. Hello, everybody. In the following couple of minutes, I’m going to show you a simple, practical demonstration of the benchmark comparison tools. Today I’m going to benchmark Cassandra 3.11, which I have already prepared on my local host. It runs in a Docker container, as that is the easiest way to get it online in a short time. Firstly, if you don’t have the Go binary already installed on your machine, feel free to download it from the official website. There are also useful guides on how to install it and get it working on most operating systems. After setting up your Go installation, it’s time to get the command line tools from the [inaudible] repository. I’m going to use the ‘go get’ command, which is the standard way of downloading and installing packages and their dependencies. In this example, we will need exactly four tools from our repository, and after ‘go get’ is done with all the magic, we will be ready to go. At first, to generate some data, we need the bulk_data_gen tool. So now we can see here the command line for that tool. So I get it from the repository. For loading our data set into the database, we will use the bulk_load_cassandra tool. It’s the same command. And similarly to the first, bulk_data_gen example, we need something for generating queries. It’s called the bulk_query_gen tool. And finally, the query benchmarker tool for Cassandra.
Tomas Klapka 00:31:05.240 Okay. Let’s now look at the data set generation command. The bulk_data_gen command will generate the data set according to the input parameters, which we’ll then use as an input stream for the next command, bulk_load_cassandra. We use two workers here to make it a little bit faster. And our use case today will be DevOps, and the format, of course, Cassandra.
Tomas Klapka 00:31:59.149 It takes some time. Okay. Now we got an ingestion rate of 18,000 values per second, finished after almost 12 seconds, with 2 workers. The higher the number of workers, the more parallel clients you can simulate, and theoretically the higher the ingestion rate you get. But everything depends on your network configuration, storage configuration, and also the database configuration. So after the main data loading, we can finally benchmark our queries. So let’s generate some queries. I’m going to use this command. In this case, the query type is one host in one hour; the format is Cassandra; the use case still remains DevOps; and we will generate 100 queries for this demo and send this stream as input to the query benchmarker tool for Cassandra, with some burn-in queries at the beginning. Our URL is pointing to my local host, port 9042. Workers remain at two, and the print interval is zero, to see only the final results. Okay, there it is. We can see three lines. The second line, the 1M-QP here, means the query planning time, or the time that it takes to parse the high-level query and plan it into CQL. And the last line, 1M-req, means the request round trip time, or the time of making these scatter-gather requests. And the first line, finally, is the total time.
Tomas Klapka 00:35:11.136 Okay. That is pretty much all. And I hope that this showed you how easy it can be if you want to try it from scratch. And yes, there might be quite a high number of parameter variations. However, they can help you in the future with adjusting your input data and requests to fit all your needs. So feel free to try it. All those tools are publicly available in the InfluxDB Comparisons repository. Thank you for watching, and back to Vlasta.
Vlasta Hajek 00:36:11.297 Okay. Thanks, Tomas. I will re-share the screen. So you’ve now seen from Tomas the client side of the benchmarking, how we use the tools. Now, what we did for benchmarking the DBs, the setup, and what the results are. We used both databases in a single-host deployment with their default installations, without any tweaking. There were no other important daemons running or present in memory except Telegraf, which is part of the TICK Stack from InfluxData, and it’s an agent for collecting system metrics. So we’ve gathered the information on how the database uses memory, CPU, and so on. And then, for running in a single-host deployment, to compare the difficulty of installing and setting up [inaudible], we can say that both databases were similarly easy to install and run in the single-host deployment.
Vlasta Hajek 00:37:56.090 In our benchmarking, we ran the benchmarks on two types of hosts: a cloud-based virtual host, and an on-premise [inaudible] bare metal machine. Besides the benchmarks, we also wanted to validate whether someone should be worried about the performance of a cloud-based virtual machine compared to a bare metal machine when following today’s trend of going to the cloud. So in our case we had HP Blades with Intel Xeon E5-2640 v3 running at 2.60 GHz, with 16 cores and hyper-threading. It means 32 virtual cores in total. A little drawback here was that the machines had SCSI hard drives instead of SSDs. In AWS, we’ve chosen the c4.4xlarge machine because it has similar parameters to the bare metal ones, but it has a faster CPU and it uses EBS SSD drives. So the result of this comparison is that AWS gives comparable results to the bare metal ones; it was even a little faster, taking advantage of the faster CPUs.
Vlasta Hajek 00:40:04.245 So, the actual benchmark results. We were comparing InfluxDB 1.4.2 and Cassandra 3.11.1. We used a data set that simulated collecting metrics from 100 hosts over the course of 24 hours, gathered in 10-second intervals during those 24 hours. We used four parallel threads to simulate four workers during the load, and the same was used to send queries and retrieve data. Querying was done as it was shown in the examples: we use a query to select the maximum CPU for one host, grouped by one minute, in a random one-hour interval. So the actual result, as you see in the table, is that InfluxDB outperforms Cassandra in our time series use case, in most cases by an order of magnitude. InfluxDB has an almost 14 times better ingest rate, and it’s also a 16 times better disk-space saver. And when it comes to query performance, InfluxDB had about 30% better query time than Cassandra where we used the client query aggregation. When we compare Cassandra’s client query aggregation with its server aggregation, we can see that client query aggregation gives us the better outcome. Both databases also showed good vertical scalability when we did bulk loading and querying with more clients. So in this case, it’s quite comparable. So just to repeat, we used the default installations when comparing those databases, so I can imagine that with some detailed tweaking we could get slightly different results. And again, we’ve used a single-host deployment, so when we do further benchmarks in a clustered configuration, you may see more interesting results.
Ivan Kudibal 00:43:40.660 All right. Thank you, Vlasta. So this is Ivan, back again just to conclude and initiate the Q&A. As for the conclusions, what can we say about these benchmarks? InfluxDB is clearly the winner. And Cassandra, if we use the client query aggregation, performs 16 times better compared to the server query aggregation. Well, you may have questions with regard to horizontal scaling. Of course, Cassandra scales horizontally well, and it is an open source solution, while InfluxDB requires the enterprise license. We didn’t try horizontal scaling, at least not in the context of this webinar, but what we can tell you is that if you can live with one single host of InfluxDB, you can still serve the use case with a pretty good ingest rate and query rate. And even if you have a pretty expensive horizontal cluster of Cassandra, you can achieve the same results with just one single InfluxDB. This is good, maybe for your money, and, of course, for the resources and time that you would need to invest in building the Cassandra high-availability solution. So what else can we say? If you want the queries that serve the typical DevOps use cases, with Cassandra the client will still have to aggregate the data in its application logic, while the InfluxDB query engine provides this data out of the box. So InfluxDB is a better fit, at least in the use case of monitoring [inaudible], than Cassandra. It has excellent performance. You can get almost zero-effort setup and maintenance costs. So you can use this solution with less storage, and still it scales enough for tons of virtual machines to be maintained. And with this in mind, I would finish the presentation.
Chris Churilo 00:46:59.360 Thank you, team. So we will keep the lines open for a few minutes. If you have any questions about the approach to the benchmarks, or the results, or even just kind of general questions about Cassandra and InfluxDB, feel free to put your questions in the chat and Q&A panels, and we will get them answered. So we’ll just stay here for just a few more minutes. And as we wait for you to put your questions in, just want to remind everybody that we have a training on Thursday, coming up. And we have our trainings every Thursday, webinars every Tuesday, so please join us for these future events. We also have InfluxDays New York coming up in February. I know February seems a little bit far, but it really is just around the corner. So if you happen to be in the area, please join us for InfluxDays, and you can get more information on influxdays.com. And finally, we’ll also make sure that we publish not only this webinar but also the companion paperwork that goes with this webinar for you to take a look at the results in a little more detail. So I’m going to put myself on mute for the next few minutes and just wait for your questions.
Ivan Kudibal 00:50:07.703 I see no questions, so let me just…We have one.
Chris Churilo 00:50:18.580 Yup. Go ahead and read the question out loud and then answer it.
Vlasta Hajek 00:50:25.891 So, can you speak to high-resolution scalability? Well, we didn’t do exact measurements. We simulated the data, if I understood correctly what high-resolution scalability means, the kind of data. We used data which [inaudible] this use case. And the values were generated using a random-walk algorithm. They were limited to some maximums and minimums because there were some other characteristics. So such measurements weren’t done. So, a third party is using time series data from hundreds of thousands of sources, VMs, with resolution as high as two seconds. Has InfluxDB scaled to this data size? Okay. So maybe this explains the question from before. Maybe I was talking about something else. Sorry. So yeah, we are not experts in InfluxDB, but I would guess that with other sources, even at every two seconds, as you can see, InfluxDB had quite a good ingestion rate and good disk savings, so it should handle such data.
Chris Churilo 00:52:23.454 Yeah. And Nick, just to answer that further, so if you’re just speaking about InfluxDB, yes, we can definitely support. We can actually support all the way down to the nanosecond level if required. A lot of times we see that requirement more in kind of an IoT use case. And also, I think the other thing that we should probably dig in with you a little more is, what does your time series data actually look like? So at first blush, looking at that, looks like no problem. And we do scale out quite nicely, but if you want to speak with somebody from our team to go into more details, I’d be happy to help accommodate that so that you feel confident that we can handle your data with no problem.
Chris Churilo 00:53:23.215 Okay. So I want to thank the guys from Bonitoo for helping us today with our webinar. And if you guys have any other questions for them, feel free to just send them to me. You have my email address with the webinar invite, and I’d be more than happy to forward it to the guys and get you guys connected. Alternatively, you can always post questions in our community site, community.influxdata.com, and these guys will also be able to answer your questions there. And Nick, I will separately get you connected with someone from our team so you can chat further. So thanks, everybody. Thanks, Tomas, Vlasta, and Ivan. I really appreciate it. And we look forward to speaking with everybody on Thursday. Have a wonderful rest of your day.
Vlasta Hajek 00:54:11.567 Thank you, Chris. Bye.
Ivan Kudibal 00:54:13.475 Thank you too.
Tomas Klapka 00:54:13.682 Bye.
Ivan Kudibal 00:54:14.471 Bye.