Comparing InfluxDB and Cassandra for Time Series Data Management
In this session we’ll compare the performance and features of InfluxDB and Cassandra for common time series management workloads, specifically looking at rates of data ingestion, on-disk data compression, and query performance.
Watch the Webinar
Watch the webinar “Comparing InfluxDB and Cassandra for Time Series Data Management” by clicking on the download button on the right. This will open the recording.
[et_pb_toggle _builder_version="3.17.6" title="Transcript" title_font_size="26" border_width_all="0px" border_width_bottom="1px" module_class="transcript-toggle" closed_toggle_background_color="rgba(255,255,255,0)"]
Here is an unedited transcript of the webinar “Comparing InfluxDB and Cassandra for Time Series Data Management.” This is provided for those who prefer to read than watch the webinar. Please note that the transcript is raw. We apologize for any transcribing errors.
Speakers: Chris Churilo, Director Product Marketing, InfluxData; Robert Winslow, Independent Consultant for InfluxData
Robert Winslow 00:00 Welcome to another edition of InfluxData’s benchmark reports. Today, we’re going to take a look at how InfluxDB shapes up versus Cassandra. This presentation is going to be about our methodology, and to go along with this presentation is a report on our website that actually shows the numbers that we found when conducting this benchmark comparison. So I want to frame this with some high-level points. The first one is a question. Is InfluxDB faster than Cassandra for workloads you actually care about? We look at it this way because a lot of benchmarks are hard to relate to for a lot of users who aren’t intimately familiar with the databases that we’re talking about. So we’ve made a lot of efforts to build and test these databases or the benchmarks in such a way that our results are hopefully easy to understand. We have a few goals that help us outline this project. We’d like to be realistic. We want to answer performance questions for real-world users, not just other database engineers. In particular, there are a lot of micro benchmarks that most databases ship with that are sort of meaningful to people who work on the databases and become increasingly less meaningful as you go away from that use case. So here we compare databases on workloads that are as close to the real thing as possible. We support a real-world use case called DevOps, developer operations. And I’m going to get into details about how we do that in a few slides.
Robert Winslow 01:40 We’d like to be rigorous. Because we’re comparing two different databases, we’d like to make sure that the query results that come back are correct and identical. In particular, Cassandra’s a general-purpose data store; InfluxDB is a special-purpose one. And so we had to build mechanisms to make sure that Cassandra was doing exactly what we needed it to be doing. Of course, like any good science experiment, it should be easy to reproduce our results. It should be open, and our source code is all open on GitHub, and you can see the URL there. We love your comments, feedback, pull requests, and I’m happy to take a look at anything that you submit there. And then finally, and this is crucial, we need to be fair. So this is a project sponsored by InfluxData, but I am an external contributor, and one of my roles here is to try to show other databases in the best light possible. In particular, for Cassandra, we learned from Cassandra experts how to get the best performance. We had to do a lot of upfront engineering to make that happen, and I’ll get into that in some of the later slides. But we had to decide ahead of time what type of Cassandra queries to support. Of course, we had to design a custom schema to fit not only our data, but as is the case when you’re using Cassandra, we also had to predict what our queries would be. And because we’re storing time series data in Cassandra, we had to determine the best time granularity to store those points. We designed our own high-level query format, and I’ll get into that in later slides. And that goes along with a high-level query executor that we created, and we did this so that we would model how you would actually use Cassandra in production to store time series data.
Robert Winslow 03:36 Let’s get into the methodology. There are five phases for actually running our benchmarks, and these all correspond to tools that we have in the open source suite. So I’ll go through them one by one. The first of five phases is to generate the data. We have a tool that is called bulk_data_gen, and it runs a simulation of a server fleet through time. It creates time series points that correspond to server telemetry readouts like you might get from a tool like Telegraf. So these are metrics collections that are used for, say, system administrators or data center operations or other developers to monitor servers and the software running on those servers.
Robert Winslow 04:28 In our simulation, everything is seeded or run using a pseudo-random number generator so that we can make the runs entirely deterministic. For each server that we simulate, we have nine coarse measurements that we simulate separately. And these correspond to what we would see in a Telegraf system status update. And so you can see here that we have nine, and there’s the CPU, disk metrics, kernel, memory, network, and a few applications also. So here we simulate Nginx, PostgreSQL, and Redis. Note that we’re not actually running any of this software. We are modeling the points that would be emitted from a telemetry service. Each simulated server has 10 static tags that identify it. So these are things, as you can see here, like the machine architecture, its location, a datacenter, the host name that we might give it, its operating system, and a few other variables. And we assign them once at the beginning of the simulation. They also come from the random number generator process, and so they are also reproducible. Just for fun and for maximal realism, we actually used AWS region and datacenter names and other realistic values where we could.
Robert Winslow 05:59 And so the third part of the simulation, or the points that we simulate are the actual fields that we are creating. And so in the previous slide I’ve mentioned that there is the CPU and a number of other measurements. Here, we’re going to get into the actual fields for the CPU measurement. And you can see here that we have the usage of guest, guest_nice, idle, iowait, all these things, and again actually these are field names from the Telegraf telemetry service. And we model each field through time as a separate random walk. We try to correlate them where it makes sense, so for example free memory and used memory are inverses of each other, but in general these are separately modeled random walks. And so this is really important for modelling time series data because if one were to take the naive approach and effectively flip a coin for every time interval for what the value should be, you’d have a uniformly random distribution, and not only would that not be realistic because certainly there are usage patterns for servers where it’ll wander up or wander down, but also it won’t compress realistically, and a lot of time series data compresses in a way that is particular to the fact that it is a time series, and so one state that you observe is dependent on the previous state.
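To illustrate the difference between a random walk and independent uniform draws, here is a minimal Python sketch. It is not the code from bulk_data_gen; the function names, step size, starting value, and seed are assumptions chosen for illustration. It shows that a seeded walk is reproducible, stays within 0 to 100, and changes far less between consecutive samples than coin-flip style noise does, which is the property that makes it compress more like real telemetry.

```python
import random


def random_walk(seed, steps, start=50.0, step_size=5.0, lo=0.0, hi=100.0):
    """Seeded random walk clamped to [lo, hi]; each value depends on the previous one."""
    rng = random.Random(seed)  # a fixed seed makes the run deterministic
    value = start
    series = []
    for _ in range(steps):
        value += rng.uniform(-step_size, step_size)
        value = min(hi, max(lo, value))  # keep the percentage in range
        series.append(value)
    return series


def uniform_series(seed, steps, lo=0.0, hi=100.0):
    """Naive alternative: every sample is drawn independently of the last."""
    rng = random.Random(seed)
    return [rng.uniform(lo, hi) for _ in range(steps)]


def mean_abs_delta(series):
    """Average absolute change between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(series, series[1:])) / (len(series) - 1)


walk = random_walk(seed=42, steps=1000)
noise = uniform_series(seed=42, steps=1000)
print("random walk mean step:", mean_abs_delta(walk))
print("uniform noise mean step:", mean_abs_delta(noise))
```

The walk's mean step can never exceed the chosen step size, while independent uniform draws jump by a large fraction of the whole range on average.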
Robert Winslow 07:26 And so here’s a picture, a graph of one of the CPU simulations, and so you can see here it’s sort of wandering up and down between 0 and 100 percent through the course of a few minutes. And this of course is a server that’s experiencing a lot of very spiky load, but for the purpose of this simulation the random walk is the important part, really looking at how this is realistic and it’s not uniformly distributed. And finally, the fourth part of every point is the actual time that it occurs, and so every simulated data point has its own time stamp. Each time stamp represents the time since the beginning of Unix time, or the Unix epoch. For InfluxDB, we use nanosecond precision, and then for the Cassandra setup we use 64-bit integers so that we can have nanosecond precision also. Finally, as points are emitted in the simulation, they proceed in 10-second epochs. So to make this more concrete, here’s a complete simulated point that we’ve created from the InfluxDB simulator. So the first part (and this is all on one line; we’re using line protocol format in Influx) is the measurement name, which is CPU. You can see we have machine-specific tags, and again these don’t change, and they represent particulars about the virtual machine that we’re pretending that we have or the server in the datacenter.
Robert Winslow 09:01 The third section is fields. And you can see here, these are all, as we mentioned before, CPU values. And so these are percentages between 0 and 100. And you can see that they all fit that. And they’re all fairly randomly distributed in there. Then, finally, the timestamp. So the data generator, at this point, supports a number of different databases. For this report, we’re just going to talk about how it supports the InfluxDB HTTP line protocol and Cassandra CQL. I’ll make a special note here to talk about the importance of serialization overhead. It’s very important to generate and serialize data before performing a bulk write benchmark. If we instead create the data, serialize it into the format the database expects, and then write the data all during the benchmark, we could pollute our timing data. In particular, this is true with Cassandra versus Influx because their protocols for writing data have slightly different bandwidth requirements. It’s very slight. But also the actual act of serializing can be expensive.
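The point layout described above (measurement name, tags, fields, timestamp, all on one line) can be sketched in Python as follows. This is an illustrative serializer, not the generator's actual code: the function name, tag values, and field value are assumptions, and the escaping of spaces, commas, and equals signs that real line protocol requires is left out for brevity.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Serialize one point as 'measurement,tags fields timestamp' on a single line.

    Assumes keys and values contain no spaces, commas, or equals signs,
    so no escaping is performed (a simplification for this sketch).
    """
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


# Hypothetical point: the tag and field values are made up for illustration.
line = to_line_protocol(
    "cpu",
    {"hostname": "host_0", "region": "us-west-1"},
    {"usage_user": 58.5},
    1451606400000000000,  # nanoseconds since the Unix epoch
)
print(line)
```

Serializing every point like this into a file before the benchmark starts keeps the float-to-text conversion cost out of the measured write time.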
Robert Winslow 10:20 In particular, in Go, I’ve learned, and this is true of many other languages, working with floating point values, converting them to and from ASCII, is actually a fairly costly operation, to parse them correctly and to serialize them correctly. And so we do all of that ahead of time to minimize jitter and variance when running the benchmark. For this benchmark report, we generated the DevOps-1000 dataset. And what this means is it’s the dataset that simulates 1,000 hosts over various time periods. And so we have taken the idea of a simulation of DevOps data and condensed all that into a single, what we call, scaling variable, which is the number of simulated hosts that we have. And so for this, we have 1,000 simulated hosts. In the past, we’ve experimented with 100 simulated hosts. In the future, we’d like to experiment with 10,000 simulated hosts, and so on. And the way to interpret this, as you’re deciding which database to use, is to think: if I actually had a datacenter with 1,000 servers or a cloud with 1,000 VMs, and I wanted to look at the performance that I would see with InfluxDB and Cassandra, this simulation maps as closely as we can get to that actual use case.
Robert Winslow 11:52 We had to think about upfront design in Cassandra when we were designing the system. And so we looked at four different things, among others. We had to create separate tables for each value type, so we call these blessed tables in the code base, and what this means is that different field types need to be written in different tables, and so some are integers, some are floats or doubles, bools, or even string blobs. We don’t test string blobs or bools in the simulation, but we test 64-bit ints and 64-bit doubles. We also had to decide on the primary key. Here we use the series name and the time stamp. We chose a partition key as being the name of the time series including tags. And the way that we determine what the name of the time series is involves the client-side index, which I’ll talk about later. And finally, we chose compact storage. So to summarize the data generation phase, data points are generated from a simulation of servers. Every 10 simulated seconds, each host emits 100 field values across 9 measurements. We store that data in a Zstandard-compressed file, and we decided on the Cassandra data model ahead of time.
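The figures in that summary imply a simple volume calculation for the DevOps-1000 dataset. The Python sketch below works it through for one simulated day; the constant and function names are assumptions, and it assumes one point (one line) per measurement per host per 10-second epoch.

```python
HOSTS = 1_000                 # DevOps-1000 scaling variable
MEASUREMENTS_PER_HOST = 9     # coarse measurements per host
FIELD_VALUES_PER_HOST = 100   # field values per host per epoch
EPOCH_SECONDS = 10            # the simulation advances in 10-second epochs


def epochs_in(seconds):
    """Number of whole simulation epochs in a time span."""
    return seconds // EPOCH_SECONDS


def points_emitted(seconds):
    """Points emitted, assuming one point per measurement per host per epoch."""
    return epochs_in(seconds) * HOSTS * MEASUREMENTS_PER_HOST


def field_values_emitted(seconds):
    """Individual field values emitted across the whole fleet."""
    return epochs_in(seconds) * HOSTS * FIELD_VALUES_PER_HOST


DAY = 24 * 60 * 60
print(points_emitted(DAY))        # 77760000
print(field_values_emitted(DAY))  # 864000000
```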
Robert Winslow 13:17 On to part two of five. We’re now going to load the bulk data and benchmark write performance. So we load the generated data from the simulation and measure write throughput. The data loading programs stream data from standard input. A loader combines input into batches and we send it to the database as fast as we can. Of course, parallel requests are supported. And part of the reason we do serialization out of band or ahead of time is to keep bulk loading as simple as possible. This makes the code base, of course, easy to maintain, but it also gives us confidence that it’s correct and that it’s as fast as we can make it. We have one bulk loader program per database, and this lets us specialize. And so here we have one for Cassandra and another one for Influx. They both take the parameters batch_size and parallelism, and you can interpret parallelism as how many requests to make in parallel, and batch_size as how many items are in each batch that gets sent in a request.
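The batching and parallelism described above can be sketched in Python as follows. This is not the actual loader (which is written in Go); the `batches` and `load` functions and the `send` callback are hypothetical, and the callback stands in for the real database write so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batches(lines, batch_size):
    """Group an iterable of pre-serialized items into lists of at most batch_size."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load(lines, batch_size, parallelism, send):
    """Send batches using `parallelism` workers; return the total items written.

    `send` is a stand-in for the real write call and must return the
    number of items it wrote.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return sum(pool.map(send, batches(lines, batch_size)))


# In a real run `lines` would be sys.stdin; a list keeps the sketch self-contained.
sample = [f"item {i}" for i in range(10)]
print(load(sample, batch_size=4, parallelism=2, send=len))
```

Because the items arrive already serialized, the loader only has to group and forward them, which keeps the timed section small.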
Robert Winslow 14:30 When writing to Cassandra, we use the binary protocol version four with the Go library gocql. We did a lot of work to make the Cassandra loader fast and fair. Please let us know what you think. You can see a link to the source code there. On the InfluxDB side, it’s a straightforward HTTP client. We use the Go library fasthttp for better performance. We did find significant speed-ups when using that versus the standard Go library’s HTTP client. So when you run this bulk loader program, here’s an example of the output it’ll give you. It loaded 3.8 million items in 36 seconds with 16 workers at a mean rate of 106,000 per second. So you can look at this as saying an item for Influx is a line in the line protocol, which can have multiple field values as we saw before. So this is how many items were loaded, in how many seconds, with the given parallelism.
Robert Winslow 15:43 So now on to part three of five. We’re actually going to generate queries that we will run against the databases that we have loaded with the data from the previous phase. So we create these queries ahead of time and we save them to a file. And again, they’re from the simulation, so that DevOps-1000 dataset that I mentioned earlier that we based this report on, the queries also come from that. And they are simulation-based. The query generator logic, that simulation-based query generator, is shared between the databases. It’s called bulk_query_gen. This helps us with code style, but it also makes sure that everything is the same between databases because, actually, the code is only written once for the simulation. And I’ll make a similar point to the one about writing, but it’s even more important at the query phase. Many benchmark suites generate and serialize queries at the same time as running benchmarks. And of course, we don’t. That would be a big mistake. Different databases and different ways of using those databases have different serialization overheads. And that could pollute the benchmarking results. We serialize them ahead of time. In general, we do as much work as we can ahead of time to make the actual numbers that we collect at query time as fair and low variance as possible.
Robert Winslow 17:15 So the way that we think about queries is that they’re templates, templatized. And so there’ll be one type of query that we’re testing against both databases. And there’s a template, say for Influx, for a particular query type. And then there’s a template for Cassandra for that same query type. And the structure of the query is the same but the parameters are different. And so they get populated from the simulation with values for the start and end time that the query is asking about, and, say, server names or other parameters from the simulation that the query is filtering or aggregating over. So here are some example query types. Say we’d like to ask about the maximum CPU usage over four hosts from the simulation during a particular hour grouped by a minute, or the mean CPU usage over the entire server fleet during a whole day grouped by the hour and the host name. And in the code base, there are a number of different queries for the different databases.
Robert Winslow 18:24 So Cassandra required a fair amount of upfront work on our end to get it to map to the time series use case, so we had to design our own high-level query to represent the Cassandra operations we needed to perform. Our Cassandra high-level queries are scatter-gather operations. Cassandra is a wonderful building block for distributed systems, but it, of course, is fairly low level, and we need it to support the general time series use case. In particular, this means that we needed to support ad hoc querying based on tags. We needed to support grouping by time intervals, so we had to develop a piece of software that we called the client-side index, or the smart client, that is the middleware between Cassandra and the end user. So for Cassandra, the query generator makes high-level queries in a custom data format. It’s a fairly simple format. Here’s an example. It’s just a set of key value pairs, so we have the measurement name, say it’s CPU, and the field name we’re asking about, usage_user. The aggregation type is the maximum of these field values. The time start and time end are time stamps that are chosen from the time interval this simulation applies to, so let’s say that we generated a week of data for the DevOps-1000 simulation. The time start and time end in this example would be in a random hour inside that interval. And you’ll note that it happens in an arbitrary offset, say the seconds there is 6 and the minute is 42.
Robert Winslow 20:08 We do that so the databases are less likely to kind of optimize the query responses, so if we kept only hour boundaries for these queries, databases could actually start caching the responses because there are only so many hours in a week, and so using these fine grain random hour delineations lets us do effectively cache busting on the databases. And then the second to last field is the group by duration. Here we’re choosing one minute. And then tag sets is just a set of filters that we apply to the tags before doing aggregations, and here we just say, “We would like only data for the host name of host_0.” And then so this so-called high-level query will be parsed and utilized by the query benchmarker which we’ll talk about next.
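The high-level query and its randomly offset time window can be sketched in Python like this. The real HLQuery is a Go type; the field names, the `random_window` helper, the seed, and the dates below are assumptions for illustration only. The point is that the window start is chosen at one-second granularity within the dataset's time range, so windows rarely line up on hour boundaries and cached responses are unlikely to be reused.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class HLQuery:
    """Illustrative stand-in for the key-value high-level query format."""
    measurement_name: str
    field_name: str
    aggregation_type: str
    time_start: datetime
    time_end: datetime
    group_by_duration: timedelta
    tag_sets: list


def random_window(rng, data_start, data_end, window):
    """Pick a window of length `window` at a random one-second offset in the data range."""
    span_seconds = int((data_end - data_start - window).total_seconds())
    offset = timedelta(seconds=rng.randrange(span_seconds + 1))
    start = data_start + offset
    return start, start + window


rng = random.Random(1234)  # seeded, so the generated queries are reproducible
data_start = datetime(2016, 1, 1, tzinfo=timezone.utc)  # hypothetical week of data
data_end = data_start + timedelta(days=7)
start, end = random_window(rng, data_start, data_end, timedelta(hours=1))

query = HLQuery(
    measurement_name="cpu",
    field_name="usage_user",
    aggregation_type="max",
    time_start=start,
    time_end=end,
    group_by_duration=timedelta(minutes=1),
    tag_sets=[["hostname=host_0"]],
)
print(query)
```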
Robert Winslow 21:00 Just to switch over to Influx for a second. So here are the Influx queries in the InfluxQL format. And this is a concrete query, like what you could enter at a console. So at the top, we have a query template, so we’re selecting the maximum from the usage_user field from the CPU measurement where time is greater than the start and less than the end. And again we’re filtering by a host name, and we’d like to group by a particular time interval. And then the second section here just shows that same query with some arbitrary but realistic values filled in for the template variables. And you can see here, we’re picking what looks like one day of data, and we’re grouping by an hour for a particular host name.
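Filling in such a template can be sketched in Python as below. The template text is a reconstruction from the spoken description, not a copy of the slide: the `hostname` tag key, the inclusive lower bound, and the example timestamps are assumptions.

```python
from string import Template

# Reconstructed from the description; the exact text on the slide may differ.
QUERY_TEMPLATE = Template(
    "SELECT max(usage_user) FROM cpu WHERE time >= '$start' AND time < '$end' AND hostname = '$hostname' GROUP BY time($interval)"
)


def render_query(start, end, hostname, interval):
    """Substitute concrete simulation values into the query template."""
    return QUERY_TEMPLATE.substitute(
        start=start, end=end, hostname=hostname, interval=interval
    )


# One day of data for a single host, grouped by the hour (illustrative values).
print(render_query("2016-01-01T00:00:00Z", "2016-01-02T00:00:00Z", "host_0", "1h"))
```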
Robert Winslow 21:50 So each Cassandra request is stored as a gob-encoded Go type which we call an HLQuery or high-level query. And each Influx request is stored as a gob-encoded HTTP request. And again, these are being written to a file ahead of time before we run the queries against the databases. And so at query time, when we’re pulling these in from standard in, we’ve noticed that we can decode about 700,000 of these payloads per second. And this has not been a limit for single-node benchmarking, but of course, when we scale up to clusters, we’ve looked at ways to maybe reduce the serialization overhead even more, or to split up the query benchmarking so that it happens on multiple machines. And as I mentioned before, the query generator program uses a seeded random number generator to fill in the parameters. And just like the data generation phase, the output’s 100% reproducible, and it is based on the simulation approach as before. And when you run the bulk query generator for any of these databases, here’s some output that you might see. So if you don’t specify a seed ahead of time it’ll give you one. And so this shows you that it picked a random seed and generated a number of queries for Influx and for Cassandra. The way to read this is that it generated, for the first box, Influx queries of the max CPU type, for one random host from whatever the server fleet is that we’re simulating, over a random one-hour time window, grouped by one minute, and we generated 10,000 queries of that type. And you can interpret the Cassandra output exactly the same. And again, this is open source, so it’s called bulk_query_gen in the InfluxDB comparisons repository.
Robert Winslow 23:55 Now onto part four of five, we’re actually going to take those queries we just generated and run them against the databases. So we take our ready-to-go queries from a file, pipe them through standard input, and submit them as fast as we can to each database, potentially in parallel. There are two programs, similarly to the bulk load phase. This lets us specialize. Here we have query_benchmarker_cassandra and query_benchmarker_influx. And they both share the destination URL and the parallelism parameters, as well as some custom database-specific parameters. One thing that I’d like to really make a special point about is that the query benchmarker for Cassandra is very different from the one for Influx. As I mentioned earlier, Cassandra is a general-purpose building block. It’s not a complete solution to support the time series use case, at least out of the box. So we had to write a smart client for Cassandra.
Robert Winslow 24:59 I would say relatively speaking, the Cassandra smart client is complicated, especially when compared to InfluxDB. On startup, the smart client reads in table metadata from Cassandra and builds a client-side index. So at launch, it actually pulls in data from Cassandra before we can run queries against it to populate a client-side data structure. And then for each incoming request, the benchmarker client does the following. It parses the high-level query from standard in, and what this means is that it looks at what kinds of data it’s actually asking about and what tags it might apply to and what time windows. As I mentioned earlier, we had to choose sort of a way to shard or segment based on time buckets. And so this is where that logic comes in. It builds a query plan based on that. It’s a simple one, but it’s a scatter-gather format where the aggregations can happen either on the server or on the client, and to make that concrete, aggregation means, say, the maximum CPU usage over a given time window. So if we’re asking about half an hour intervals over the course of a day, it’ll be 48 intervals, and we need to aggregate over all the relevant data in each one of those intervals. And that can happen on the server, on the client, or combination of both.
Robert Winslow 26:28 After building this query plan, it actually creates the CQL queries it needs to execute. And then it submits those queries to Cassandra. And then finally, as the results come back, it does final merging within time windows to aggregate data as requested. But one thing to note is that we generate CQL queries at benchmarking time, and I said earlier that we would like to avoid this kind of work at query time. The reason we designed it this way is this is exactly what you would actually have to do in production. When you compare the high-level query that we have for Cassandra, our so-called high-level query, to what InfluxQL gives us, those are very comparable. It’s the same data, just in a different format. And so what Influx is doing is it’s taking an InfluxQL query and figuring out how to satisfy that for you at query time. With Cassandra, we had to build that mechanism to do query planning and query results aggregation. And so that all happens at runtime to be as realistic as possible.
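The plan, scatter, and merge steps described above can be sketched in Python as follows. This is a simplified illustration rather than the smart client itself: an in-memory dictionary stands in for Cassandra, times are plain integer seconds, all aggregation happens on the client, and the function names are assumptions.

```python
def plan_buckets(start, end, interval):
    """Split [start, end) into consecutive group-by intervals (times in seconds)."""
    buckets = []
    t = start
    while t < end:
        buckets.append((t, min(t + interval, end)))
        t += interval
    return buckets


def scatter_gather_max(series_data, series_names, start, end, interval):
    """Return {bucket_start: max value} across all matching series.

    series_data maps a series name to a list of (timestamp, value) pairs
    and stands in for the per-series sub-queries sent to the database.
    """
    results = {}
    for lo, hi in plan_buckets(start, end, interval):
        partials = []
        for name in series_names:  # scatter: one sub-query per series and bucket
            values = [v for ts, v in series_data[name] if lo <= ts < hi]
            if values:
                partials.append(max(values))  # partial aggregate per series
        if partials:
            results[lo] = max(partials)  # gather: merge partials within the window
    return results


# Half-hour intervals over one day give 48 buckets, as in the example above.
print(len(plan_buckets(0, 24 * 60 * 60, 30 * 60)))

data = {
    "host_0": [(0, 10.0), (100, 30.0), (1900, 5.0)],
    "host_1": [(50, 20.0), (2000, 7.0)],
}
print(scatter_gather_max(data, ["host_0", "host_1"], 0, 3600, 1800))
```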
Robert Winslow 27:45 So when you run the Cassandra query benchmarker, here is an output you might see. And I want to talk more about the text, not the numbers. You can see the full numbers in the benchmark report. And so I’ll go line by line. The first line says, “Burn-in complete after 100 queries with 4 workers.” This means that, from our query dataset, we used up 100 at the beginning to just sort of warm up the database, to, say, get the database to load disk data into RAM where applicable. Burn-in is a very common pattern in benchmarking. And then the second line says, “The run is complete after 5,000 queries with 4 workers.” And then it gives us a breakdown of how time was spent. And in this case, Cassandra’s special because it has two different parts that it’s measuring. It’s not just a single request. It’s actually measuring the total time, which is the first line. You can see it says, “Cassandra max CPU, rand one hosts,” and so on. And then in the middle there, it just says “1m.” And that’s the whole time elapsed. In the second timing line there, that field says “1m-qp.” And that means how much time was spent query planning. In this particular example, it was about 200 microseconds. You can see here, 0.2 milliseconds. Which, of course, is not much overhead at all. But we break it out so that we have a sense that that part is as efficient as we could make it. And again, the query planner is using data that was loaded up from Cassandra at boot.
Robert Winslow 29:25 And then finally, the third timing line is how much time was spent on making scatter-gather requests. And so we call that 1m-req. And you can see here that the mean total time per request is the sum of its two components. And then finally, it shows how much wall-clock time elapsed. For comparison, the InfluxDB benchmarker is very straightforward. We just submit InfluxQL queries over HTTP, and again, we use the Go library fasthttp. And this is what you’ll see when you run the InfluxDB benchmarker. And it’s similar to the Cassandra one, but it just has fewer lines because it’s a simpler approach.
Robert Winslow 30:17 Now we’re going to get to part five of five. We’re going to validate the results. And this is probably my favorite section. So we’re going to validate the results we got back from each database so that we can see that they matched our expectations. We really care about correctness. In particular, a lot of these databases are very different from each other. For example, InfluxDB is a special-purpose time series database, while Cassandra is a general-purpose NoSQL database. And so in writing this software, we want to make sure it’s as apples to apples as possible. And so query validation answers the following questions. Were the parameters and the points from the simulation identical for both databases? Did we load the points correctly? When we generate queries based on the simulation, were those queries identical for both databases? And were the queries semantically identical and correct? And again, for example, with InfluxQL, we write queries in that format and then we translate those to, say, the Cassandra high-level query system that we’ve created. This checks that those actually mean the same thing.
Robert Winslow 31:37 So what validation is for this benchmark suite is that we check that the numerical results that we get back from a set of queries are identical. So here’s an example. In this scenario, we have generated data for both Cassandra and Influx, we’ve loaded that data into both databases, measured how that went, we’ve generated queries for both databases, and then we ran a set of those queries against each database in what we call validation mode, where we print the numeric results that we get back. And so here you can see, in the first gray box, that the Cassandra response we get back shows that it’s 4:33 AM UTC on January 1st. And the value that it gives back for this query is 32.89 and change. And then if you jump to the second gray box, for Influx it’s actually the same. So it’s January 1st, 4:33 AM UTC, and then the numeric value is also 32.895 and change. And so we do this for a couple hundred queries after we run the benchmarks. And so this is kind of a record that the data was loaded correctly and that the queries are run correctly and they mean what we think they mean against both databases. When I say this is my favorite section, the reason I say that is because it confirms that this is really a fair comparison and that we’re looking at these databases that are very different under the hood and using them in the same way.
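A comparison of this kind can be sketched in Python as follows. This is not the project's validation code, which is described as semi-manual: the `results_match` function, the relative tolerance, and the sample values are assumptions for illustration. It checks that two result sets have the same timestamps in the same order and values that agree within a small tolerance.

```python
import math


def results_match(results_a, results_b, rel_tol=1e-9):
    """Compare two lists of (timestamp, value) pairs from different databases.

    Timestamps must match exactly; values may differ by a small relative
    tolerance to allow for floating point differences (the tolerance is an
    assumption of this sketch).
    """
    if len(results_a) != len(results_b):
        return False
    for (ts_a, value_a), (ts_b, value_b) in zip(results_a, results_b):
        if ts_a != ts_b:
            return False
        if not math.isclose(value_a, value_b, rel_tol=rel_tol):
            return False
    return True


# Made-up values, not the numbers from the webinar slides.
cassandra_rows = [("2016-01-01T04:33:00Z", 32.5), ("2016-01-01T04:34:00Z", 41.25)]
influx_rows = [("2016-01-01T04:33:00Z", 32.5 + 1e-12), ("2016-01-01T04:34:00Z", 41.25)]
print(results_match(cassandra_rows, influx_rows))
```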
Robert Winslow 33:23 It’s not perfect. Floating point tolerances show small differences, and right now validation is semi-manual. We’re thinking about how to automate it. One suggestion was to actually use the simulation itself to derive kind of ideal or gold results. And then we can compare each database to that. So now I’ll wrap up. As I began this, I mentioned a few goals, and so I’ll go over those goals again and ask how we did. So did we meet our goals? Were we realistic? Data and queries are modeled on a real-world developer operations use case. And the simulation approach creates plausible time-series data for a server fleet. Were we rigorous? Well, query validation proves correctness and identical semantics between databases. Certainly, we’re reproducible. Data generation and query generation are deterministic. Are we open? All of our code is open-source. And were we fair? Of course, on this one, there will always be people with strong opinions on both sides. For this project, in an ongoing manner, we use Cassandra in the most favorable way we know how. And that’s it. Thank you very much for your time.
Chris Churilo 34:53 Okay, Robert. Do you want to go ahead and read those questions aloud?
Robert Winslow 34:56 Sure. So I’ll go from the top. So what conclusions did your experiments lead you to? Well, I encourage you to read the report that’s coming out, which is an update to a report that we made last year. And I would say the conclusions were mainly qualitative. Cassandra is, of course, a very useful, very prominent database. And I felt that, as a user of these databases while we were adding these benchmarks, because I already knew what I was trying to do with storing time series data and querying that in an ad hoc way, using Influx was a lot easier and out-of-the-box. So that would be my qualitative conclusion.
Robert Winslow 35:40 So the second question is where can we find the recording of the session. I’ll defer to Chris on that, but I think it’ll be up on the website in the near future. Ken thanks us for doing this. Yeah, I appreciate that. It’s been interesting to try to be comparing against so many different databases as we have. I’ve tried to make it as unpolitical as possible because I think everybody has strong opinions on these issues, so thank you for that. And here’s a question about, “Summarize the difference in response times between Influx and Cassandra.” I’m going to have to defer to the report on that. I will say that both are fairly fast, and again, it’s really about how much complexity are you willing to handle and support. I know that the smart client that I mentioned earlier in the benchmark, it’s good for the benchmark, and it represents what would be the beginning of what somebody would have to build to use Cassandra for time series. But again, making that more robust, production-ready, that is a piece of software that you’ll have to support. So in my view, that’s the much more important consideration.
Robert Winslow 37:00 There’s a question, “Could you please mention–?” Oh, sorry. “What special features are there in Influx which are not there in Cassandra?” I would say that, really, InfluxQL is probably the primary thing there, so it’s a high-level query language for time series data exploration and querying, and that is actually what we modeled the Cassandra high-level queries off of. I also understand that there are time series specific features in Influx, say, Holt-Winters forecasting and so on, but I refer you to the copious documentation on the InfluxData website for more on that. And there’s a question about the hardware we use for the test. Sure, so this was a single-server, single-client test. The hardware was on-premise servers. They had a 40-gigabit link between the two, and they’re about six inches apart in a rack in San Francisco, and they had NVMe hard drives. So extremely fast, and I believe that the core count was 16 each in their Intel Xeons. And I believe there’s about 32 gigs of RAM on each box.
Robert Winslow 38:26 So there’s a question, how much data can we store on Influx, where it doesn’t cause any performance issues? And in particular, can we do greater than 30 gig, greater than 60 gig? Well, one thing that I’ll counter that with is, Influx uses time series specific data compression. And so the sizes that you mention here, 30 gig or 60 gig and so on, that depends on how compressible your data is. In particular, if it follows any sort of trend through time, your data will be eminently compressible. So I would say that you could probably store quite a bit. This is in contrast to something like Cassandra, which, because it doesn’t know about the time series use case, it stores data in a compressed way that’s more about streams. And I’d have to check to see which actual compression mechanism they use. But Influx uses time series specific compression.
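[Editor's sketch: to illustrate why data that follows a trend through time compresses well, here is a toy delta encoding of regularly spaced timestamps. This is an illustration of the general idea only, not InfluxDB's actual encoder; the function names and sample values are hypothetical.]

```python
def delta_encode(values):
    """Store the first value, then only the difference to the previous one."""
    if not values:
        return []
    deltas = [values[0]]
    for prev, cur in zip(values, values[1:]):
        deltas.append(cur - prev)
    return deltas

def delta_decode(deltas):
    """Rebuild the original values by running a cumulative sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values

# Six samples taken every 10 seconds (hypothetical epoch seconds).
timestamps = [1_500_000_000 + 10 * i for i in range(6)]
encoded = delta_encode(timestamps)

print(encoded)
```

[After encoding, every entry except the first is the same small number, which a general-purpose or run-length compressor can then shrink far more easily than the original large values.]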
Robert Winslow 39:32 So there’s a question about how do you get the best performance through batching? Is there kind of a way that we could talk about throughput, megabits, or kilobits, or megabits per write, when having an HTTP endpoint? I would say that it’s a little bit empirical, for better or worse. But typically you’re not going to see a lot of improvement above 1,000 lines per request. I’m assuming that you’re asking about Influx here, because think about it from the point of view of the application. You have a couple different things going on, where you want updates to go to the database regularly, and ideally at low latency. And so on one extreme you would send one request per value. But that leads to a lot of network overhead and a lot of requests being made. And so you can batch them sort of as an application developer in time intervals, as we did in the simulation. Or you can look at just the maximum bandwidth per request. But again, I haven’t really found much improvement above 1,000. And note, also, that requests can be compressed; Influx supports gzip compression. So that’s also a consideration.
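[Editor's sketch: batching writes at roughly 1,000 lines per request and gzip-compressing each request body, as discussed above, could be prepared along these lines. The measurement, tag, and field names are made up, the helper names `batch_lines` and `encode_batch` are hypothetical, and the actual HTTP call is left out; a gzip-compressed body is normally sent with a `Content-Encoding: gzip` header, but check the InfluxDB documentation for your version.]

```python
import gzip

def batch_lines(lines, batch_size=1000):
    """Yield consecutive slices of at most batch_size lines."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]

def encode_batch(batch):
    """Join lines with newlines and gzip the result into one request body."""
    body = "\n".join(batch).encode("utf-8")
    return gzip.compress(body)

# 2,500 made-up points in line protocol form:
# measurement,tag=value field=value timestamp_in_nanoseconds
lines = [
    f"cpu,host=host_{i % 10} usage_user={i % 100}.0 {1_500_000_000 + i}000000000"
    for i in range(2500)
]

payloads = [encode_batch(batch) for batch in batch_lines(lines)]

raw_size = len("\n".join(lines[:1000]).encode("utf-8"))
print(len(payloads), raw_size, len(payloads[0]))
```

[The batch size is a parameter here because, as noted in the answer, the best value is somewhat empirical.]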
Robert Winslow 40:54 There’s a question. Did you consider using the Newts front end over Cassandra and benchmarking? It was designed by the OpenNMS Project as a way to store time series. So we have not looked at that yet. So we derive these benchmark comparisons from the requests of our customers and potential customers. And what they’re asking for was how to use Cassandra for time series. And so we did not look at Newts. But if we start hearing that we need to look at this, I am more than happy to do it. And as I said before, Cassandra’s a performant general purpose data store and I think that it has a lot to offer, and it will be worthwhile to look at more than one front end for Cassandra so I appreciate that suggestion. Cassandra has a limit of two billion columns, sorry two billion column limit per row, yeah. Is there such a limit for measurement in Influx? To my knowledge no. So you’re just going to be writing timestamped values, and so the data structure it uses under the hood is something called a Time-Structured Merge Tree. Sort of like an LSM tree with time keys. And I believe the way that you’re supposed to kind of deal with high-volume data is you have retention policies. Maybe you discard old data or maybe you downsample it. Is it advisable to store in JSON format or do we always need to deserialize the data and store in Influx? JSON is not good for databases. Typically you want something that is compact and strongly typed. JSON is neither, so I would always advise actually storing in the database of your choice using the most compact and strongly typed option you have.
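[Editor's sketch: the point about JSON being neither compact nor strongly typed can be seen by encoding one made-up data point both ways. The field names and the fixed binary layout (a 64-bit integer timestamp followed by a 64-bit float) are assumptions chosen for illustration, not the storage format of any particular database.]

```python
import json
import struct

# One hypothetical data point.
point = {"timestamp": 1500000000, "usage_user": 42.5}

# JSON repeats the field names in every record and stores numbers as text.
as_json = json.dumps(point).encode("utf-8")

# A fixed, typed layout: little-endian int64 timestamp, then float64 value.
as_binary = struct.pack("<qd", point["timestamp"], point["usage_user"])

# The typed form round-trips to the same values with known types.
timestamp, usage_user = struct.unpack("<qd", as_binary)

print(len(as_json), len(as_binary))
```

[The binary record is always 16 bytes, while the JSON size grows with the field names and the number of digits.]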
Chris Churilo 42:43 And Robert, Ed Gray also sent a question to the chat window.
Robert Winslow 42:49 Okay. What size servers were used to host the databases? Oh yeah, so these were pretty big boxes that we had at our San Francisco office here. So they’re connected with 40-gigabit links. They’re about a foot apart from each other and they have NVMe hard drives with, I believe, a terabyte of storage. And 16 CPU cores and 32 gigs of RAM. There’s a question. We’re seriously evaluating Influx for migrating data from an existing system to Influx. Are there any best practices we should keep in mind? Certainly, there are, and Influx has seen a fair amount of change in its underlying data structures over the last couple of years. I will defer the answer to that question to the support staff at the company as well as the online documentation. What I’ve tried to focus on is the performance numbers and the user experience when using Influx versus other databases. So I’m not an expert on that question. This is a question about migrating from MySQL. We have not run a benchmark against MySQL. Of course, I’m more than happy to look into it if there’s a need for it. Maybe you can talk to somebody at InfluxData about that. That’d be pretty interesting. So there’s also a question, have we looked at Druid? We have not yet, that’s coming up more and more though and I believe it’s on the roadmap. So stay tuned in the coming weeks and months for more on that. And again if you have any questions that you’d like to show your interest in seeing a benchmark report versus another database, please just get in contact with the InfluxData staff, and they’ll route the message to me and we can prioritize appropriately.
Robert Winslow 44:53 Chris, would you like to answer that one?
Chris Churilo 44:56 I think that was a perfect answer. So we want to put that on our list of things to review, and anybody can feel free to just send me an email and I can route your questions to the appropriate people. You should have my email in the invite to this webinar. And if not, it’s [email protected] and I can definitely make sure that I get that to you. I’ll send out an email later on today just to let you guys know that the recording’s up so you’ll have my email at that point as well. Lots of nice questions today. I appreciate everybody’s time, and I’m really pleased that you guys found the session to be informative. As I mentioned earlier I will post this recording, and we’ll also post the updated companion tech paper that actually has all the data in it as well. So once again Robert, fantastic job. I appreciate it and I think all of our attendees did as well. So thank you very much.
Robert Winslow 46:01 Thank you, everybody. I appreciate the time you took to watch this.