How to Choose the Right Database for Your Workloads
Session date: Oct 11, 2022 08:00am (Pacific Time)
Learn how to make the right choice for your workloads with this walkthrough of a set of distinct database types (graph, in-memory, search, columnar, document, relational, key-value, and time series databases). In this webinar, we will review the strengths and qualities of each database type from their particular use-case perspectives.
Resources on other database types
- Key-value databases
- Graph databases
- Search engine databases
- Time series databases
- Document databases
- Relational databases
Here is an unedited transcript of the webinar “How to Choose the Right Database for Your Workloads”. This is provided for those who prefer to read rather than watch the webinar. Please note that the transcript is raw. We apologize for any transcription errors.
- Charles Mahler: Technical Marketing Writer, InfluxData
Charles Mahler: 00:00:00.366 All right. So I’m going to wait around here for about a minute or two before we get started and wait for everybody else to kind of trickle in here. In the meantime, if you haven’t registered for InfluxDays, we do have that upcoming at the start of November. So before you go on to the presentation, be sure to check out influxdays.com and you can see some of the stuff we’ll have available. Just a bunch of stuff on there about InfluxDB, the ecosystem, and some of the upcoming features. So if you haven’t seen that, be sure to check that out. All right. Let’s jump into it then. So basically, as you can see from the title, we’re going to be going over how to choose your database for your application or whatever you’re doing. The agenda here is: go over the current existing database landscape and some of the trends that we see. Basically, what makes them perform differently? What are the pros and cons? Why are these trade-offs there? Some of the technical aspects. Go over the different types of databases. For each of them, we’ll look at, again, the pros and cons, the use cases, and just the benefits of each of them. And then finally, we’ll end off with some future database trends and some features the databases are adding over time. And then we will have some time left for Q&A as well.
Charles Mahler: 00:01:59.235 So things to keep in mind are, again, don’t fall for pure hype. When possible, you should probably try to keep it simple. Be logical about things. Don’t kind of FOMO into the new buzzword, whatever people are going crazy about. Keep it as simple as possible depending on your use case. Don’t make things overly complex. Think about the trade-offs. There’s nothing really magical in databases. Part of the reason NoSQL got a lot of hate at the start is because people acted like it was just magical and could do everything. And people, once they adopted it, realized that there were trade-offs. So don’t fall for the idea that there’s some magical database that’s going to solve all your problems. And there are a lot of databases that have overlaps. So for some of these, you might be like, “Hey, I don’t think this database belongs in this category,” or it can be used for additional use cases. We have an hour here, so I have to keep things kind of general and make some simplifications. So keep that in mind as well.
Charles Mahler: 00:03:05.722 So these are the main types of databases and the ones we’ll be covering today. We’ve got relational. We’ve got a bunch of different NoSQL types. And we’ll see the rest of those later. So here’s the current database landscape as it stands. If you look at this chart, this is from DB-Engines. And it’s essentially the estimated kind of market share or the usage rate of each of these databases. So again, going against the hype, relational databases are still by far the most popular type of database at a little over 70% market share. But you can also see that you have document databases at over 10% now. You have key-value databases also having a good chunk, and you have graph databases also growing in popularity. So the key here is that NoSQL is obviously growing, but relational is still the go-to for most applications.
Charles Mahler: 00:04:05.315 But looking deeper at these numbers, we can actually see kind of the directional trend. So this is another chart from DB-Engines. You can see over the last two years the fastest-growing is time series databases, followed by graph and key-value. So I think the important takeaway is that while relational databases are currently kind of dominant, there are companies moving towards these other types of databases. And there’s a reason for that. It’s because for whatever use case they’re fitted for, you’re going to get better performance and easier use. So over time, you’re seeing this trend of slowly but surely these specialized databases kind of chipping away at the general relational database. And obviously, you can use these in conjunction as part of the same application, but there is a directional move towards these databases.
Charles Mahler: 00:04:57.603 So what exactly makes them perform differently from each other? So just looking at some kind of design considerations, these are like the down-in-the-weeds kind of technical details that actually make databases different from each other. So it can get really confusing. And obviously, you can go on for any of these things for a very long period of time looking at the details. But at the end of the day, the way I think about it, is that the database’s job is just to store your data and then allow you to access that database or that data that’s stored later. In theory, you could do it by hand. That’s how humans did it for a long time. They kept track of it on paper. You updated it. Did stuff like that. But with computers, it’ll obviously allow us to do it faster and at scale. So the simplest trade-off is basically read vs. write performance. So the fastest way you could write data is just append it to a file, and it’s going to be very fast. The problem then is that if you don’t do any sort of indexing, you’re going to have to scan the full data set every time you want to read that data. So that is, at the core, whatever database you’re working with, it’s the trade-off between how fast you write your data and how fast you can read your data later. And what slows down write performance is creating an index and then having to maintain that index over time. That also has other overhead in terms of like RAM memory requirements to maintain that index and keep it fast. So above all else, if you just don’t want to get confused thinking about it, just think about the trade-off between reading and writing data.
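To make that read/write trade-off concrete, here’s a minimal Python sketch (illustrative only, not taken from the webinar or any real database): an append-only log has the fastest possible writes but has to scan everything on reads, while adding an index makes reads fast at the cost of extra bookkeeping on every write.

```python
# Sketch of the core trade-off: fast writes vs. fast reads.
class AppendOnlyLog:
    def __init__(self):
        self.entries = []                    # simulated on-disk log

    def write(self, key, value):
        self.entries.append((key, value))    # O(1) append, fastest possible write

    def read(self, key):
        # No index, so we scan the full data set; last write wins. O(n).
        result = None
        for k, v in self.entries:
            if k == key:
                result = v
        return result


class IndexedLog(AppendOnlyLog):
    def __init__(self):
        super().__init__()
        self.index = {}                      # key -> position in the log

    def write(self, key, value):
        super().write(key, value)
        self.index[key] = len(self.entries) - 1   # extra work on every write...

    def read(self, key):
        pos = self.index.get(key)                 # ...buys an O(1) read
        return self.entries[pos][1] if pos is not None else None
```

Both return the same answers; the only difference is where the work happens, which is exactly the trade-off being described.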
Charles Mahler: 00:06:34.734 So going a little bit into these other considerations. On-disk representation, that’s another aspect of performance for your database. Traditionally, data has been kept in rows. All the data points for a row are kept sequentially on disk. But for some of these other newer databases for analytics workloads, the rows are now being broken up and stored as columns on disk. And that makes it easier too: if you have a row of data where you just want one field, it makes sense to store it in that column format because then you don’t have to read a bunch of data that you’re just going to throw out when you’re analyzing it. Also, on disk, some databases for analytics will actually keep the same data, but they’ll have it sorted in multiple different orders and organized in different ways. So for different types of queries, where you’re going to want to access that data in a different way, it just makes it more efficient. It’s essentially like a form of pre-computing it and storing it on disk.
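The row vs. column layout can be sketched like this (a toy illustration, not how any particular database actually lays out bytes): the same three records stored row-wise and column-wise, where an aggregate over one field only has to touch one contiguous array in the columnar form.

```python
# The same three records, laid out row-wise vs. column-wise.
rows = [
    {"host": "a", "cpu": 10, "mem": 512},
    {"host": "b", "cpu": 20, "mem": 1024},
    {"host": "c", "cpu": 30, "mem": 2048},
]

# Column-wise: one array per field, values stored contiguously.
columns = {
    "host": ["a", "b", "c"],
    "cpu":  [10, 20, 30],
    "mem":  [512, 1024, 2048],
}

# Average CPU: the row version touches every record in full,
# the columnar version only reads the "cpu" array.
avg_cpu_rows = sum(r["cpu"] for r in rows) / len(rows)
avg_cpu_cols = sum(columns["cpu"]) / len(columns["cpu"])
```

Both give the same answer; the difference is how much data has to be read to get there, which is the whole point of the columnar layout.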
Charles Mahler: 00:07:29.743 Indexing is another big difference. So we’ll go into it a little bit later, but the big difference between relational and NoSQL is generally the index style. And the two big ones would be a B-Tree style vs. an LSM-Tree, which is what most NoSQL databases use, some variation of an LSM-Tree. B-Trees were designed mainly to kind of avoid disk seeks. They were designed for relational databases. And the big thing back then was that RAM was very expensive, so it was expected that you were going to be hitting disk a lot. So you wanted to minimize disk seeks as much as possible just to enhance performance for those databases. LSM-Trees, generally speaking, from a performance perspective, have better compression ratios and they’re much better for write performance. B-Trees generally are going to give you better read performance, and they also, naturally, kind of as a built-in feature, are good at doing transaction isolation because they can just lock down a range of keys in the index, and then you don’t have to worry about some of the various issues you can get when it comes to doing transactions on your data. Most databases then also have the option to do secondary indexes beyond the primary index, which would be your B-Tree or your LSM-Tree. These secondary indexes are where these specialized kinds of databases make their design decisions about what they want to focus on and what queries they really want to perform well. Then you have disk-based vs. memory-based databases, which we’ll go into with a pure in-memory database. We’ll go over some of the pros and cons of those. But the big thing is that traditional databases are designed around avoiding the worst-case scenario, which is having to go directly to disk and find that data on disk. When you go with a purely in-memory design, you can kind of just throw out all those trade-offs.
You don’t have to worry about a lot of these different issues about seeking the disk and it gives you a huge performance improvement as a result.
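As a rough illustration of the LSM-Tree idea mentioned above, here is a toy sketch (it leaves out the write-ahead log, compaction, and the binary search that real implementations use): writes land in an in-memory memtable and are periodically flushed as sorted, immutable segments, which is why LSM-based stores are so write-friendly.

```python
# Toy LSM-Tree sketch: buffered writes, sorted immutable segments.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}          # in-memory buffer for recent writes
        self.segments = []          # sorted, immutable segments (oldest first)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # write is just a dict update: fast
        if len(self.memtable) >= self.limit:
            # Flush: sort by key so segments can be merged/scanned cheaply
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Reads must check the memtable, then segments newest-first,
        # which is why reads are the weaker side of the trade-off.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:    # real LSMs binary-search sorted segments
                if k == key:
                    return v
        return None
```

Real LSM stores layer bloom filters and compaction on top of this to keep reads reasonable, but the write path above is the essence of the design.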
Charles Mahler: 00:09:37.524 Compression. This is basically how you take your raw data and save space on your hard drive and in memory by using different kinds of strategies to compress your data. There are a lot of different compression algorithms. Snappy by Google is a pretty famous one. And what that does is it actually compresses data less efficiently than some other more popular algorithms. But the benefit is that it’s much faster to compress and decompress the data. So it gives you a speed boost in terms of performance. But the trade-off is you’re going to be kind of spending more money on disk space because you’re not compressing that data as much as possible. Hot vs. cold data. This is again something with more modern databases we’re seeing, where they’re finding ways to optimize the cost of storage by moving data from memory to disk and then in some cases to object storage. So for less frequently accessed data, they can store it in the cheapest possible type of storage and optimize for cost that way. And then finally, how they handle durability and recovery. How they handle hardware failures, network failures. There are a lot of different ways a database can fail. So different types of databases have, as part of their design, different kinds of guarantees and trade-offs. And that is going to affect your performance.
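The speed-vs-ratio trade-off Snappy makes can be seen with the standard library’s zlib compression levels (Snappy itself is a separate third-party library; this just illustrates the same kind of trade): a low level compresses faster but less tightly, a high level spends more CPU for a smaller result.

```python
import zlib

# Repetitive data, the kind databases compress well.
data = b"timestamp,value\n" * 10_000

fast = zlib.compress(data, level=1)   # fast compression, larger output
best = zlib.compress(data, level=9)   # slower compression, smaller output

# Both decompress back to the exact original bytes;
# the only difference is CPU time spent vs. bytes saved.
assert zlib.decompress(fast) == data
assert zlib.decompress(best) == data
```

A Snappy-style design deliberately sits at the "fast" end of this curve, accepting a worse ratio in exchange for throughput.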
Charles Mahler: 00:11:12.637 So how to choose the right database? The big thing is it just comes down to your use case. The data access pattern. So read-heavy vs. write-heavy workloads. Analytics vs. transactional processing, like OLAP vs. OLTP. Those are the big things above all. Which goes back to: don’t just choose a database based on hype. Think about your use case and then really choose the best one for that use case. Beyond data access, you have transaction isolation. If you’re doing financial data or healthcare or anything where you don’t want your data potentially having issues, where consistency of your data is much more important than optimal performance, that’s something to think about. An example of where this can bite you is that there was a cryptocurrency exchange that went with a weaker isolation option for performance reasons, and they got hit by a bug. Basically, a user was placing a ton of very small sell orders at the same time and then they’d place a big order for bitcoin. And because the exchange didn’t have that strong consistency, they were basically giving this person hundreds of bitcoin at a time because their database was missing updates with the way they were hitting it. They didn’t have the consistency because of their settings on the database. And obviously, they lost a lot of money on that deal.
Charles Mahler: 00:12:40.322 Scalability requirements. So again, if you’re not really finished prototyping, or you’re not planning heavily on how to scale, or you just know, based on what you’re doing (if it’s an internal tool, say), that it doesn’t need to scale a lot, you probably don’t need to worry about performance and replication and built-in horizontal scaling, that sort of thing. So it might make more sense to just go with what you’re familiar with. Business stage. A big selling point of NoSQL is its schemaless nature. So if you’re kind of a start-up or a younger company, where you’re not entirely sure what your data model is going to look like, it’s a big benefit to be able to kind of change up your data schema on the fly. So that’s a selling point of NoSQL. And then in-house talent and your in-house knowledge, I think, is also an important consideration, just because having trained your whole team on something, that’s obviously a major trade-off if you want to move to a new type of database. So in a lot of cases, if you have that specialization, that skill for a certain database in-house, it might make sense to start with that.
Charles Mahler: 00:13:47.651 So the obligatory NoSQL vs. relational comparison, looking into it a little bit more. We kind of touched on it already. One thing I should note, and you’ll see this as we go through these different types of databases, is that this really isn’t a good way to delineate the different types of databases. And in some cases, you’ll see that a NoSQL database is actually closer to the relational database than some of these other types of databases are, just because they’re so different from each other, but they still get thrown in as NoSQL. So that’s kind of an interesting thing. I think it’s just, at the surface level, people break it down like this, but it doesn’t make sense in a lot of situations. One thing to keep in mind too, it’s kind of funny, is that NoSQL is seen as a new thing. But a lot of the ideas and concepts behind these databases are, in some situations, older than relational database ideas. It’s just that the timing wasn’t right. The relational model really took off around the ’70s just because it fit the early businesses that could afford computer hardware at that time. So a lot of these are just kind of circling back to ideas and computer science concepts that have been around a long time, and they’re finally kind of getting their chance to shine.
Charles Mahler: 00:15:02.475 So again, the big difference, touched on earlier, is the index structure: LSM-Tree vs. B-Tree. Relational databases are table-based in their structure. For NoSQL, there’s a ton of different types of storage, whether it’s column, which could also be classified as relational in a way, graph-based, that type of thing. You have the defined schema vs. flexible schema, which we’ve talked about. And then consistency trade-offs. It’s generally ACID compliant for your relational databases. And for NoSQL, some now do support ACID transactions, but one of the common ways to phrase it is BASE: they kind of make trade-offs on some of that in return for availability and scalability. And traditionally, you have relational where you’re going to have to scale vertically, which is just a bigger server, bigger hardware. And most NoSQL databases are designed out of the box to scale horizontally.
Charles Mahler: 00:16:12.487 So now we’re going to get into looking at each of these databases in depth. Start off with relational. Data is stored in a tabular format, with each row stored sequentially on disk. SQL is used for querying. And prime examples are probably Postgres and MySQL. On the pros and cons, it’s very versatile. You can use it for a lot of different applications and you’ll get pretty solid performance. Probably the biggest benefit is a strong ecosystem. They’re battle-tested. It’s easy to hire because a lot of people are familiar with them. And also, again, going back to built-in consistency, that’s kind of a selling point of these databases. Then the other issue, or the con side, would be that they’re challenging to scale horizontally. There is some support now for managing this, but out of the box, it’s going to be kind of challenging. You’re generally going to go for vertical scaling. The B-Tree data structure means that, in a lot of cases, it’s going to be harder to scale your write workloads. For reads, you’re going to get good performance. And with stuff like caching, you’re going to be fine for read performance for the most part. But for certain types of queries and for handling higher volumes of writes, that’s going to be your major kind of pain point.
Charles Mahler: 00:17:40.801 So relational database use cases. Again, they’re suitable for almost any type of general-purpose application, and you’re going to get solid performance. They’re especially good for anything where you want those strong consistency guarantees. That’s a good fit. So key-value databases. They’re probably one of the first major NoSQL databases to get popular. And they were made popular primarily by Amazon, who put out what’s called the Dynamo paper, where they talked about their internal database, which is basically just a simple hash table. And they published that, and that kind of put these ideas out there, put these concepts out there, and a lot of other people created their own implementations based on that paper. So obviously it’s the simplest. You can think of it as basically a hash table or dictionary, like in a programming language, but it’s just mapped to a database, and it allows you to essentially map any key to any type of data. It can be a file, it can be a URL, it can be just a number. Anything that gives you fast access to it. Some examples are going to be Amazon DynamoDB. And then for open source versions of it, you’d have Redis and Riak. So the pros and cons. The biggest selling point is going to be scalability. Just because of the structure of it, it’s very easy to horizontally scale. You get great read-and-write performance. And the schema is pretty flexible because you can just stuff anything in that key. Pretty much any block of data, you can throw that in there. Weaknesses are consistency. That’s one of the trade-offs with the kind of traditional design: in theory, if data gets updated, other nodes that are horizontally scaled might not have the most current, up-to-date values. So you might get kind of a stale value when you query it. And the other thing would be that you have minimal querying capabilities with a traditional key-value database.
I know DynamoDB has added a lot of indexing, and you can actually query it in different ways, but with the traditional kind of concept of a key-value database, all you can do is hit that key. You have no metadata on what’s being stored inside that key, so you’re limited in how you can query it.
Charles Mahler: 00:20:13.748 So some kind of example use cases are session management. So you can store like a cookie or something in that key-value database. They’re used for a lot of real-time features because you have great kind of scalability, and also you can have it distributed so you’re closer to the user. You can have it in multiple data centers. And you also have personalized recommendations. The classic example, obviously from the Dynamo paper itself, is Amazon needing to scale their shopping cart, because during Black Friday and around Christmas they were having a ton of outages. I think they were using Oracle at the time. And so they realized that with all the tricks they were trying to pull to scale their relational database, they no longer had many of the benefits anyway. By the time they’d denormalized it, by the time they’d done read replicas and all this other stuff, they’d lost the ACID guarantees anyway. So they really had none of the benefits of a relational database. And it wasn’t scaling anyway. So they realized, “Hey, we can just upfront make these trade-offs with this new database. We know up front where we’re going to lose, and we can plan around that.” And as a result, that was the creation of Dynamo.
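The key-value model described above, a hash table mapped to a database, is simple enough to sketch in a few lines (a toy in-process version; real stores like Redis or DynamoDB add persistence, replication, and expiry on top). The value is an opaque blob, which is exactly why querying inside it is so limited.

```python
import json

# Toy key-value store: a dictionary exposed behind set/get/delete.
# The store knows nothing about what's inside a value.
class TinyKV:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value          # value is an opaque blob, no schema

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


# Usage: session management, with a whole session serialized into one key.
kv = TinyKV()
kv.set("session:42", json.dumps({"user": "alice", "cart": [101, 102]}))
session = json.loads(kv.get("session:42"))
```

Note that to find, say, every session containing item 101, you would have to fetch and decode every value, which is the "minimal querying capabilities" weakness in miniature.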
Charles Mahler: 00:21:26.975 So onto document databases. This is probably the most general-purpose type of NoSQL database, I’d say. You can use it in a lot of the same ways you would use kind of a relational database. You get a lot of the same querying capabilities, performance capabilities. The benefit would be that you aren’t quite as restricted. It’s semi-structured data, so you’re not completely fixed on a table format. You still have some flexibility. But performance characteristics are very similar to a relational database. And some open source examples are MongoDB and CouchDB. Pros are the flexible schema, the performance, the built-in horizontal scalability out of the box. From a developer productivity perspective, the document style kind of maps better to your mental model. As opposed to relational database, where you kind of have to fit, like, “This is the data I want,” and you got to map it to sometimes multiple tables. For a certain use case, you have to put your data in either a single table and multiple and then you have to join it back together. So from that perspective, it can be easier just mentally working with a document database. You also get the benefit of data locality. So that’s stuff like not having — it’s essentially not having to do joins to pull that stuff together. That gives you a performance benefit because that object is stored on disk. So if you need a user profile, all that information is located on disk right next to each other. So it’s faster to grab that rather than have that data stored kind of spread across your hard drive. Weakness is going to be in some ways consistency. Some of these databases have added different types of transactions and stuff like that. But there’s varying levels of isolation on that. So it’s kind of not as consistent or reliable compared to kind of relational database. You don’t really know what you’re going to get in some ways. And different implementations. 
These different databases have different standards. So you can’t really 100% rely on it. So use cases. Again, like a relational database, it’s going to be pretty suitable for a lot of general-purpose applications. And it’s especially beneficial when you want to iterate a lot on your data model and stuff like that. You’re going to get some benefit from that.
Charles Mahler: 00:24:01.571 Graph databases. So these are used for kind of analyzing the relationships between data sets, between different points in your data set, different objects. They have specialized query languages. As of right now, they’re working on a standard one, but there’s currently no official graph query language that works across these databases, which is kind of a pain point. There are different types. You have native, where the data on disk is stored in a way that’s more efficient and that kind of maps better to the relationships in the graph data. So that’s kind of a talking point between these different databases, that there’s native graph and there’s non-native graph. That’s something you’ll see a little bit of. The big thing here is that currently, from that DB-Engines chart, about 5% of databases in use were graph databases. But if you look at Fortune 100 companies, 75% of them are using graph databases. So there’s a huge difference in the utilization of these between top-tier companies and the general kind of ecosystem. And I think that’s because, right now, they’re still not completely mature. So these big companies are seeing a lot of benefits from using graph databases, but the rest of the market hasn’t quite caught up. And that’s happened in other areas where we see these technologies: they start off at the big companies, and over time, different kinds of vendors make them easier to use and the adoption grows over time. I think that’s kind of where we’re at with graph databases. It’s not entirely there. There’s no standard query language. And I think eventually, once they become easier to use, you’re going to continue to see that, just like these big companies are seeing, you’re going to see everybody else start to use these where it makes sense.
Charles Mahler: 00:25:53.278 So the pros and cons. Graph traversals with these databases are constant time. If you try to do that with a different type of database, it’s not going to scale as well. As the graph gets bigger, you’re going to get pretty bad performance if you try to fit these types of queries into a nonspecialized database. Then you have developer productivity. There are some examples, like trying to do a graph-type query with SQL: it’s going to result in a very, very long query. But with these built-in graph languages, it’s a single line. So you get developer productivity. If you’re doing stuff where you’re working with graph-type data, finding relationships between data points, you get a lot of developer productivity benefits from using these databases. You also have a flexible schema, as you’d expect. It’s basically just an object. You have your edges. You have your nodes. Each node is basically just a point that you can add new properties to. So that makes it easy. You don’t have to pre-think what your data’s going to look like. And it’s easy, again, to establish new relationships. It’s designed for that. So if all of a sudden you add a new field, a property, to a data point, it’s easy to connect those together. The obvious con is that it’s a very specialized type of database. If you’re doing a standard web app where you’d normally use a relational database, it really doesn’t make a lot of sense to use a graph database in that situation. If your data doesn’t have a ton of interconnectivity and you’re not going to be analyzing those connections, it really doesn’t make sense to use one. Some example use cases are fraud detection. Social networks are probably the most obvious, like Facebook and LinkedIn. They both heavily use graph databases for finding, like, “Okay. You’re friends with this person. This person’s friends with these people.
It might make sense for you to try to connect with them or become friends with them,” that sort of thing. And then the same thing goes for recommendation features: “We know these customers have bought this product. We know you also bought this product. And they bought this third product. So then I’m going to recommend that to you based on that.” So they can find connections between these data points, and it gives them a higher accuracy of, like, “Hey, let’s put this in front of this person and see if they like it.”
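That friend-of-friend traversal can be sketched with a plain adjacency structure (a toy illustration of what a graph database does natively, with made-up names). Each node holds direct references to its neighbours, so hopping to friends-of-friends is cheap, with no joins involved.

```python
# Toy social graph as adjacency sets: node -> set of direct friends.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave":  {"bob", "carol"},
    "erin":  {"carol"},
}

def suggest(graph, person):
    """'People you may know': friends-of-friends who aren't direct friends."""
    direct = graph[person]
    candidates = set()
    for friend in direct:
        candidates |= graph[friend]      # one hop per neighbour, no joins
    return candidates - direct - {person}
```

Expressing the same two-hop query in SQL would mean a self-join on a friendships table; in a graph query language it is typically a one-liner, which is the developer-productivity point above.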
Charles Mahler: 00:28:20.742 So time series databases. So this is what we do at InfluxData. As you’d expect, they’re designed for working with time series data. They’re kind of interesting because they need some of the characteristics of both a row-based database and a column-based database. They kind of hit both ends of the spectrum. And from a performance perspective, they have to be able to handle both those types. In some situations, you might want all the data from a single server or a single sensor. You want everything from that. In another case, you might want one metric, like the CPU, for every server in your fleet of servers, and that’s an entirely different type of query that’s hitting the data from a different angle in terms of performance. And a single database has to be able to serve all those types of queries. The big thing, just from a utilization standpoint, is that they’re optimized for high write throughput and for being able to query that data soon after it’s been ingested, and also based on time ranges. So the indexing is optimized for being able to do those types of queries. They’re very specific, and a standard database really would not be ideal for trying to query that type of data.
Charles Mahler: 00:29:39.053 So pros and cons. Obviously, very fast data ingest and query performance. The big thing is also developer productivity. Most of these databases are going to include out-of-the-box stuff for common kinds of ways of using your time series data. So retention policies, aggregations, built-in functions in the query languages to aggregate data, to do certain types of time series analysis. Instead of having to write a bunch of custom code or kind of write advanced queries, you get that out of the box. You can do it in one line. The cons are going to be part of that optimization, which is that with time series data, you really don’t want to rewrite history anyway. So they’re not optimized for updating a data point. And they’re also not optimized for deleting specific data points. They are good at data retention policies. Like, “After a week, delete all this chunk of data. I don’t want it anymore.” But for deleting specific data points, they’re really not great performers. But this kind of goes back to the trade-offs, which is that if you’re using a time series database for the typical workload, it’s not an issue, because you know ahead of time that that trade-off is there. Common use cases. You have monitoring of various types. Could be application monitoring. Could be monitoring actual hardware in the real world, sensor data, IoT-type stuff. And you also have a lot of financial-type use cases, and those were some of the first companies to build these databases. A lot of Wall Street firms wanted to be able to analyze their time series data efficiently.
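Two of those out-of-the-box features, retention and aggregation over time windows, can be sketched in plain Python (an illustration of the concepts only, not InfluxDB’s actual implementation): retention drops whole time ranges cheaply, and downsampling averages points into fixed windows.

```python
# Toy time series: (epoch seconds, cpu %) points.
points = [
    (0, 10.0), (30, 12.0), (60, 11.0), (90, 15.0), (120, 14.0),
]

def apply_retention(pts, now, max_age):
    # Retention: drop everything older than max_age in one pass.
    # Dropping a whole range is cheap; deleting individual points is not.
    return [(t, v) for t, v in pts if now - t <= max_age]

def downsample(pts, window):
    # Downsampling: group points into fixed windows and average each one,
    # the kind of thing TSDB query languages provide as a one-liner.
    buckets = {}
    for t, v in pts:
        buckets.setdefault(t // window, []).append(v)
    return {w * window: sum(vs) / len(vs) for w, vs in sorted(buckets.items())}
```

A time series database bakes both operations into the storage engine and query language, so you get them without custom code like this.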
Charles Mahler: 00:31:26.803 So column databases. So we touched on this. Rather than storing it [inaudible], it’s stored on disk in a column format. The benefits of this are that each column is the same type of data. So they can use optimized compression algorithms, which helps with the size of the data and also helps with the processing speed. They are primarily used — they’re almost always used — for analytic-type workloads. And some examples would be ClickHouse, you have Vertica, you also have Redshift. And you could also throw in — InfluxDB’s new storage engine is called IOx. It’s open source, and that is primarily column based. And you can check that out. That’s on GitHub. And we’ll be talking about it a lot at InfluxDays.
Charles Mahler: 00:32:16.487 So pros and cons. The big one is that it’s more efficient for pretty much any type of analytics query where you don’t need that [inaudible] data. You just need one or two fields, to maybe take an aggregate, to maybe group by a couple of different things. It’s more efficient for that from an I/O perspective because you’re not pulling in a bunch of data you don’t need. You also have better data compression. So you’re saving money on storage costs. There are no bottlenecks with I/O from memory to disk because you’re pulling in less data and it’s also more compressed. And another big thing is that you get better kind of CPU utilization. You get vectorized instructions. ClickHouse and all these other databases are optimized for these new kinds of CPU architectures, and you get much better processing of your data. In a lot of cases, they can process the data in parallel on different CPU cores instead of being limited to a single core. The cons are going to be that writing data can be less efficient, because instead of just being able to write a new row in a single sequential place on disk, the way data is written is a little bit different. A little bit less efficient. And generally, instead of writing point by point, you’re going to want to write in batches. And it’s going to be slower if you end up using it kind of like a relational database, where you grab every piece of data that would normally be in a row anyway. That’s also going to hit your performance.
Charles Mahler: 00:33:52.617 Some specific numbers: generally, you're going to be about 30% slower for that type of operation, where you try to grab every row and column. Based on various research papers and benchmarks, it's roughly 30%. But for analytics workloads, this is from ClickHouse. They have an open source benchmarking tool, and it shows how ClickHouse performs compared to Postgres, MySQL, and MongoDB. We see about 200 times better performance against Postgres, almost 600 times better performance than MySQL, and almost 800 times better performance than MongoDB. And this is across, I think, 42 different analytics-type queries on the same hardware. You can see that for those types of queries, the performance is much better. Of course, if you had ClickHouse or these other column databases do a normal relational workload, they would struggle with that too. So it's not entirely fair; it's comparing an analytics workload against more general-purpose databases.
Charles Mahler: 00:35:09.445 So column use cases. Again, analytics, data warehousing, and specifically observability. Anything where you want to pull specific data points and analyze them across huge amounts of data is a good fit. One real-world example is Uber. They moved from the tool they were using to a column database and got a three times reduction in storage, so three times better data compression, which obviously saved them a lot of money. For the same amount of hardware, they got 10 times better performance, and overall they saw a 50% drop in their hardware costs.
Charles Mahler: 00:35:48.807 In-memory databases. The prime examples of this are going to be Redis and Memcached. The big reason for the growth of these is that RAM has become much cheaper over time, exponentially cheaper, so it became viable to actually use this type of database rather than being reliant on disk. The big thing is, similar to how Dynamo lets you stop worrying about transactions, here you don't have to worry about storing any data to disk. That allows you to create more optimized data structures in memory, more creative and more useful data structures that would be very hard to map to disk. Because you just don't even worry about that, it becomes pretty easy, and it gives you much better performance characteristics as a result, since you don't have to worry about encoding and decoding your data to disk. I think there have been some studies showing that even comparing a normal database running entirely in memory to Redis, you're going to get about five times faster performance with Redis, just because the way the data is structured and managed in memory is more efficient when you don't have to worry about writing to disk.
Charles Mahler: 00:37:05.938 So pros and cons. Again, high performance. Because it's RAM, pure memory, you're going to get low latency, and you have a lot more versatile data types built into the database. The downside is that RAM is much more expensive than disk, even today. So for certain sized data sets, it's going to be, at the very least, expensive to use an in-memory database. You also have to manage scaling horizontally for very large data sets. There's only so much RAM you can hook up to one machine, so you'll still have to think about how you're going to spread that database across your servers and how you're going to shard that data. And in most cases, you're still going to need a secondary database for actual persistent storage.
Charles Mahler: 00:37:55.905 So first, the most common use case is probably caching. That's basically where you take the result of a database query and store it in memory so you're not constantly hitting your database for the same information. You can also do anything with real-time applications, anything where you need to frequently update data; that's a good fit for in-memory databases. Some are even using it as part of a pub/sub architecture, and other architectures where they use the in-memory database to pass messages around. Probably one of the most famous examples of using an in-memory database is Facebook and how they managed Memcached. They [inaudible] billions of requests per second out of this in-memory database. The goal, obviously, is that instead of having to hit your database on the back end, you put this cache in front of it and it serves as much of your traffic as possible.
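The cache-aside pattern described here can be sketched with a plain dict standing in for Redis or Memcached. The function names, the TTL value, and the fake query are all invented for illustration:

```python
import time

# Minimal cache-aside sketch. A dict stands in for Redis/Memcached;
# slow_database_query is a made-up stand-in for a real database call.
cache = {}
TTL = 60  # seconds an entry stays fresh (illustrative value)

def slow_database_query(user_id):
    # Pretend this is an expensive round trip to the backing database.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    entry = cache.get(user_id)
    if entry is not None and time.time() - entry["at"] < TTL:
        return entry["value"]  # cache hit: the database is never touched
    value = slow_database_query(user_id)  # cache miss: hit the database...
    cache[user_id] = {"value": value, "at": time.time()}  # ...then fill the cache
    return value
```

Every request after the first within the TTL is served purely from memory, which is exactly how a cache in front of a backend absorbs most of the read traffic.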
Charles Mahler: 00:38:51.469 Search databases. These are kind of a specialized document database. In a similar way, you have an object or a document that holds information, stored on disk, but there's a big emphasis on full-text search and being able to search through your data. Pros and cons: they have built-in features for searching through your data, ranking algorithms, things that make it easier to query your data, and you get performance and scalability out of the box. Horizontal scaling, being a NoSQL database, is a big part of that. One thing they do struggle with, because the data is so heavily indexed for full-text search and those types of use cases, is that scaling writes is pretty tough with these databases; that's usually the biggest hit. Use cases: log analysis, full-text search, real-time autocompletion. Think of an eCommerce store where you type in a few words and it comes up with the potential products you could be searching for. These are common use cases for search databases.
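The heavy indexing mentioned above is, at its core, an inverted index mapping each term to the documents containing it. A tiny sketch of that structure (real search databases add tokenization, stemming, and relevance ranking on top, and the documents below are made up):

```python
from collections import defaultdict

# Tiny inverted index: the core structure behind search databases.
docs = {
    1: "red running shoes",
    2: "blue running jacket",
    3: "red winter jacket",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)  # term -> set of documents containing it

def search(query):
    # Return documents containing every query term (AND semantics).
    results = set(docs)
    for term in query.split():
        results &= index.get(term, set())
    return sorted(results)
```

Queries are fast because each term is a direct lookup, but this also shows the write-scaling con: adding one document means updating one index entry per distinct term it contains.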
Charles Mahler: 00:40:09.935 Vector databases. These are kind of based on machine learning models. They allow you to store the vector embeddings that you generate from your model, and then you can serve those rather than having to continually run the model. It allows you to efficiently reuse that output, search through it, find similarities, and find how different certain pieces of data are from each other. What's cool about it is that it's not just text data. You can feed it anything and it will generate these vector embeddings. So it could be an image: how similar are these images? How similar are these videos? Text? Audio? Pretty much any type of data can be boiled down into these vector embeddings and then stored in these specialized databases. A lot of these were developed, similar to some of these other technologies, starting off at big companies. I've heard Amazon has had internal versions of this for years. And now it's finally trickling out into [inaudible] and open source versions of it, and over time the rest of us can use these cool technologies.
Charles Mahler: 00:41:25.341 So big pros and cons. You can get hybrid storage, meaning you can store it on disk or in memory for performance and cost reasons, and you can tune that trade-off. It's horizontally scalable, and you get what you'd expect from a vector database: efficient vector search. The downside, of course, is that it's very specialized and still a very new technology. So the improved search results you get versus a more standard search technology might not be worth it right now; it might just not be worth managing another database for that purpose. But there are some cool use cases for it. You have things like duplicate removal. If you're running some sort of site with user-generated content and people are spamming the same image or the same video, you can run those through, generate the vector embeddings, and say, "Okay, these are close enough that they're basically the same image, let's just remove them." You can also do anomaly detection, ranking, and recommendations. Basically, anything related to the similarity or the difference between data points is what this is good for. And you see it a lot now. [inaudible] that open source project, some of their users are Walmart, Nvidia, eBay. A lot of big platforms like that want to be able to find, similar to a graph database, similarities between data points and then find ways to map those to other people.
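The duplicate-removal idea can be sketched with brute-force cosine similarity over embeddings. The vectors below are invented for illustration; in practice a model produces them, and a vector database replaces the brute-force comparison with an approximate-nearest-neighbor index:

```python
import math

# Brute-force cosine similarity: the operation vector databases make
# fast at scale. These embeddings are made up, not real model output.
embeddings = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.89, 0.11, 0.01],  # near-duplicate of img_a
    "img_c": [0.0, 0.2, 0.9],     # a very different image
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def near_duplicates(threshold=0.99):
    # Flag pairs whose embeddings are nearly identical in direction.
    names = sorted(embeddings)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]
```

Here `near_duplicates()` flags only the `("img_a", "img_b")` pair. The same similarity score drives recommendations and anomaly detection; only the threshold and direction of the question change.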
Charles Mahler: 00:43:04.442 NewSQL databases. This is kind of a buzzword, but the idea behind it is that you get the best of both worlds: the good things you'd want from a relational database while also being able to scale like a NoSQL database. They have high consistency, so they're able to do transactions across different regions. Even though it's a distributed database, you still get those consistency guarantees. There are open source versions: you have CockroachDB, you have TiDB. What inspired a lot of these was Google's paper on Spanner, their own internal database. The inspiration there was that they realized that eventually, most developers are going to want to do some sort of two-phase commit. Developers were recreating, or trying to recreate, the ability to do these two-phase commits in application code anyway, so at the end of the day it just makes sense to design the database to support that rather than having developers try to recreate it in code.
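The two-phase commit these databases build in can be sketched as a toy coordinator. In-process objects stand in for nodes here; real implementations handle networks, timeouts, and coordinator failure, all of which this sketch deliberately ignores:

```python
# Toy two-phase commit: the coordination NewSQL databases bake in so
# application code doesn't have to recreate it. Names are illustrative.
class Participant:
    def __init__(self, name, will_succeed=True):
        self.name = name
        self.will_succeed = will_succeed
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if this node can durably apply the change.
        self.state = "prepared" if self.will_succeed else "aborted"
        return self.will_succeed

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: every participant must vote yes.
    if all(p.prepare() for p in participants):
        # Phase 2: all voted yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Any "no" vote aborts the whole transaction on every node.
    for p in participants:
        p.rollback()
    return False
```

Either every region ends up committed or none does, which is the cross-region consistency guarantee being described, and also hints at the latency cost: the coordinator must hear back from every region before the transaction finishes.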
Charles Mahler: 00:44:10.025 So pros and cons. You get your SQL support, you get your horizontal scalability, and it's generally described as cloud native, so it's a good fit if you're already building with that type of architecture. The downsides are going to be the complexity, even if it's somewhat hidden. I've seen people complain that because it's so new, and there's a lot of magic going on, even though it's supposed to just work seamlessly, there are still some growing pains. You have potential latency issues where, again going back to that magic, you make what looks like a basic query, but behind the scenes it's actually being sent to a different data center. Depending on the implementation, they might not completely support SQL. And the ecosystem around these is still pretty young; it's a very new technology. Use cases: it's like relational. The idea is for it to be just like a relational database, except out of the box you get massive scale. You also have some support for something called HTAP, Hybrid Transactional/Analytical Processing. Some of these NewSQL databases have also added column extensions so you can do column storage, and alongside the relational features, you can do your analytics workloads as well. So there are a lot of promises. If it works well, it'll be interesting, but it's still a pretty young technology.
Charles Mahler: 00:45:44.041 And just some future database trends. One big thing I'm seeing a lot is multi-model databases. As an example, MariaDB or MySQL are adding extensions for column support for analytics workloads. I've also seen some graph-relational databases, where they're building on Postgres, for example, with extensions on top of it. So you get a lot of the benefits of your relational database, and then they add new capabilities on top of that. That's one thing you're going to see a lot of: databases really aren't staying in their niche. They're branching out and adding more functionality over time.
Charles Mahler: 00:46:30.586 Another interesting thing I've seen is machine learning-based indexing. We talked about B-Trees and LSM-Trees; those are the classical database indexes. There have been some research papers on using machine learning to basically create custom indexes for whatever data set you're working with. That allows pretty much optimal performance, because the index can be shaped specifically to your particular data set and what you're doing with it. Everyone's data is a little bit different, and that's what they're experimenting with here. From some preliminary results, they've seen things like two to three times faster read performance from these machine learning indexes, and the index itself is several orders of magnitude smaller. That means you're spending less money on storing your index in memory and that sort of thing. So it's faster while also being cheaper, which is a pretty good combination.
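The learned-index idea from those papers can be sketched as: fit a simple model that predicts a key's position in a sorted array, then fix up the prediction with a local search. Everything below is synthetic (made-up keys, a least-squares line as the "model"); real learned indexes use hierarchies of models with bounded error windows:

```python
# Sketch of a learned index: a model predicts where a key sits in a
# sorted array, replacing tree traversal with arithmetic plus a short scan.
keys = sorted(k * k for k in range(1, 101))  # synthetic, non-uniform keys
n = len(keys)

# "Train" a least-squares line: position ~ slope * key + intercept.
mean_k = sum(keys) / n
mean_p = (n - 1) / 2
slope = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / sum(
    (k - mean_k) ** 2 for k in keys
)
intercept = mean_p - slope * mean_k

def lookup(key):
    # The model's guess is two multiplications, and the "index" is just
    # two floats: that's where the size savings come from.
    i = max(0, min(n - 1, int(round(slope * key + intercept))))
    # Correct the prediction with a local scan (real systems bound this
    # error window per model).
    while i > 0 and keys[i] > key:
        i -= 1
    while i < n - 1 and keys[i] < key:
        i += 1
    return i if keys[i] == key else -1
```

The trade-off is visible even in the toy: the model is tiny and fast to evaluate, but how far the local scan has to walk depends on how well the model fits this particular data distribution.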
Charles Mahler: 00:47:33.803 So that's pretty much it for this presentation. I see some Q&A, so I'll answer that. And again, if anyone has questions, be sure to use the question-and-answer segment. For more, you can check out our website, and we have Slack if you have additional questions about InfluxDB in general. For additional learning resources, we have InfluxDB University, where we cover pretty much anything related to InfluxData, Telegraf, InfluxDB, that sort of thing. And then we have InfluxDays coming up, like I said. So if you want to hear about our new column-based storage engine, you can check that out, plus some of the new features in Flux, our query language, and a bunch of other stuff.
Charles Mahler: 00:48:25.876 So on to questions. Yes, there will be a recording of this. An email with a link to it will be sent out later. Does it support ODBC connectivity? I'm not sure which database that was in reference to, but I think the plan, at least for the new IOx storage engine, is that it will have support for a lot of that. I see that question was asked quite a while ago. Not seeing anything else beyond that. So the goal of this was to give you an overview of the database landscape, the ecosystem, and the different options. A lot of these are somewhat new, so if you haven't really been paying attention to the database ecosystem or heard about them before, hopefully this helps you out. That's going to be pretty much it. Hope everybody had a good day and learned something. Like I said, be sure to check out some of the other resources, and look out for that email coming soon with the recording and some other useful stuff.
Technical Marketing Writer, InfluxData
Charles Mahler is a Technical Marketing Writer at InfluxData where he creates content to help educate users on the InfluxData and time series data ecosystem. Charles' background includes working in digital marketing and full-stack software development.