Column Database

What is a column database?

Column databases, also known as columnar or column-oriented databases, are databases that store data on disk in columns rather than rows like traditional databases.

Early databases were designed for standard business workloads that are classified as Online Transaction Processing(OLTP). Row based relational databases work well for this type of situation because often queries need to use a number of values from the same row, so having all the data located next to each other on disk worked well.

An example of this would be returning all information about a specific product for a web page or getting all user information for their profile page. Having the entire row sequentially on disk allows all that data to be retrieved from disk in a single seek.

However, for analytics workloads where often only a few column values from a row are needed but the user wants that value for every row in the database, this row based storage isn’t efficient. The database is forced to scan the entire dataset and then throw out a bunch of unneeded data to get something simple like the average value for a column.

This type of workload is known as Online Analytical Processing(OLAP). The solution to get better performance for these types of queries was to design column-oriented databases that allowed users to grab only the columns of data they needed to fulfill their query rather than having to scan entire rows of unnecessary data.

Benefits of using a column database

So why bother with setting up and managing another database? Developers aren’t doing it for fun, they are doing it because column oriented databases provide some major benefits for businesses across all industries. Let’s look at a few of the primary benefits of column databases.

Performance

The obvious selling point of a column database is performance above all else. But how much of a difference in performance are we talking about? As stated before, it isn’t a small amount compared to relational databases, it’s orders of magnitude. Clickhouse is a popular open source column database that provides a benchmark suite of queries on identical hardware for popular databases which show the following results:

  • 195x faster than PostgreSQL
  • 570x faster than MySQL
  • 755x faster than MongoDB

clickhouse benchmarks screenshot

While no benchmark is perfect, this one gives a pretty fair look at performance across a dataset containing around 100 million records and 42 different types of analytics queries to test databases across a full range of analytics queries that are commonly used in production like grouping, sorting, and aggregating data points.

Cost savings

An obvious bonus of the same optimizations that provide such excellent performance for column databases is that they can get much better performance on less hardware. At the scale many businesses operate this can result in millions of dollars saved in hardware and storage costs.

Improved productivity and insights

For many analytics queries columnar databases can improve performance by orders of magnitude. This makes it much easier for developers and business analysts to iterate and innovate on how they are using their data. It also makes them much more productive as a result because in many cases queries that could take minutes or hours on a standard database can be returned in seconds.

FAQs

How does a columnar database hold data?

A columnar database holds data in columns instead of in a row-based format.

Is a columnar database relational or noSQL?

There are different outlooks on what specific builds define a noSQL database, and columnar databases themselves can differ quite a bit. Some of them that only change the row-based system to a column-based system may still be relational, and other kinds of variable columnar databases are categorized as noSQL.

What are some other names for a columnar database?

A columnar database might also be called a column-oriented database or a column database.

Are Columnar Databases noSQL?

Here’s where the jargon around the database industry gets a bit complicated. Any type of database design (such as column-oriented databases) that is not the traditional relational database design sometimes is called noSQL.

However, unlike certain key-value stores and other setups, data in conventional column-oriented databases are still often oriented according to location and not identity. Therefore, some purists say that the conventional, simpler columnar database is not a noSQL design in itself.

On the other hand, certain kinds of column-oriented databases can be characterized as noSQL.

For example, take a wide-column store database format where the naming and formatting of columns can vary from row to row. If this type of columnar database is similar to a key-value store system, that database build might qualify as noSQL in nature. In addition, it’s important to look at how the columnar database stores data and whether the columnar database preserves the components of the ACID design.

In a traditional database, ACID provides atomicity, consistency, isolation, and durability. Some other database builds give up several of those requirements to provide better performance. Another alternating system has emerged called BASE (Basically available, Soft state, Eventual consistency) and a noSQL columnar database model might use this instead.

So a noSQL database that doesn’t enforce ACID may be fundamentally different from another type of columnar database that is only changing the row requirement to a column requirement and not doing away with other aspects of the relational database.

In order to avoid splitting hairs, proponents of column-oriented database design could say that whether it is noSQL or not, it does innovate from the early relational database designs that were prevalent in past decades.

Related resources