Partitioning for Performance in a Sharding Database System
Nga Tran
Nov 18, 2022
This article was originally published in InfoWorld and is reposted here with permission.
Partitioning can provide a number of benefits to a sharding system, including faster query execution. Let’s see how it works.
In a previous post, I described a sharding system to scale throughput and performance for query and ingest workloads. In this post, I will introduce another common technique, partitioning, that provides further advantages in performance and management for a sharding database. I will also describe how to handle partitions efficiently for both query and ingest workloads, and how to manage cold (old) partitions where the read requirements are quite different from the hot (recent) partitions.
Sharding vs. partitioning
Sharding is a way to split data in a distributed database system. Data in each shard does not have to share resources such as CPU or memory, and can be read or written in parallel.
Figure 1 is an example of a sharding database. Sales data for the 50 states of a country is split into four shards, each containing data for 12 or 13 states. By assigning a query node to each shard, a job that reads all 50 states can be split among the four nodes running in parallel and completed roughly four times faster than it would be by a single node reading all 50 states. More information about shards and their scaling effects on ingest and query workloads can be found in my previous post.
Partitioning is a way to split data within each shard into non-overlapping partitions for further parallel handling. This reduces the reading of unnecessary data, and allows for efficiently implementing data retention policies.
In Figure 2, the data of each shard is partitioned by sales day. If we need to create a report on sales of one specific day such as May 1, 2022, the query nodes only need to read data of their corresponding partitions of 2022.05.01.
The rest of this post will focus on the effects of partitioning. We’ll see how to manage partitions efficiently for both query and ingest workloads on both hot and cold data.
Benefits of partitioning
The three most common benefits of data partitioning are data pruning, intra-node parallelism, and fast data deletion.
Data pruning
A database system may contain several years of data, but most queries need to read only recent data (e.g., “How many orders have been placed in the last three days?”). Partitioning data into non-overlapping partitions, as illustrated in Figure 2, makes it easy to skip entire out-of-range partitions and to read and process only the small set of relevant data, returning results quickly.
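To make pruning concrete, here is a minimal sketch in Python. The catalog dictionary, partition keys, and file names are all hypothetical; the point is that a time-range predicate eliminates whole partitions before any file is opened.

```python
from datetime import date

# Hypothetical in-memory catalog: partition key (sales day) -> files.
partitions = {
    date(2022, 4, 29): ["sales_2022.04.29.parquet"],
    date(2022, 4, 30): ["sales_2022.04.30.parquet"],
    date(2022, 5, 1):  ["sales_2022.05.01.parquet"],
    date(2022, 5, 2):  ["sales_2022.05.02.parquet"],
}

def prune(partitions, start, end):
    """Return only the files whose partition key falls inside [start, end]."""
    return [f for day, files in partitions.items()
            if start <= day <= end
            for f in files]

# A query for May 1, 2022 touches one partition instead of all four.
files_to_read = prune(partitions, date(2022, 5, 1), date(2022, 5, 1))
```

A real engine keeps this partition-to-file mapping in its metadata catalog and evaluates the range check there, so pruning costs a catalog lookup rather than any data I/O.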
Intra-node parallelism
Multithreaded processing and streaming data are critical in a database system to fully use the available CPU and memory and obtain the best possible performance. Partitioning data into small partitions makes it easier to implement a multithreaded engine that executes one thread per partition. For each partition, more threads can be spawned to handle the data within it. Knowing partition statistics such as size and row count helps allocate the optimal amount of CPU and memory to specific partitions.
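The thread-per-partition idea can be sketched as follows. The partition contents and the per-partition aggregate are hypothetical stand-ins for a real scan operator; the structure is simply "one worker per partition, then combine the partial results."

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: partition key -> rows (here, just sales amounts).
partitions = {
    "2022.05.01": [120, 80, 40],
    "2022.05.02": [200, 50],
}

def scan_partition(rows):
    # Per-partition work (here a partial sum) runs on its own thread;
    # partitions are non-overlapping, so no coordination is needed.
    return sum(rows)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(scan_partition, partitions.values()))

total = sum(partials)  # combine the per-partition partial results
```

Because partitions do not overlap, the workers share nothing and the final combine step is trivial; this is the same split/combine shape a query engine uses for partitioned aggregation.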
Fast data deletion
Many organizations keep only recent data (e.g., data of the last three months) and want to remove old data ASAP. By partitioning data on non-overlapping time windows, removing old partitions becomes as simple as deleting files, without the need to reorganize data and interrupt other query or ingest activities. If all data must be kept, a section later in this post will describe how to manage recent and old data differently to ensure the systems provide great performance in all cases.
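A minimal sketch of such a retention job, assuming partitions live in directories named by ISO date (the layout and 90-day window are illustrative, not prescriptive):

```python
import shutil
import tempfile
from datetime import date, timedelta
from pathlib import Path

def drop_expired_partitions(root, retention_days, today):
    """Delete whole partition directories older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    for part_dir in list(Path(root).iterdir()):
        part_day = date.fromisoformat(part_dir.name)  # dirs named YYYY-MM-DD
        if part_day < cutoff:
            shutil.rmtree(part_dir)  # one directory removal, no data rewrite

# Demo: three daily partitions, keep only the last 90 days.
root = Path(tempfile.mkdtemp())
for name in ["2022-01-15", "2022-04-30", "2022-05-01"]:
    (root / name).mkdir()
drop_expired_partitions(root, retention_days=90, today=date(2022, 5, 1))
remaining = sorted(p.name for p in root.iterdir())
```

The key property is that expiry never touches row data: dropping a partition is a metadata/file-system operation, so it does not block concurrent queries or ingest on other partitions.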
Storing and managing partitions
Optimizing for query workloads
A partition already contains a small set of data, so we do not want to store a partition in many smaller files (or chunks, in the case of an in-memory database). A partition should consist of just one or a few files.
Minimizing the number of files in a partition has two important benefits: it reduces I/O operations when reading data to execute a query, and it improves data encoding/compression. Better encoding in turn lowers storage costs and, more importantly, speeds up query execution because less data must be read.
Optimizing for ingest workloads
Naive ingestion. To keep the data of a partition in a single file for the reading benefits noted above, every time a set of data is ingested, it must be parsed and split into the right partitions, then merged into the existing file of its corresponding partition, as illustrated in Figure 3.
The process of merging new data with existing data often takes time because of expensive I/O and the cost of merging and re-encoding the partition’s data. This leads to long latency both for the acknowledgment to the client that the data was successfully ingested and for queries of the newly ingested data, which will not be available in storage until the merge completes.
Low latency ingestion. To keep the latency of each ingestion low, we can split the process into two steps: ingestion and compaction.
During the ingestion step, ingested data is split and written to its own file, as shown in Figure 4. It is not merged with the existing data of the partition. As soon as the ingested data is durable, the ingest client receives a success signal, and the newly ingested file becomes available for querying.
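A rough sketch of this ingestion step, assuming a hypothetical directory-per-partition layout and JSON files standing in for a real columnar format:

```python
import json
import tempfile
import uuid
from pathlib import Path

root = Path(tempfile.mkdtemp())

def ingest(batch):
    """Split a batch by partition key and write each split to its OWN new file.

    No merge with existing partition files happens here, so the client can be
    acknowledged as soon as the writes are durable.
    """
    written = []
    for day, rows in batch.items():
        part_dir = root / day
        part_dir.mkdir(exist_ok=True)
        path = part_dir / f"{uuid.uuid4().hex}.json"  # unique small file
        path.write_text(json.dumps(rows))
        written.append(path)
    return written  # success signal: data durable and immediately queryable

# Two ingests into the same partition produce two separate small files.
ingest({"2022.05.01": [1, 2]})
ingest({"2022.05.01": [3]})
n_files = len(list((root / "2022.05.01").iterdir()))
```

Each write is append-only and independent, which is what keeps per-batch latency low; the cost is the accumulation of small files that the compaction step later cleans up.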
If the ingest rate is high, many small files will accumulate in the partition, as illustrated in Figure 5. At this stage, a query that needs data from a partition must read all of the files of that partition. This of course is not ideal for query performance. The compaction step, described below, keeps this accumulation of files to a minimum.
Compaction is the process of merging the files of a partition into one or a few files for better query performance and compression. For example, Figure 6 shows all of the files in partition 2022.05.01 being merged into one file, and all of the files of partition 2022.05.02 being merged into two files, each smaller than 100MB.
The decisions regarding how often to compact and the maximum size of compacted files will differ between systems, but the common goal is to keep query performance high by reducing I/O (i.e., the number of files) and making the files large enough to compress effectively.
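Continuing the hypothetical layout from the ingestion sketch, compaction of one partition might look like this (JSON files and an in-memory merge stand in for a real columnar merge with sorting and encoding):

```python
import json
import tempfile
import uuid
from pathlib import Path

# Simulate a partition holding several small ingested files.
part_dir = Path(tempfile.mkdtemp()) / "2022.05.01"
part_dir.mkdir()
for rows in ([1, 2], [3], [4, 5, 6]):
    (part_dir / f"{uuid.uuid4().hex}.json").write_text(json.dumps(rows))

def compact(part_dir):
    """Merge all small files of a partition into a single compacted file."""
    small_files = sorted(part_dir.glob("*.json"))
    merged = []
    for f in small_files:
        merged.extend(json.loads(f.read_text()))
    out = part_dir / "compacted.json"
    out.write_text(json.dumps(sorted(merged)))
    for f in small_files:  # remove the originals only after the
        f.unlink()         # compacted file is safely written
    return out

compact(part_dir)
n_files = len(list(part_dir.iterdir()))
```

Note the ordering: the compacted file is written before the small files are deleted, so a crash mid-compaction leaves the partition readable. A production system would also cap the output size (e.g., the 100MB target in Figure 6) and split into multiple compacted files when needed.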
Hot vs. cold partitions
Partitions that are queried frequently are considered hot partitions, while those that are rarely read are called cold partitions. In databases, hot partitions are usually the partitions containing recent data such as recent sales dates. Cold partitions often contain older data, which are less likely to be read.
Moreover, when the data gets old, it is usually queried in larger chunks such as by month or even by year. Here are a few examples that roughly categorize data from hot to cold:
- Hot: Data from the current week.
- Less hot: Data from previous weeks but in the current month.
- Cold: Data from previous months but in the current year.
- More cold: Data from last year and older.
To reduce the ambiguity between hot and cold data, we need to find answers to two questions. First, we need to quantify hot, less hot, cold, more cold, and perhaps even colder categories. Second, we need to consider how we can achieve fewer I/Os when reading cold data. We do not want to read 365 files, each representing a one-day partition of data, just to get last year’s sales revenue.
Hierarchical partitioning, illustrated in Figure 7, provides answers to the two questions above. Data for each day of the current week is stored in its own partition. Data from previous weeks of the current month are partitioned by week. Data from prior months in the current year are partitioned by month. Data that is even older is partitioned by year.
This model can be relaxed by defining an active partition in place of the current date partition. All data arriving after the active partition will be partitioned by date, whereas data before the active partition will be partitioned by week, month, and year. This allows the system to keep as many small recent partitions as necessary. Even though all examples in this post partition data by time, non-time partitioning will work similarly as long as you can define partition expressions and their hierarchy.
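One way to express such a hierarchy is a function that maps a row’s date to its partition key based on how old the row is relative to “today.” This is a sketch of one plausible policy in the spirit of Figure 7 (the tier boundaries and key formats are assumptions, not a fixed scheme):

```python
from datetime import date

def partition_key(day, today):
    """Map a row's date to a hierarchical partition key.

    Assumed policy: current ISO week -> daily partitions, rest of the
    current month -> weekly, rest of the current year -> monthly,
    older data -> yearly.
    """
    if day.isocalendar()[:2] == today.isocalendar()[:2]:
        return day.isoformat()                             # daily
    if (day.year, day.month) == (today.year, today.month):
        return f"{day.year}-W{day.isocalendar()[1]:02d}"   # weekly
    if day.year == today.year:
        return f"{day.year}-{day.month:02d}"               # monthly
    return str(day.year)                                   # yearly
```

With this mapping, a query for last year prunes down to a single yearly partition instead of hundreds of daily ones, while today’s data still lands in small daily partitions.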
Hierarchical partitioning reduces the number of partitions in the system, which makes the system easier to manage and reduces the number of partitions that must be read when querying larger, older chunks of data.
The query process for hierarchical partitioning is the same as for non-hierarchical partitioning: the same pruning strategy applies, reading only the relevant partitions. The ingestion and compaction processes will be a bit more complicated, because organizing the partitions in their defined hierarchy requires extra work.
Many organizations do not want to keep old data, but prefer instead to keep aggregations such as the number of orders and total sales of every product every month. This can be supported by aggregating the data and partitioning the aggregates by month. However, because the aggregate partitions store aggregated data, their schema will differ from that of non-aggregated partitions, which leads to extra work for ingesting and querying. There are different ways to manage this cold, aggregated data, but they are large topics suitable for a future post.