<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Nga Tran</title>
    <description>Posts by Nga Tran on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/nga-tran/</link>
    <language>en-us</language>
    <lastBuildDate>Thu, 14 Nov 2024 07:00:00 +0000</lastBuildDate>
    <pubDate>Thu, 14 Nov 2024 07:00:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Optimizing Queries in InfluxDB 3 Using Progressive Evaluation</title>
      <description>&lt;p&gt;In a &lt;a href="https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=query_optimization_progressive_evaluation_influxdb&amp;amp;utm_content=blog"&gt;previous post&lt;/a&gt;, we described the technique that makes the “most recent values” queries hundreds of times faster and has benefited many of our customers. The idea behind this technique is to progressively evaluate time-organized files until we reach the most recent values. Since then, we have received questions like “What queries support progressive evaluation?” “How do we verify that a query is progressively evaluated?” “Are there certain file organizations that progressive evaluation won’t help?” This blog post answers those questions.&lt;/p&gt;

&lt;h2 id="queries-that-support-progressive-evaluation"&gt;Queries that support progressive evaluation&lt;/h2&gt;

&lt;p&gt;Currently, this technique is only available for SQL queries; it is not yet applicable to InfluxQL queries. Your SQL query must include the clause &lt;code class="language-markup"&gt;ORDER BY time DESC&lt;/code&gt; (or &lt;code class="language-markup"&gt;ASC&lt;/code&gt;). In addition, the &lt;code class="language-markup"&gt;SELECT&lt;/code&gt; clause must not contain &lt;strong&gt;&lt;em&gt;expressions&lt;/em&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;em&gt;aliases&lt;/em&gt;&lt;/strong&gt; (including &lt;code class="language-markup"&gt;AT TIME ZONE&lt;/code&gt;), or &lt;strong&gt;&lt;em&gt;aggregations&lt;/em&gt;&lt;/strong&gt;. In other words, everything in the &lt;code class="language-markup"&gt;SELECT&lt;/code&gt; clause must be a simple table column.&lt;/p&gt;

&lt;h3 id="examples-of-supported-queries"&gt;Examples of supported queries&lt;/h3&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT host, temperature 
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time ASC;

SELECT host, temperature 
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time DESC
LIMIT  10;&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="examples-of-unsupported-queries"&gt;Examples of unsupported queries&lt;/h3&gt;

&lt;p&gt;These queries are not optimized using progressive evaluation yet. We hope to lift the restrictions in a future release.&lt;/p&gt;

&lt;p&gt;Query with an expression (&lt;code class="language-markup"&gt;temperature + 2&lt;/code&gt;)&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT host, temperature + 2
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time ASC;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Query with an alias (&lt;code class="language-markup"&gt;as host_name&lt;/code&gt;)&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT host as host_name, time 
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time ASC;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Query that converts the time zone and then aliases the result (&lt;code class="language-markup"&gt;AT TIME ZONE 'Europe/Oslo' as time&lt;/code&gt;)&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT host, time AT TIME ZONE ‘Europe/Oslo’ as time 
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time DESC
LIMIT  10;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Query with an aggregate (&lt;code class="language-markup"&gt;min(temperature)&lt;/code&gt;)&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT min(temperature)
FROM   machine 
WHERE  time &amp;gt; now - interval ‘1 day’ and region = ‘US’
ORDER BY time DESC
LIMIT  10;&lt;/code&gt;&lt;/pre&gt;
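&lt;p&gt;To make the rules above concrete, here is a toy sketch, in Python, of the eligibility check. This is purely illustrative and not InfluxDB's planner code; it only mirrors the conditions described in this section: every &lt;code class="language-markup"&gt;SELECT&lt;/code&gt; item must be a bare column, and the query must order on &lt;code class="language-markup"&gt;time&lt;/code&gt;.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-python"&gt;import re

# Toy sketch of the eligibility rules described above. This is NOT the
# actual InfluxDB planner code, only an illustration of the conditions.
def eligible_for_progressive_eval(select_items, order_by):
    """select_items: raw SELECT-list strings; order_by: e.g. 'time DESC'."""
    # Every SELECT item must be a simple column name: no expressions,
    # no aliases, no aggregates, no AT TIME ZONE.
    simple_column = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
    if not all(simple_column.match(item.strip()) for item in select_items):
        return False
    # The query must order on time, ascending or descending.
    return order_by.strip().lower() in ("time", "time asc", "time desc")

print(eligible_for_progressive_eval(["host", "temperature"], "time DESC"))   # True
print(eligible_for_progressive_eval(["min(temperature)"], "time DESC"))      # False&lt;/code&gt;&lt;/pre&gt;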

&lt;h2 id="signal-that-a-query-is-evaluated-progressively"&gt;Signal that a query is evaluated progressively&lt;/h2&gt;

&lt;p&gt;If &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; is in your query plan, it is optimized using progressive evaluation (see &lt;a href="https://www.influxdata.com/blog/how-read-influxdb-3-query-plans/"&gt;this post&lt;/a&gt; for how to get and read the query plan). However, the absence of progressive evaluation does not mean your query will run slowly. We only apply it when it actually benefits your query.&lt;/p&gt;

&lt;h2 id="file-organizations-that-benefit-from-progressive-evaluation"&gt;File organizations that benefit from progressive evaluation&lt;/h2&gt;

&lt;p&gt;To understand when progressive evaluation benefits your query, we first need to understand data organization, which is one of the most important factors affecting query performance.&lt;/p&gt;

&lt;h3 id="data-organization-in-influxdb-30"&gt;Data organization in InfluxDB 3.0&lt;/h3&gt;

&lt;p&gt;Below is a brief description of how data is organized in InfluxDB 3.0 (see our &lt;a href="https://www.influxdata.com/blog/influxdb-3-0-system-architecture/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=query_optimization_progressive_evaluation_influxdb&amp;amp;utm_content=blog"&gt;system architecture&lt;/a&gt; and &lt;a href="https://www.influxdata.com/blog/compactor-hidden-engine-database-performance/?utm_source=website&amp;amp;utm_medium=direct&amp;amp;utm_campaign=query_optimization_progressive_evaluation_influxdb&amp;amp;utm_content=blog"&gt;data compaction&lt;/a&gt; for the complete data cycle, how data is compacted, and how it benefits query performance).&lt;/p&gt;

&lt;p&gt;Because InfluxDB is a time series database, data in the table &lt;code class="language-markup"&gt;machine&lt;/code&gt; always includes a &lt;code class="language-markup"&gt;time&lt;/code&gt; column representing the time of an event, such as the temperature at 9:30 am UTC. Figure 1 shows three different stages of data organization. Each rectangle in the figure illustrates a chunk of data. &lt;strong&gt;C&lt;/strong&gt; represents data that is not yet persisted and usually includes the most recent values. &lt;strong&gt;L&lt;/strong&gt; indicates the level of a persisted file. &lt;strong&gt;L0&lt;/strong&gt; is used for files of newly ingested and persisted data. They are usually small and contain recent values; however, L0 files of backfilled data can be as old as desired. &lt;strong&gt;L1&lt;/strong&gt; files store the results of compacting many small L0 files. We also have &lt;strong&gt;L2&lt;/strong&gt; files, but they are beyond the scope of this topic and do not change how progressive evaluation works.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6qkOCQrnjhOT8zlx7yZj4w/81b28dfcb4df0a72ef72d1ea830b2ccb/Data_organization_in_InfluxDB_3-0.png" alt="Data organization in InfluxDB 3-0" /&gt;&lt;/p&gt;

&lt;p class="has-text-centered"&gt;Figure 1 (borrowed from the &lt;a href="/blog/compactor-hidden-engine-database-performance/"&gt;compaction blog post&lt;/a&gt;): Four stages of data organization after two rounds of compaction.&lt;/p&gt;

&lt;p&gt;In stage 1, all data are in small L0 files. In stage 2, the data from stage 1 has been compacted into larger L1 files, while some new data is persisted in a few small L0 files and some sits in a not-yet-persisted chunk (C). If you ingest new data most of the time, your data organization mostly looks like stage 2 or stage 3. However, if you backfill data, your data organization can combine stage 1 with stage 2 or stage 3. Thus, depending on how you ingest data and how well the compactor keeps up with your ingest workload, there may be few or many small, overlapping files. Stage 3 is what we call “well-compacted data” and is usually best for query performance. The goal of the compactor is to keep most of your data in stage 3. Avoiding frequent backfilling also helps keep your data well-compacted.&lt;/p&gt;

&lt;h3 id="application-of-progressive-evaluation-in-various-overlap-scenarios"&gt;Application of progressive evaluation in various overlap scenarios&lt;/h3&gt;

&lt;p&gt;Let’s go over examples of querying different data sets. Figure 2 shows the data organization of the table &lt;code class="language-markup"&gt;machine&lt;/code&gt;. We use F as the prefix for file names; each file can be either L0 or L1. Files F1, F2, F6, and F7 do not time-overlap with any other files. Files F3, F4, and F5 overlap with each other, and file F8 overlaps with F9, which overlaps with chunk C.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1jwYhroMEa2u7bjbHyPSgi/4924600d5db1a009e0aa4723f0fa7d63/Data_organization_of_table_machine.png" alt="Data organization of table machine" /&gt;&lt;/p&gt;

&lt;p class="has-text-centered"&gt;Figure 2: Data organization of table &lt;code class="language-markup"&gt;machine&lt;/code&gt;&lt;/p&gt;

&lt;h4 id="reading-non-overlapped-files-only"&gt;Reading Non-Overlapped Files Only&lt;/h4&gt;

&lt;p&gt;If your query asks for the latest data before t1:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT temperature 
FROM   machine 
WHERE  time &amp;lt; t1 and region = ‘US’
ORDER BY time DESC LIMIT  1;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The query will be optimized with progressive evaluation because the files needed, F1 and F2, do not overlap. The simplified query plan is as follows:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;ProgressiveEvalExec: fetch=1
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F2], [F1]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A few essential properties in the query plan are needed for &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; to work correctly:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Files in &lt;code class="language-markup"&gt;ParquetExec&lt;/code&gt; are sorted on time descending F2, F1.&lt;/li&gt;
  &lt;li&gt;&lt;code class="language-markup"&gt;preserve_partitioning=[true]&lt;/code&gt; means data of 2 file groups, [F2] and [F1], are sorted in their own group and won’t be merged. This is important for us to be able to fetch data from F2 before fetching F1.&lt;/li&gt;
  &lt;li&gt;&lt;code class="language-markup"&gt;fetch=1&lt;/code&gt; means the query will stop running as soon as it gets a row that meets the query filters. In other words, if there is at least one row in file F2 with &lt;code class="language-markup"&gt;US&lt;/code&gt; as a region, F1 will never be read.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The same query sorted on time &lt;strong&gt;ascending&lt;/strong&gt; will look like this:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;ProgressiveEvalExec: fetch=1
    SortExec: TopK(fetch=1), expr=[time ASC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F1], [F2]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You will get a similar query plan if your query reads data between t2 and t3, which includes only non-overlapped files.&lt;/p&gt;
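&lt;p&gt;The behavior those three properties produce can be mimicked with a small toy model (illustrative Python only; the real operator is part of the query engine): consume already-sorted, non-overlapping streams in order and stop as soon as enough rows pass the filter.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-python"&gt;# Toy model of ProgressiveEvalExec (an illustration, not the real operator):
# consume streams that are already sorted newest-first, one by one, and stop
# as soon as `fetch` rows pass the filter, so later streams may never be read.
def progressive_eval(streams, predicate, fetch):
    out = []
    for stream in streams:                  # e.g. rows of F2, then rows of F1
        for row in stream:
            if predicate(row):
                out.append(row)
                if len(out) == fetch:
                    return out              # F1 is never touched if F2 suffices
    return out

# Hypothetical rows standing in for files F2 (newer) and F1 (older).
f2 = [{"time": 8, "region": "US", "temp": 71.0},
      {"time": 7, "region": "EU", "temp": 18.0}]
f1 = [{"time": 3, "region": "US", "temp": 70.0}]
print(progressive_eval([f2, f1], lambda r: r["region"] == "US", 1))
# [{'time': 8, 'region': 'US', 'temp': 71.0}]&lt;/code&gt;&lt;/pre&gt;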

&lt;h4 id="reading-overlapped-files-only"&gt;Reading Overlapped Files Only&lt;/h4&gt;

&lt;p&gt;Now let’s look at the query reading data between t1 and t2.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT temperature 
FROM   machine 
WHERE  time &amp;lt; t2 and time &amp;gt; t1 and region = ‘US’ 
ORDER BY time DESC LIMIT  1;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because all three needed files, F3, F4, and F5, overlap, we need to merge data for &lt;a href="https://www.influxdata.com/blog/using-deduplication-eventually-consistent-transactions/"&gt;deduplication&lt;/a&gt; and they cannot be evaluated one by one progressively. The simplified query plan will look like this:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[false]
  DeduplicateExec:
     SortPreservingMergeExec:
         ParquetExec: file_groups={3 groups: [F4], [F3], [F5]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Without &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt;, the files in &lt;code class="language-markup"&gt;ParquetExec&lt;/code&gt; can be in any order and group because the groups will be read in parallel and merged into one stream for deduplication.&lt;/p&gt;

&lt;p&gt;Similarly, progressive evaluation won’t be applied if your query reads overlapped data after t3.&lt;/p&gt;
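&lt;p&gt;For intuition, the &lt;code class="language-markup"&gt;SortPreservingMergeExec&lt;/code&gt; plus &lt;code class="language-markup"&gt;DeduplicateExec&lt;/code&gt; pair can be pictured with this toy sketch (illustrative Python; real deduplication keys on the full series and keeps the most recently written row, while this toy keys on timestamp only and lets the first-listed stream win ties):&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-python"&gt;from heapq import merge

# Toy sketch of the SortPreservingMergeExec plus DeduplicateExec pair:
# merge time-sorted overlapping streams into one stream, then keep a single
# row per timestamp. (Real deduplication keys on the full series and keeps
# the most recently written row; here the first-listed stream wins ties.)
def merge_and_dedupe(streams):
    """streams: lists of (time, value), each sorted by time descending."""
    merged = merge(*streams, key=lambda r: r[0], reverse=True)
    out, seen = [], set()
    for time, value in merged:
        if time not in seen:
            seen.add(time)
            out.append((time, value))
    return out

f3 = [(31, "a"), (25, "b")]
f4 = [(30, "c"), (25, "d")]   # time 25 duplicates a row in F3
f5 = [(28, "e")]
print(merge_and_dedupe([f3, f4, f5]))
# [(31, 'a'), (30, 'c'), (28, 'e'), (25, 'b')]&lt;/code&gt;&lt;/pre&gt;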

&lt;h4 id="reading-a-mixture-of-non-overlapped-and-overlapped-files"&gt;Reading a Mixture of Non-Overlapped and Overlapped Files&lt;/h4&gt;

&lt;p&gt;When your query reads a mixture of non-overlapped and overlapped data, progressive evaluation is applied, and the data is split and grouped accordingly. Let’s look at a query that reads data before t2.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT temperature 
FROM   machine 
WHERE  time &amp;lt; t2 and region = ‘US’ 
ORDER BY time DESC LIMIT  1;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The simplified query plan will look like this:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;ProgressiveEvalExec: fetch=1
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[false]
       DeduplicateExec:
          SortPreservingMergeExec:
              ParquetExec: file_groups={3 groups: [F4], [F3], [F5]}
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F2], [F1]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Even though files F3, F4, and F5 overlap, they do not overlap with F1 and F2 and contain more recent data. Therefore, the &lt;strong&gt;subplan&lt;/strong&gt; of F3, F4, and F5 is progressively evaluated with F2 and F1. Note that the number of input streams into &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; is three: one for the merge of F3, F4, and F5, one for F2, and one for F1. These three streams are evaluated progressively in that order.&lt;/p&gt;

&lt;p&gt;If the query is sorted &lt;strong&gt;ascending&lt;/strong&gt;, the progressive order will be opposite:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;ProgressiveEvalExec: fetch=1
    SortExec: TopK(fetch=1), expr=[time ASC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F1], [F2]}
    SortExec: TopK(fetch=1), expr=[time ASC], preserve_partitioning=[false]
       DeduplicateExec:
          SortPreservingMergeExec:
              ParquetExec: file_groups={3 groups: [F4], [F3], [F5]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Three streams still go into &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; but in the opposite order: F1, F2, and the F3, F4, and F5 merge.&lt;/p&gt;

&lt;p&gt;Similarly, let’s read all data:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;SELECT temperature 
FROM   machine 
WHERE  time &amp;lt; now and region = ‘US’ 
ORDER BY time DESC LIMIT  1;&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The query plan gets more complicated but follows the same rules: subplans of overlapped data are put in the right order and progressively evaluated along with non-overlapped files.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-sql"&gt;ProgressiveEvalExec: fetch=1
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[false]
       DeduplicateExec:
          SortPreservingMergeExec:
              SortExec:
                 RecordBatchExec: {C}
              ParquetExec: file_groups={2 groups: [F8], [F9]}
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F7], [F6]}
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[false]
       DeduplicateExec:
          SortPreservingMergeExec:
              ParquetExec: file_groups={3 groups: [F4], [F3], [F5]}
    SortExec: TopK(fetch=1), expr=[time DESC], preserve_partitioning=[true]   
         ParquetExec: file_groups={2 groups: [F2], [F1]}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;There are six non-overlapped streams of data progressively evaluated by &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; in this order:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The merge of C, F8 and F9&lt;/li&gt;
  &lt;li&gt;F7&lt;/li&gt;
  &lt;li&gt;F6&lt;/li&gt;
  &lt;li&gt;The merge of F3, F4 and F5&lt;/li&gt;
  &lt;li&gt;F2&lt;/li&gt;
  &lt;li&gt;F1&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If your query orders data on time ascending, you will have a similar query plan but in the opposite order.&lt;/p&gt;
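&lt;p&gt;Ordering the streams for &lt;code class="language-markup"&gt;ProgressiveEvalExec&lt;/code&gt; then amounts to sorting the non-overlapping units (single files, or merged groups treated as one unit) by their time ranges. A toy sketch, with made-up time ranges chosen to mimic Figure 2:&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-python"&gt;# Toy sketch: order non-overlapping units (single files, or merged groups of
# overlapping files treated as one unit) for ProgressiveEvalExec. DESC visits
# the newest unit first; ASC is simply the reverse. Time ranges are made up.
def progressive_order(units, descending=True):
    """units: list of (label, min_time, max_time); ranges must not overlap."""
    newest_first = sorted(units, key=lambda u: u[2], reverse=True)
    if descending:
        return [u[0] for u in newest_first]
    return [u[0] for u in reversed(newest_first)]

units = [("F1", 0, 9), ("F2", 10, 19), ("merge(F3,F4,F5)", 20, 45),
         ("F6", 46, 55), ("F7", 56, 65), ("merge(C,F8,F9)", 66, 90)]
print(progressive_order(units, descending=True))
# ['merge(C,F8,F9)', 'F7', 'F6', 'merge(F3,F4,F5)', 'F2', 'F1']&lt;/code&gt;&lt;/pre&gt;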

&lt;h2 id="cache-implications-progressive-evaluation-may-help-other-queries-latency"&gt;Cache implications: progressive evaluation may help other queries’ latency&lt;/h2&gt;

&lt;p&gt;For workloads that depend heavily on cached files for the lowest possible latency but run near the cache limit, progressive evaluation can make a significant performance difference. Obviously, not traversing extra Parquet files reduces CPU time for the optimized query. Moreover, when the query’s time bound is much wider than what the LIMIT actually needs, the database may bring far fewer files into the cache, so other queries’ files need not be evicted, potentially reducing the latency of other queries on the system.&lt;/p&gt;

&lt;h2 id="summing-up"&gt;Summing up&lt;/h2&gt;

&lt;p&gt;If your query selects only plain table columns and orders data on time, InfluxDB 3.0 will automatically use progressive evaluation to improve your query performance, even if only a subset of the queried data is in non-overlapped files. Progressive evaluation cannot be used when all your data overlaps.&lt;/p&gt;
</description>
      <pubDate>Thu, 14 Nov 2024 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/query-optimization-progressive-evaluation-influxdb/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/query-optimization-progressive-evaluation-influxdb/</guid>
      <category>Developer</category>
      <author>Nga Tran, Reid Kaufmann (InfluxData)</author>
    </item>
    <item>
      <title>Making Most Recent Value Queries Hundreds of Times Faster</title>
      <description>&lt;p&gt;This post explains how databases optimize queries, which can result in queries running hundreds of times faster. While we focus on one specific query type that is important to InfluxDB 3, the optimization process we describe is the same for any database.&lt;/p&gt;

&lt;h1 id="optimizing-a-query-is-like-playing-with-lego"&gt;Optimizing a query is like playing with Lego&lt;/h1&gt;

&lt;p&gt;You can come up with different structures when playing with the same set of Lego pieces, as shown in Figure 1. While you often use the same basic bricks to build whatever structure you want, there are times when you need a different type of shape (e.g., tiny star) for a specific project.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2RtVIx96OBI6OvTvsRxNbn/48174cf3d415f81308ffb55676cd35b8/lego-structure.jpg" alt="lego-structure" /&gt;&lt;/p&gt;

&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 1: Two different structures built from the same basic Lego squares and rectangles&lt;/p&gt;

&lt;p&gt;In a database, running a query means running a &lt;strong&gt;query plan,&lt;/strong&gt; a tree of different operators that process and stream data. Each operator is like a Lego brick: depending on how they are connected, they compute the same result but with different performance. Much of query optimization involves swapping or moving existing operators around to form a better query plan, but on rare occasions, a new special-case operator is needed to do the job better.&lt;/p&gt;

&lt;p&gt;Let’s walk through an example of optimizing a query by creating a specialized operator and recombining existing operators to form a query plan with superior performance.&lt;/p&gt;

&lt;h1 id="querying-the-most-recent-values"&gt;Querying the most recent value(s)&lt;/h1&gt;

&lt;p&gt;As a time series database, one common use case is managing signal data from many devices. A common question is: “&lt;em&gt;What is the signal last sent by a specified device (e.g., device number 10)?”&lt;/em&gt;  The answer to this question (or variations of it) is often used to drive a UI or monitoring dashboard. Using SQL, a query that can answer this question is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT   …
FROM     signal
WHERE    device = 10
       AND time BETWEEN now() - interval 'X days' and now()
ORDER BY time DESC
LIMIT    1;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The filter &lt;code&gt;time BETWEEN now() - interval 'X days' and now()&lt;/code&gt; narrows down the question a bit: “&lt;em&gt;What is the signal last sent by device number 10 for the last X days?”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;While this query is simple, actual queries can be more complicated, such as “find the average over the last five values,” so our solution must be able to handle these more general queries as well.&lt;/p&gt;

&lt;p&gt;It is also important that these queries return results in milliseconds, because every device owner requests values frequently. One challenge is the owner does not know when the last signal happened—it could be five minutes ago or several months ago. Thus, the value of time range &lt;code&gt;X&lt;/code&gt; can be very large and the query runtime long. Unless users take great care writing their query, it will read and process substantial data, increasing the query return time.&lt;/p&gt;

&lt;p&gt;Unlike traditional relational or time series databases, InfluxDB 3.0 stores data in &lt;a href="https://parquet.apache.org/"&gt;Parquet&lt;/a&gt; files rather than custom file formats and specialized indexes. Our mission was to make this class of queries run in milliseconds regardless of how large the time range X is without introducing special indexes.&lt;/p&gt;

&lt;h2 id="runtimes-before-and-after-improvements"&gt;Runtimes before and after improvements&lt;/h2&gt;

&lt;p&gt;Before explaining our approach, let us look at the results in Figure 2. Blue represents the normalized runtimes of queries over different time ranges before the improvements, and green represents the runtimes after. Queries time out after running for 30 units, so the actual runtimes of queries that reached 30 units were even higher. As the chart shows, our improvements made large-time-range queries run hundreds of times faster and brought the runtimes of all queries down to the level requested by our customers.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7IfaM65vUEb0rJPNqMInvW/f0a9ccc91b9ca6af6f2dc80c674a47e9/Before-after.jpg" alt="Before-after" width="650" height="auto" /&gt;&lt;/p&gt;

&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 2:  Query runtimes of before and after Improvements&lt;/p&gt;

&lt;p&gt;Let’s move on to how we achieved this.&lt;/p&gt;

&lt;h2 id="query-plan-before-improvements"&gt;Query plan before improvements&lt;/h2&gt;

&lt;p&gt;Figure 3 shows a simplified version of the query plan before the improvements using a &lt;strong&gt;sort merge&lt;/strong&gt; &lt;strong&gt;algorithm&lt;/strong&gt;. We read a query plan from the bottom up. The input includes four files that four corresponding &lt;strong&gt;scan&lt;/strong&gt; operators read in parallel. Each scan output goes through a corresponding &lt;strong&gt;sort&lt;/strong&gt; operator that orders the data by descending timestamp. Four sorted output streams are sent to a &lt;strong&gt;merge&lt;/strong&gt; operator that combines them into a single sorted stream and stops after the number of limit rows, which is &lt;code&gt;1&lt;/code&gt; in this example. There are many more files in the &lt;code&gt;signal&lt;/code&gt; table, but InfluxDB first prunes unnecessary files based on the filters of the query.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/7KskmsN8d9l2cvMnrmUcqg/90c20bd031cb0851c9f47535b9864689/Query_plan_using_sort_merge_algorithm.png" alt="Query plan using sort merge algorithm" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 3: Query plan using sort merge algorithm&lt;/p&gt;

&lt;p&gt;When files overlap, InfluxDB may need to &lt;a href="https://www.influxdata.com/blog/using-deduplication-eventually-consistent-transactions/"&gt;deduplicate&lt;/a&gt; data. Figure 4 shows a more accurate plan that sorts data of overlapped files, File 2 and File 3, together and deduplicates them before sending data to the sort and merge operators.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/7FP49Aj7Z2w6eXcXUupC53/a8840af34737b625e0d7d9613ac83b4c/InfluxDB_query_plan_using_sort_merge_algorithm_but_grouping_overlapped_files_first.png" alt="InfluxDB query plan using sort merge algorithm but grouping overlapped files first" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 4: InfluxDB query plan using sort merge algorithm but grouping overlapped files first&lt;/p&gt;

&lt;p&gt;The optimization described in the next section only depends on the operators at the top of the plan, and thus, the simplified plan in Figure 5 more clearly illustrates the solution. Note that we omitted many other details of the plan—for example, the Sort operator does not sort the &lt;em&gt;entire&lt;/em&gt; file but simply retains the &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.TopK.html#background"&gt;Top “K” rows&lt;/a&gt; due to the &lt;code&gt;LIMIT 1&lt;/code&gt; in the query.&lt;/p&gt;

&lt;p&gt;Note that the data streams going into the merge operator do not overlap.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/3YzjJQXB2uW2ZJEh6N5T7F/5c62e5bbeb56e3dd0988be2cab18bcd6/Figure-5.png" alt="Figure-5" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 5: The top part of the plan that includes non-overlapped data streams to merge operator&lt;/p&gt;

&lt;h2 id="analyzing-the-plan-and-identifying-improvements"&gt;Analyzing the plan and identifying improvements&lt;/h2&gt;

&lt;p&gt;Normally, when merging multiple streams, all inputs must be known before producing any output (as the first row might come from any of the inputs). For the plan above, this implies that we must read and sort all the input streams. However, if we know the time ranges of the streams do not overlap, we can simply read and sort the streams one by one, stopping once we find the required number of rows. Not only is this less work than merging the data, but if the number of required rows is small, it is likely that only a single stream must be read.&lt;/p&gt;

&lt;p&gt;Thankfully, InfluxDB has statistics about the time ranges of the data in each file before reading them and groups overlapped files to produce non-overlapped streams, as shown in Figure 5. So, we can apply this observation to make a faster query plan without additional indexes or statistics. However, reading streams one by one and stopping when the limit is hit is no longer a &lt;strong&gt;merge&lt;/strong&gt;. We needed a new operator.&lt;/p&gt;
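&lt;p&gt;A toy comparison (illustrative Python, not a benchmark) makes the difference in work concrete: a blind merge has to scan every input stream, while reading non-overlapping streams one by one can stop after the very first matching row.&lt;/p&gt;

&lt;pre class=""&gt;&lt;code class="language-python"&gt;# Toy comparison (not a benchmark): count the rows each strategy touches to
# answer a "latest row" question over non-overlapping, time-sorted streams.
def full_merge_rows_read(streams):
    # A merge must consume every input before it can emit the first row
    # when nothing is known about the streams' time ranges.
    return sum(len(s) for s in streams)

def progressive_rows_read(streams, limit=1):
    rows_read = 0
    for stream in streams:        # streams ordered newest time range first
        for _ in stream:
            rows_read += 1
            if rows_read == limit:
                return rows_read  # the remaining streams are never opened
    return rows_read

streams = [[9, 8, 7], [6, 5], [4, 3, 2, 1]]
print(full_merge_rows_read(streams), progressive_rows_read(streams))
# 9 1&lt;/code&gt;&lt;/pre&gt;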

&lt;h2 id="new-query-plan"&gt;New query plan&lt;/h2&gt;

&lt;p&gt;With the observations above in mind, Figure 6 illustrates the new query plan:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The non-overlapped data streams are sorted by time, descending.&lt;/li&gt;
  &lt;li&gt;A new operator, &lt;strong&gt;ProgressiveEval&lt;/strong&gt;, replaces the &lt;strong&gt;merge&lt;/strong&gt; operator.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The new ProgressiveEval operator pulls data from its input streams sequentially and stops when it reaches the requested limit. The big difference between the &lt;strong&gt;ProgressiveEval&lt;/strong&gt; and &lt;strong&gt;merge&lt;/strong&gt; operators is that the &lt;strong&gt;merge&lt;/strong&gt; operator can only start merging data &lt;strong&gt;after all&lt;/strong&gt; its input &lt;strong&gt;sort&lt;/strong&gt; operators complete, while &lt;strong&gt;ProgressiveEval&lt;/strong&gt; can start pulling data immediately after the &lt;strong&gt;first sort&lt;/strong&gt; operator finishes.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/2HpuKw5qq5D2hmwxBpw9lS/9b247c9b8fecb148c617285f7ae72b23/Figure-6.png" alt="Figure-6" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 6: Optimized Query plan reads data progressively, stopping early when the limit is reached&lt;/p&gt;

&lt;p&gt;When the query plan in Figure 6 executes, it only runs the operators shown in Figure 7. InfluxDB has a pull-based executor (&lt;a href="https://docs.rs/datafusion/latest/datafusion/index.html#execution"&gt;based on Apache Arrow DataFusion’s Execution&lt;/a&gt;), which means that when &lt;strong&gt;ProgressiveEval&lt;/strong&gt; starts, it will ask the first &lt;strong&gt;sort,&lt;/strong&gt; which in turn asks its inputs for data. The &lt;strong&gt;sort&lt;/strong&gt; then performs the sort and sends the sorted results up to ProgressiveEval.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/pkPxbYrsfCye8tK8TY8MN/ee1c6e11d732414505eab89b87f5b365/Figure-7.png" alt="Figure-7" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 7: The &lt;b&gt;needed&lt;/b&gt; execution operators if the latest signal of the specified device is in the latest-time-range file&lt;/p&gt;

&lt;p&gt;Due to the &lt;code&gt;device = 10&lt;/code&gt; filter, the query filters data while scanning, and we do not know which file contains the latest signal of device number 10. In addition, because it takes time for each &lt;strong&gt;sort&lt;/strong&gt; operator to complete, when &lt;strong&gt;ProgressiveEval&lt;/strong&gt; pulls data from a stream, it also starts executing the &lt;em&gt;next&lt;/em&gt; stream to prefetch data that is necessary if the first stream doesn’t contain the desired rows.&lt;/p&gt;

&lt;p&gt;Figure 8 shows that when pulling data from Stream 1 of the first &lt;strong&gt;Sort&lt;/strong&gt;, the second &lt;strong&gt;Sort&lt;/strong&gt; executes simultaneously so that data from Stream 2 is ready if Stream 1 does not include the requested data from device number 10. If the data from device number 10 is in Stream 1, &lt;strong&gt;ProgressiveEval&lt;/strong&gt; stops as soon as it hits the limit and cancels Stream 2. If &lt;strong&gt;ProgressiveEval&lt;/strong&gt; pulls data from Stream 2, it also begins pre-executing the &lt;strong&gt;Sort&lt;/strong&gt; of Stream 3, and so on.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding-top: 30px;" src="//images.ctfassets.net/o7xu9whrs0u9/1kcqJqhvYHLRyizKiDF397/bac21e45feb01dc60c681dd0499e2dfa/Figure-8.png" alt="Figure-8" /&gt;&lt;/p&gt;
&lt;p class="is-italic has-text-centered" style="font-size: 16px;"&gt;Figure 8: The &lt;b&gt;actual&lt;/b&gt; execution operators if the latest signal of the specified device is in the latest-time-range file&lt;/p&gt;

&lt;h2 id="analyzing-the-benefits-of-the-improvements"&gt;Analyzing the benefits of the improvements&lt;/h2&gt;

&lt;p&gt;Let’s compare the original plan in Figure 5 and the optimized plan in Figure 8:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Returns Results Faster&lt;/strong&gt;: The original plan must scan and sort all files that may contain data, regardless of the number of rows needed, before producing results. Thus, the longer the time range, the more files there are to read, and the slower the original plan is. This explains why our results show the largest improvements for the longest time ranges.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fewer Resources and Improved Concurrency&lt;/strong&gt;: In addition to producing data more quickly, the optimized plan requires far less memory and CPU—it typically scans and sorts only two files (the most recent one, while prefetching the next most recent one). This means more queries can run concurrently with the same resources.&lt;/li&gt;
&lt;/ol&gt;

&lt;h1 id="type-of-queries-that-benefit-from-this-work"&gt;Type of queries that benefit from this work&lt;/h1&gt;

&lt;p&gt;At this time (March 2024), this optimization only works on one type of query, &lt;strong&gt;“What are the most/least recent values&lt;/strong&gt; …?” In other words, the SQL of the query must include &lt;code&gt;ORDER BY time DESC/ASC LIMIT n&lt;/code&gt; where ‘&lt;em&gt;n&lt;/em&gt;’ can be any number and the time can be ordered ascending or descending. All other supported SQL queries will work but may not benefit from this optimization. We continue to work on improving them.&lt;/p&gt;
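
&lt;p&gt;Concretely, a qualifying query follows this general pattern (the table and column names are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT   col1, col2, time
FROM     my_table
ORDER BY time DESC  -- or ASC
LIMIT    10;
&lt;/code&gt;&lt;/pre&gt;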

&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;The optimization not only makes the most recent value queries faster but also reduces resource usage and increases the concurrency level of the system. In general, if a query plan includes &lt;strong&gt;sort merge&lt;/strong&gt; on &lt;strong&gt;potentially non-overlapped data streams&lt;/strong&gt;, this optimization is applicable. We have found many query plans in this category and are working on improving them.&lt;/p&gt;

&lt;p&gt;We would like to thank Paul Dix for suggesting this design based on the progressive scan behavior of Elastic in the ELK stack.&lt;/p&gt;
</description>
      <pubDate>Mon, 18 Mar 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/making-recent-value-queries-hundreds-times-faster/</guid>
      <category>Developer</category>
      <author>Nga Tran, Andrew Lamb (InfluxData)</author>
    </item>
    <item>
      <title>How to Read InfluxDB 3 Query Plans</title>
      <description>&lt;p&gt;This blog post explains how to read a query plan in InfluxDB 3 and requires basic knowledge of &lt;a href="https://www.influxdata.com/blog/influxdb-3-0-system-architecture/"&gt;InfluxDB 3 System Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/products/influxdb-overview/"&gt;InfluxDB 3&lt;/a&gt; supports two query languages: SQL and InfluxQL. The database executes a query written in either SQL or InfluxQL according to the instructions of a &lt;strong&gt;query plan&lt;/strong&gt;. To see the plan without running the query, add the keyword &lt;code&gt;EXPLAIN&lt;/code&gt; in front of your query as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;EXPLAIN 
SELECT   city, min_temp, time 
FROM     temperature 
ORDER BY city ASC, time DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The output will look like this:
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/33278ca707ad45aa82be47f0aa95f35f/9fb26989df6f13d72e2177d939c52d4c/Reading_Query_Plans_Diagram_1_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
&lt;em&gt;Figure 1: A simplified output of a query plan&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are two types of plans: the logical plan and the physical plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical Plan:&lt;/strong&gt; This is a plan generated for a specific SQL or InfluxQL query without knowledge of the underlying data organization or the cluster configuration. Because InfluxDB 3 is built on top of &lt;a href="https://github.com/apache/arrow-datafusion"&gt;DataFusion&lt;/a&gt;, a logical plan is very similar to what you would see with any data format or storage in DataFusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical Plan:&lt;/strong&gt; This is a plan generated from a query’s corresponding logical plan plus the cluster configuration (e.g., number of CPUs) and underlying data organization (e.g., number of files, the layout of data in the files, etc.) information. The physical plan is specific to your data and InfluxDB cluster configuration. If you load the same data to different clusters with different configurations, the same query may generate different physical query plans. Similarly, running the same query on the same cluster at different times can have a different plan depending on your data at that time.&lt;/p&gt;

&lt;p&gt;Understanding a query plan can help explain why the query is slow. For example, if the plan shows that your query reads many files, you can add more filters to reduce the amount of data it needs to read or modify your cluster configuration/design to create fewer but larger files. This document focuses on how to read a query plan. Techniques for making a query run faster depend on the reason(s) it is slow and are beyond the scope of this blog post.&lt;/p&gt;

&lt;h2 id="a-query-plan-is-a-tree"&gt;A query plan is a tree&lt;/h2&gt;

&lt;p&gt;A query plan is an upside-down tree and should be read from the bottom up. In tree format, we can represent the physical plan of Figure 1 in the following way:&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3da21dd56a8c46c3b40549e65574b215/263550b96e1efae93de09d7526417fd2/Reading_Query_Plans_Diagram_2_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
&lt;em&gt;Figure 2: The tree structure of physical plan in Figure 1&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The name of each node in the tree ends with &lt;code&gt;Exec&lt;/code&gt; to indicate an &lt;code&gt;ExecutionPlan&lt;/code&gt; that processes, transforms, and sends data to the next level of the tree. First, two &lt;code&gt;ParquetExec&lt;/code&gt; nodes read &lt;a href="https://www.influxdata.com/glossary/apache-parquet/"&gt;Parquet&lt;/a&gt; files in parallel, and each node outputs a stream of data to its corresponding &lt;code&gt;SortExec&lt;/code&gt; node. The &lt;code&gt;SortExec&lt;/code&gt; nodes are responsible for sorting the data by &lt;code&gt;city&lt;/code&gt; ascending and &lt;code&gt;time&lt;/code&gt; descending. The &lt;code&gt;UnionExec&lt;/code&gt; node combines the sorted outputs from the two &lt;code&gt;SortExec&lt;/code&gt; nodes, which are then (sort) merged by the &lt;code&gt;SortPreservingMergeExec&lt;/code&gt; node to return the sorted data.&lt;/p&gt;

&lt;h2 id="how-to-understand-a-large-query-plan"&gt;How to understand a large query plan&lt;/h2&gt;

&lt;p&gt;A large query plan may look intimidating, but if you follow these steps, you can quickly understand what the plan does.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;As always, read from the bottom up, one &lt;code&gt;Exec&lt;/code&gt; node at a time.&lt;/li&gt;
  &lt;li&gt;Understand the job of each &lt;code&gt;Exec&lt;/code&gt; node. Most of this information is available in the &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/index.html"&gt;DataFusion Physical Plan documentation&lt;/a&gt; or directly from &lt;a href="https://github.com/apache/arrow-datafusion"&gt;its repo&lt;/a&gt;. The &lt;code&gt;ExecutionPlans&lt;/code&gt; that are not in the DataFusion docs are InfluxDB specific—more information is available in this &lt;a href="https://github.com/influxdata/influxdb"&gt;InfluxDB repo&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Recall what the input data of the &lt;code&gt;Exec&lt;/code&gt; node looks like and how large/small it may be.&lt;/li&gt;
  &lt;li&gt;Consider how much data that &lt;code&gt;Exec&lt;/code&gt; node may send out and what it would look like.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Using these steps, you can estimate how much work a plan has to do. However, the &lt;code&gt;EXPLAIN&lt;/code&gt; command shows you the plan without executing it. If you want to know exactly how long a plan and each of its ExecutionPlans take to execute, you need other tools.&lt;/p&gt;

&lt;h2 id="tools-that-show-the-exact-runtime-for-each-executionplan"&gt;Tools that show the exact runtime for each ExecutionPlan&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Run &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; to print out an ‘explain plan’ (see Figure 1) annotated with execution counters and information such as runtime and rows produced.&lt;/li&gt;
  &lt;li&gt;There are other tools, such as distributed tracing with Jaeger, which we will describe in a future post.&lt;/li&gt;
&lt;/ol&gt;
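
&lt;p&gt;For example, to collect these runtime counters for the query from Figure 1, prefix it with &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; instead of &lt;code&gt;EXPLAIN&lt;/code&gt; (note that, unlike &lt;code&gt;EXPLAIN&lt;/code&gt;, this does execute the query):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;EXPLAIN ANALYZE
SELECT   city, min_temp, time
FROM     temperature
ORDER BY city ASC, time DESC;
&lt;/code&gt;&lt;/pre&gt;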

&lt;h2 id="more-information-for-debugging"&gt;More information for debugging&lt;/h2&gt;

&lt;p&gt;If the plan has to read many files, the &lt;code&gt;EXPLAIN&lt;/code&gt; report will not show all of them. To see all files, use &lt;code&gt;EXPLAIN VERBOSE.&lt;/code&gt; Like &lt;code&gt;EXPLAIN,&lt;/code&gt; &lt;code&gt;EXPLAIN VERBOSE&lt;/code&gt; does not run the query and won’t tell you the runtime. Instead, you get all information omitted from the &lt;code&gt;EXPLAIN&lt;/code&gt; report and all intermediate physical plans that the InfluxDB 3.0 querier and DataFusion generate before returning the final physical plan. This is very helpful for debugging because you can see when the plan adds or removes an &lt;code&gt;ExecutionPlan&lt;/code&gt; and what InfluxDB and DataFusion are doing to optimize your query.&lt;/p&gt;

&lt;h2 id="example-of-a-typical-plan-for-leading-edge-data"&gt;Example of a typical plan for leading-edge data&lt;/h2&gt;

&lt;p&gt;Let’s delve into an example that covers typical ExecutionPlans as well as InfluxDB-specific ones on leading-edge data.&lt;/p&gt;

&lt;h3 id="data-organization"&gt;Data organization&lt;/h3&gt;

&lt;p&gt;To make it easier to explain the plan below, Figure 3 shows the data organization that the plan reads. Once you get used to reading query plans, you can figure this out from the plan itself. Some details to note:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;There may be more data in the system. This is just the data the query reads after applying the predicate of the query to prune out-of-bounds partitions.&lt;/li&gt;
  &lt;li&gt;Recently received data is being ingested and isn’t yet persisted. In the plan, the &lt;code&gt;RecordBatchesExec&lt;/code&gt; represents data from the ingester not yet persisted to Parquet files.&lt;/li&gt;
  &lt;li&gt;Four Parquet files are retrieved from storage and are represented by two &lt;code&gt;ParquetExec&lt;/code&gt; nodes containing two files each:
    &lt;ul&gt;
      &lt;li&gt;In the first node, two files, &lt;code&gt;file_1&lt;/code&gt; and &lt;code&gt;file_2,&lt;/code&gt; do not overlap in &lt;strong&gt;time&lt;/strong&gt; with any other files and do not have any duplicated data. Data within a file never has duplicates, so deduplication is never necessary for non-overlapped files.&lt;/li&gt;
      &lt;li&gt;In the second node, two files, &lt;code&gt;file_3&lt;/code&gt; and &lt;code&gt;file_4,&lt;/code&gt; overlap with each other and with the ingesting data represented by the &lt;code&gt;RecordBatchesExec.&lt;/code&gt;&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/04d5ba6b62184d5fa5e36469056d3297/cfb0069da4bc6bbfe9640a8b4fc0b8bd/Reading_Query_Plans_Diagram_3_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
&lt;em&gt;Figure 3: Data of the query plan in Figure 4&lt;/em&gt;&lt;/p&gt;

&lt;h3 id="query-and-query-plan"&gt;Query and query plan&lt;/h3&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;EXPLAIN 
SELECT city, count(1) 
FROM   temperature 
WHERE  time &amp;gt;= to_timestamp(200) AND time &amp;lt; to_timestamp(700) 
AND state = 'MA' 
GROUP BY city 
ORDER BY city ASC;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/d42e630fcd2a4471bb1a916497eed3fc/c0ccdda7dfb5befb136f78bd72009368/Reading_Query_Plans_Diagram_4_01.17.24v1.png" alt="Reading Query Plans Diagram 4_01.17.24v1.png" /&gt;
&lt;em&gt;Figure 4: A typical query plan of leading-edge (most recent) data. Note: The colors in the left column correspond to the figures below.&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="reading-logical-plan"&gt;Reading logical plan&lt;/h4&gt;

&lt;p&gt;The logical plan in Figure 5 shows that the table scan occurs first and that the query predicates then filter the data. Next, the plan aggregates the data to compute the count of rows per city. Finally, the plan sorts and returns the data.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/d136a9c96a2c409088700da24c91bc54/20a51437bf1b0396679c5e395dbd0335/Reading_Query_Plans_Diagram_5_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 5: Logical plan from Figure 4&lt;/em&gt;&lt;/p&gt;

&lt;h4 id="reading-physical-plan"&gt;Reading physical plan&lt;/h4&gt;

&lt;p&gt;Let us begin reading from the bottom up. The bottom or leaf nodes are always either &lt;code&gt;ParquetExec&lt;/code&gt; or &lt;code&gt;RecordBatchesExec&lt;/code&gt;. There are three of them in this plan, so let’s go over them one by one.&lt;/p&gt;

&lt;h4 id="the-three-bottom-leaves-consist-of-two-parquetexec-nodes-and-one-recordbatchesexec-node"&gt;The three bottom leaves consist of two &lt;code&gt;ParquetExec&lt;/code&gt; nodes and one &lt;code&gt;RecordBatchesExec&lt;/code&gt; node.&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;First &lt;code&gt;ParquetExec&lt;/code&gt;&lt;/strong&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/443d8e6fbe604092a42f5006be532c8e/0ea2f5263d3a2c3c131e5e9bbfa596ee/Reading_Query_Plans_Diagram_6_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 6: First ParquetExec&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;This &lt;code&gt;ParquetExec&lt;/code&gt; includes two groups of files. Each group can contain one or many files, but in this example, there is one file in each group. The node executes the groups in parallel and reads the files in each group sequentially. So, in this example, the two files are read in parallel.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;1/1/237/2cbb3992-4607-494d-82e4-66c480123189.parquet&lt;/code&gt;: this is the path of the file in object storage. It is in the structure &lt;code&gt;db_id/table_id/partition_hash_id/uuid_of_the_file.parquet&lt;/code&gt;, and each segment, respectively, tells us:
    &lt;ul&gt;
      &lt;li&gt;Which database and table are queried&lt;/li&gt;
      &lt;li&gt;Which partition the file belongs to (you can count how many partitions this query reads)&lt;/li&gt;
      &lt;li&gt;Which file it is&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;code&gt;projection=[__chunk_order, city, state, time]&lt;/code&gt;: there are many columns in this table, but the node only reads these four. The &lt;code&gt;__chunk_order&lt;/code&gt; column is an artificial column the InfluxDB code generates to keep the chunks/files ordered for deduplication.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;output_ordering=[state@2 ASC, city@1 ASC, time@3 ASC, __chunk_order@0 ASC]&lt;/code&gt;: this &lt;code&gt;ParquetExec&lt;/code&gt; node will sort its output on &lt;code&gt;state ASC, city ASC, time ASC, __chunk_order ASC&lt;/code&gt;. InfluxDB automatically sorts Parquet files when storing them to improve storage compression and query efficiency.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;predicate=time@5 &amp;gt;= 200 AND time@5 &amp;lt; 700 AND state@4 = MA&lt;/code&gt;: This is a filter in the query used for data pruning.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;pruning_predicate=time_max@0 &amp;gt;= 200 AND time_min@1 &amp;lt; 700 AND state_min@2 &amp;lt;= MA AND MA &amp;lt;= state_max@3&lt;/code&gt;: this is the actual pruning predicate transformed from the predicate above. It is used to filter files outside that predicate. At this time (Dec 2023), InfluxDB 3.0 only filters files based on &lt;code&gt;time.&lt;/code&gt; Note that this predicate is for pruning &lt;strong&gt;files from the chosen partitions&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RecordBatchesExec&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/ec05a891bde845a281153429961d1345/1b38fc013a351666de6ce952ab0b2f6c/Reading_Query_Plans_Diagram_7_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 7: RecordBatchesExec&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Data from the ingester can be in many chunks, but often, as in this example, there is only one. Like the &lt;code&gt;ParquetExec&lt;/code&gt; nodes, this node only sends data from four columns to the output. We call the action of &lt;strong&gt;filtering columns&lt;/strong&gt; a &lt;strong&gt;projection pushdown&lt;/strong&gt;, hence the name &lt;code&gt;projection&lt;/code&gt; in the query plan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second &lt;code&gt;ParquetExec&lt;/code&gt;&lt;/strong&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2323567c2c7246e09f48f42c7e3305be/08d4710a016916b2d021e76888ce3519/Reading_Query_Plans_Diagram_8_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 8: Second ParquetExec&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Reading the second &lt;code&gt;ParquetExec&lt;/code&gt; node is similar to the one above. Note that the files in both &lt;code&gt;ParquetExec&lt;/code&gt; nodes belong to the same partition (&lt;code&gt;237&lt;/code&gt;).&lt;/p&gt;

&lt;h4 id="data-scanning-structures"&gt;Data-scanning structures&lt;/h4&gt;

&lt;p&gt;Why do we send Parquet files from the same partition to different &lt;code&gt;ParquetExec&lt;/code&gt; nodes? There are many reasons, but two major ones are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;To minimize the work required for deduplication by splitting the non-overlaps from the overlaps (which is the case in this example).&lt;/li&gt;
  &lt;li&gt;To improve parallelism by splitting the non-overlaps.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;How do we know that data overlaps?&lt;/strong&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/0d779aaf6b3c4bb0929e206d35ea7acd/9960eb11532c28ba7385e229da209023/Reading_Query_Plans_Diagram_9_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 9: DeduplicationExec is a signal of overlapped data&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DeduplicationExec&lt;/code&gt; in Figure 9 tells us that the preceding data (i.e., the data below it) overlaps. More specifically, data in two files overlaps and/or overlaps the data from the ingesters.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code&gt;FilterExec: time@3 &amp;gt;= 200 AND time@3 &amp;lt; 700 AND state@2 = MA&lt;/code&gt;: this is where we filter the data, keeping only rows that satisfy the conditions &lt;code&gt;time@3 &amp;gt;= 200 AND time@3 &amp;lt; 700 AND state@2 = MA&lt;/code&gt;. The previous pruning step only removes data when possible; it does not guarantee that all non-matching data is pruned. We need this filter to perform complete and precise filtering.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;CoalesceBatchesExec: target_batch_size=8192&lt;/code&gt; is a way to group small data into larger groups if possible. Refer to the &lt;a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/index.html"&gt;DataFusion documentation&lt;/a&gt; for how it works.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;SortExec: expr=[state@2 ASC,city@1 ASC,time@3 ASC,__chunk_order@0 ASC]&lt;/code&gt;: this sorts data on &lt;code&gt;state ASC, city ASC, time ASC, __chunk_order ASC&lt;/code&gt;. Note that this sort only applies to data from ingesters because data from Parquet files is already sorted in that order.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;UnionExec&lt;/code&gt; is simply a place to pull many streams together. It is fast to execute and does not merge anything.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;SortPreservingMergeExec: [state@2 ASC,city@1 ASC,time@3 ASC,__chunk_order@0 ASC]&lt;/code&gt;: this operation merges pre-sorted streams. When you see this, you know the data below it is already sorted and the output is in one stream.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;DeduplicateExec: [state@2 ASC,city@1 ASC,time@3 ASC]&lt;/code&gt;: this operation deduplicates sorted data strictly from one input stream. That is why you often see &lt;code&gt;SortPreservingMergeExec&lt;/code&gt; under &lt;code&gt;DeduplicateExec,&lt;/code&gt; but it is not required. As long as the input to &lt;code&gt;DeduplicateExec&lt;/code&gt; is a single stream of sorted data, it will work correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How do we know data doesn’t overlap?&lt;/strong&gt;
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/fb10bd19ad844a09a5ffeafd384b743d/a596e98d2993b5c092fa17df6dcf8b91/Reading_Query_Plans_Diagram_10_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 10: No &lt;code&gt;DeduplicateExec&lt;/code&gt; means files do not overlap&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a &lt;code&gt;ParquetExec&lt;/code&gt; or &lt;code&gt;RecordBatchesExec&lt;/code&gt; branch doesn’t lead to a &lt;code&gt;DeduplicateExec,&lt;/code&gt; we know that the files handled by that &lt;code&gt;Exec&lt;/code&gt; don’t overlap.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code&gt;ProjectionExec: expr=[city@0 as city]&lt;/code&gt;: this filters column data and only sends out data from column &lt;code&gt;city.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id="other-executionplans"&gt;Other ExecutionPlans&lt;/h4&gt;

&lt;p&gt;Now let’s look at the rest of the plan.
&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4f6035dd4cc3458ab772b1d45bb47e43/2d1c7d91f7a7f8c7227edc548975b643/Reading_Query_Plans_Diagram_11_01.17.24v1.png" alt="" /&gt;
&lt;em&gt;Figure 11: The rest of the plan structure&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code&gt;UnionExec&lt;/code&gt;: unions data streams. Note that the number of output streams is the same as the number of input streams. The ExecutionPlan above is responsible for merging or splitting the streams further. This &lt;code&gt;UnionExec&lt;/code&gt; is an intermediate step of the merge/split.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=3&lt;/code&gt;: this splits three input streams into four output streams in a round-robin fashion. This cluster has four cores available, so this RepartitionExec partitions the data into four streams to increase parallel execution.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;AggregateExec: mode=Partial, gby=[city@0 as city], aggr=[COUNT(Int64(1))]&lt;/code&gt;: this groups data into groups that have the same values of &lt;code&gt;city&lt;/code&gt;. Because there are four input streams, each stream is aggregated separately, which creates four output streams. It also means that the output data is not fully aggregated as indicated by the &lt;code&gt;mode=Partial&lt;/code&gt; flag.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;RepartitionExec: partitioning=Hash([city@0], 4), input_partitions=4&lt;/code&gt;: this repartitions data on &lt;code&gt;hash(city)&lt;/code&gt; into four streams so that the same city goes into the same stream.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;AggregateExec: mode=FinalPartitioned, gby=[city@0 as city], aggr=[COUNT(Int64(1))]&lt;/code&gt;: because rows for the same city are in the same stream, we only need to do the final aggregation.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;SortExec: expr=[city@0 ASC NULLS LAST]&lt;/code&gt;: sort each of the four data streams on &lt;code&gt;city&lt;/code&gt; per the query request.&lt;/li&gt;
  &lt;li&gt;&lt;code&gt;SortPreservingMergeExec: [city@0 ASC NULLS LAST]&lt;/code&gt;: (sort) merge four sorted streams to return the final results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you see that a plan reads many files and performs deduplication on all of them, you may ask: “Do all the files overlap or not?” The answer is either yes or no, depending on the situation. Sometimes, the compactor may be behind, and if you give it some time to compact small and overlapped files, your query will read fewer files and run faster. If there are still a lot of files, you may want to check the workload of your compactor and add more resources as needed. There are other reasons we deduplicate non-overlapped files, such as the memory limitations of your querier, but those are topics for a future blog post.&lt;/p&gt;

&lt;h1 id="conclusion"&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; is a way to understand how InfluxDB executes your query and why it’s fast or slow. You can often rewrite your query to add more filters or remove unnecessary sorting (&lt;code&gt;order by&lt;/code&gt; in the query) to make your query run faster. Other times, queries are slow because your system lacks resources. In that case, it’s time to reassess the cluster configuration or consult the InfluxDB support team.&lt;/p&gt;
</description>
      <pubDate>Mon, 29 Jan 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/how-read-influxdb-3-query-plans/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/how-read-influxdb-3-query-plans/</guid>
      <category>Developer</category>
      <author>Nga Tran (InfluxData)</author>
    </item>
    <item>
      <title>Partitioning Data for Query Performance in InfluxDB 3</title>
      <description>&lt;p&gt;Query performance is critical in any database. Data partitioning is a mechanism that helps prune unnecessary data, allowing queries to run faster. However, there are always trade-offs between large and small numbers of partitions. For instance, fine-grained partitioning on high cardinality columns can reduce performance. This post describes different partitioning schemes supported by &lt;a href="https://www.influxdata.com/products/influxdb-overview/"&gt;InfluxDB 3&lt;/a&gt; and explains their trade-offs.&lt;/p&gt;

&lt;p&gt;Only InfluxData’s &lt;a href="https://www.influxdata.com/products/influxdb-cloud/dedicated/"&gt;Cloud Dedicated&lt;/a&gt; and &lt;a href="https://www.influxdata.com/products/influxdb-clustered/"&gt;Clustered&lt;/a&gt; products support user-defined partitioning. Note that InfluxDB &lt;a href="https://www.influxdata.com/products/influxdb-cloud/serverless/"&gt;Cloud Serverless&lt;/a&gt; always partitions data by day and there is no option to modify that.&lt;/p&gt;

&lt;h1 id="default-partitioning"&gt;Default partitioning&lt;/h1&gt;

&lt;p&gt;Because InfluxDB is a time series database, it filters most queries by a time range. If you load data without specifying how your table is partitioned, by default, InfluxDB partitions your data by day, as shown in Figure 1. It stores all data with &lt;code&gt;time&lt;/code&gt; on the same day in the same partition. In practice, this is a good partitioning scheme for most &lt;strong&gt;moderate-volume&lt;/strong&gt; use cases and helps balance ingest efficiency and query performance.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/0971dcdb6b0b48bbabcc26957b85f128/142079d4ea6d82aaf44156e2d1565235/Partitioning_Data_for_Query_Performance_Diagram_1_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
  &lt;em&gt;Figure 1: data of my_table partitioned by day&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Queries that select a specific time range only need to read data from the relevant partitions. For example, the following SQL query only needs to read data from two partitions, &lt;code&gt;2025-01-01&lt;/code&gt; and &lt;code&gt;2025-01-02.&lt;/code&gt; InfluxDB 3 uses information from its catalog to avoid reading any other partitions.&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT ...
FROM   my_table
WHERE  time &amp;gt;= '2025-01-01 18:00:00' AND time &amp;lt;= '2025-01-02 03:00:00';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Query 1: filtering data on a time range&lt;/em&gt;&lt;/p&gt;

&lt;h1 id="user-defined--custom-partitioning"&gt;User-defined / custom partitioning&lt;/h1&gt;

&lt;p&gt;For cases when a single day contains too much data (e.g., GBs of &lt;a href="https://www.influxdata.com/glossary/apache-parquet/"&gt;Parquet&lt;/a&gt; files), and the query includes filters on additional tag columns, InfluxDB allows you to partition your data on your tag(s) and time. For instance, if your common queries request data for a specific city in a specific time range, as in the following query, you can use custom partitioning to partition the data on &lt;code&gt;city_name&lt;/code&gt; and &lt;code&gt;time.&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT ...
FROM   my_table
WHERE  time &amp;gt;= '2025-01-01 18:00:00' AND time &amp;lt;= '2025-01-02 03:00:00'
   AND city_name = 'Boston';
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;em&gt;Query 2: filtering data for a city and a time range&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you partition data on &lt;code&gt;city_name&lt;/code&gt; and &lt;code&gt;day,&lt;/code&gt; you will have &lt;strong&gt;more partitions&lt;/strong&gt; and they will look like this:&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/fc81cd68a9484fdeb69a6f2d4d97d812/8f9ccfc4ec31e010df826474d5682515/Partitioning_Data_for_Query_Performance_Diagram_2_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
  &lt;em&gt;Figure 2: data of my_table partitioned by city_name and day&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Query 2 only needs to read two partitions, &lt;code&gt;Boston | 2025-01-01&lt;/code&gt; and &lt;code&gt;Boston | 2025-01-02.&lt;/code&gt; If your data contains many cities, each partition will be smaller than the default daily partition, and your query will need to read less data.&lt;/p&gt;

&lt;p&gt;Note that your custom partitioning tag columns must always have values for InfluxDB to store them in the correct partitions. Without a value, InfluxDB won’t have enough information to apply filters, and your query will end up reading all partitions.&lt;/p&gt;

&lt;h1 id="the-cost-of-having-too-many-partitions"&gt;The cost of having too many partitions&lt;/h1&gt;

&lt;p&gt;Partition designs always come with trade-offs. While smaller partitions help reduce the amount of data your query reads, this does not always mean your query will run faster. There are also side effects on ingester and compactor workloads. Having smaller partitions usually means having more, smaller Parquet files, which leads to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Less storage efficiency—more files require more space to store the same data.&lt;/li&gt;
  &lt;li&gt;A higher ingester workload to group data into smaller partitions and files.&lt;/li&gt;
  &lt;li&gt;A higher compactor workload to compact more partitions and smaller files.&lt;/li&gt;
  &lt;li&gt;A higher metadata catalog volume—more partitions and files require more pruning processing at query time.&lt;/li&gt;
  &lt;li&gt;Queries that lack predicates on your partitioning columns may end up reading many partitions and smaller files, degrading performance.&lt;/li&gt;
&lt;/ol&gt;
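
&lt;p&gt;As an illustration of the last point, under the scheme in Figure 2 (partitioned by &lt;code&gt;city_name&lt;/code&gt; and day), a query that filters only on time cannot prune by city and must consider every city’s partitions for that time range:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-sql"&gt;SELECT ...
FROM   my_table
WHERE  time &amp;gt;= '2025-01-01 18:00:00' AND time &amp;lt;= '2025-01-02 03:00:00';
&lt;/code&gt;&lt;/pre&gt;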

&lt;p&gt;The following schemes let you control the number of partitions.&lt;/p&gt;

&lt;h1 id="user-controlled-number-of-partitions"&gt;User-controlled number of partitions&lt;/h1&gt;

&lt;p&gt;If you want to control the number of partitions for a data set containing many cities but there is not much data for each city, you may consider partitioning your data over a greater time range. Figure 3 illustrates data partitioned by &lt;code&gt;city_name&lt;/code&gt; and &lt;code&gt;month.&lt;/code&gt; In this case, Query 2 will read data from one partition, &lt;code&gt;Boston | 2025-01,&lt;/code&gt; which covers the entire month of data for Boston.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6d0a1a784c644afbb1deed1b0ec50506/e9310a09ca04c315635c6d8a8548f32e/Partitioning_Data_for_Query_Performance_Diagram_3_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
  &lt;em&gt;Figure 3:  data of my_table partitioned by city_name and month&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note: Defining a retention policy for a specific time allows you to control the number of partitions in the database.&lt;/p&gt;

&lt;h1 id="user-defined-number-of-partitions"&gt;User-defined number of partitions&lt;/h1&gt;

&lt;p&gt;At this time (Jan 2024), InfluxData is working on a feature named &lt;strong&gt;Server-Side Bucketing&lt;/strong&gt;, which will provide users with a simpler method for setting their desired number of partitions. For example, if you don’t know how many cities will be in your data set but you know there will be many, you can cap the number of partitions by hashing many cities into the same partition. Figure 4 shows data partitioned by &lt;code&gt;hash(city_name) % 10&lt;/code&gt; and &lt;code&gt;day.&lt;/code&gt; In this case, there will be a maximum of 10 * 365 = 3,650 partitions for a year of data.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/93b968c691424bc7bba1d990aa7ab7c9/0c4c3f85d3e0407354f13251ce22c9b2/Partitioning_Data_for_Query_Performance_Diagram_4_01.17.24v1.png" alt="" /&gt;
&lt;br /&gt;
  &lt;em&gt;Figure 4: Data of my_table partitioned by hash(city_name) % 10 and day&lt;/em&gt;&lt;/p&gt;
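
&lt;p&gt;To make the arithmetic above concrete, the sketch below computes such a two-part partition key for a row. This is an illustrative Python sketch of the hashing scheme, not InfluxDB’s actual Server-Side Bucketing implementation; the function name and the bucket count of 10 are assumptions that mirror the example above.&lt;/p&gt;

```python
import hashlib

NUM_BUCKETS = 10  # caps the city dimension at 10 partitions

def partition_key(city_name, timestamp):
    """Return a (bucket, day) partition key, mirroring hash(city_name) % 10 and day."""
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.sha256(city_name.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    day = timestamp[:10]  # 'YYYY-MM-DD' prefix of an RFC 3339 timestamp
    return (bucket, day)

# Rows for different cities may share a bucket; rows for the same city always do.
k1 = partition_key("Boston", "2024-01-17T08:30:00Z")
k2 = partition_key("Boston", "2024-01-17T23:59:59Z")
print(k1 == k2)  # True: same city and same day land in the same partition
```

&lt;p&gt;With at most 10 buckets and 365 days, a year of data yields at most 3,650 partitions regardless of how many cities appear.&lt;/p&gt;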

&lt;h1 id="final-thoughts"&gt;Final thoughts&lt;/h1&gt;

&lt;p&gt;Finding the right partitioning plan for your data helps increase your query performance, especially for high-volume ingest cases. However, you need to understand the nature and scale of your data to implement the right design and to ensure you don’t have too many partitions in your system. Remember to avoid partitioning your data on &lt;em&gt;optional&lt;/em&gt; tag values. If your data lacks a value, InfluxDB won’t have enough information to apply filters, and your query will end up reading all partitions.&lt;/p&gt;

&lt;p&gt;Refer to &lt;a href="https://www.influxdata.com/blog/partitioning-performance-sharding-database-system/"&gt;Partitioning for Performance in a Sharding Database System&lt;/a&gt; for other roles of data partitioning.&lt;/p&gt;
</description>
      <pubDate>Fri, 19 Jan 2024 08:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/partitioning-data-query-performance-influxdb-3/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/partitioning-data-query-performance-influxdb-3/</guid>
      <category>Developer</category>
      <author>Nga Tran (InfluxData)</author>
    </item>
    <item>
      <title>InfluxDB 3: System Architecture</title>
      <description>&lt;p&gt;InfluxDB 3 (previously known as InfluxDB IOx) is a scalable cloud database that offers high performance for both data loading and querying, with a focus on time series use cases. This article describes the system architecture of the database.&lt;/p&gt;

&lt;p&gt;Figure 1 shows the architecture of InfluxDB 3, which includes four major components and two main types of storage.&lt;/p&gt;

&lt;p&gt;The four components each operate almost independently and are responsible for:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;&lt;strong&gt;data ingestion&lt;/strong&gt;&lt;/em&gt; illustrated in blue,&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;&lt;strong&gt;data querying&lt;/strong&gt;&lt;/em&gt; demonstrated in green,&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;&lt;strong&gt;data compaction&lt;/strong&gt;&lt;/em&gt; shown in red, and&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;&lt;strong&gt;garbage collection&lt;/strong&gt;&lt;/em&gt; drawn in pink.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of the two storage types, one, named &lt;strong&gt;Catalog&lt;/strong&gt;, is dedicated to cluster metadata, while the other, named &lt;strong&gt;Object Storage&lt;/strong&gt; (for example, Amazon S3), is much larger and stores the actual data. In addition to these main storage locations, there are much smaller data stores, the &lt;strong&gt;Write Ahead Logs&lt;/strong&gt; (WAL), used only by the ingestion component for crash recovery during data loading.&lt;/p&gt;

&lt;p&gt;The arrows in the diagram show the direction of data flow; how the components communicate to pull or push data is beyond the scope of this article. For data already persisted, we designed the system so that the Catalog and Object Storage are the only state, and each component only reads these storages without needing to communicate with other components. For not-yet-persisted data, the data ingestion component manages the state and sends it to the data querying component when a query arrives. Let us delve into this architecture by going through each component one by one.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/50Gb9y9OuHd42InqbcfWRl/defa279fd370649385e718eefde61f2d/Figure_1_InfluxDB_3-0_Architecture.png" alt="Figure 1 InfluxDB 3-0 Architecture" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 1: InfluxDB 3.0 Architecture&lt;/figcaption&gt;

&lt;h2 id="data-ingestion"&gt;Data ingestion&lt;/h2&gt;

&lt;p&gt;Figure 2 demonstrates the design of data ingestion in InfluxDB 3. Users write data to the &lt;em&gt;&lt;strong&gt;Ingest Router&lt;/strong&gt;&lt;/em&gt;, which shards the data to one of the &lt;em&gt;&lt;strong&gt;Ingesters&lt;/strong&gt;&lt;/em&gt;. The number of ingesters in the cluster can be scaled up and down depending on the data workload. We use &lt;a href="https://www.infoworld.com/article/3656915/scaling-throughput-and-performance-in-a-sharding-database-system.html"&gt;these scaling principles&lt;/a&gt; to shard the data. Each ingester has an attached storage, such as Amazon EBS, used as a write ahead log (WAL) for crash recovery.&lt;/p&gt;

&lt;p&gt;Each ingester performs these major steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Identify tables of the data&lt;/strong&gt;: Unlike many other databases, InfluxDB does not require users to define tables and their column schemas before loading data. The ingester discovers them and adds them implicitly.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Validate data schema&lt;/strong&gt;: The data types provided in a user’s write are strictly validated synchronously with the write request. This prevents type conflicts propagating to the rest of the system and provides the user with instantaneous feedback.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Partition the data&lt;/strong&gt;: In a large-scale database such as InfluxDB, there are a lot of &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;benefits to partitioning the data&lt;/a&gt;. The ingester is responsible for the partitioning job and currently it partitions the data by day on the ‘time’ column. If the ingesting data has no time column, the Ingest Router implicitly adds it and sets its value as the data loading time.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Deduplicate the data&lt;/strong&gt;: In time series use cases, it is common to see the same data ingested multiple times, so InfluxDB 3 performs the &lt;a href="https://www.infoworld.com/article/3683915/using-deduplication-for-eventually-consistent-transactions.html"&gt;deduplication process&lt;/a&gt;. The ingester builds an efficient multi-column sort merge plan for the deduplication job. Because InfluxDB uses &lt;a href="https://github.com/apache/arrow-datafusion"&gt;DataFusion&lt;/a&gt; for its Query Execution and &lt;a href="https://github.com/apache/arrow-rs"&gt;Arrow&lt;/a&gt; as its internal data representation, building a sort merge plan involves simply putting DataFusion’s sort and merge operators together. Running that sort merge plan effectively &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/"&gt;on multiple columns&lt;/a&gt; is part of the work the InfluxDB team contributed to DataFusion.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Persist the data&lt;/strong&gt;: The processed and sorted data then persists as a &lt;a href="https://www.influxdata.com/glossary/apache-parquet/"&gt;Parquet&lt;/a&gt; file. Because data encodes/compresses very effectively when sorted on the lowest-cardinality columns first, the ingester picks the lowest-cardinality columns for the sort order mentioned above. As a result, the file is often 10-100x smaller than its raw form.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Update the Catalog&lt;/strong&gt;: The ingester then updates the Catalog about the existence of the newly created file. This is a signal to let the other two components, &lt;strong&gt;Querier&lt;/strong&gt; and &lt;strong&gt;Compactor&lt;/strong&gt;, know that new data has arrived.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
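
&lt;p&gt;The deduplication step above can be pictured with a small sketch: sort the rows on the key columns (tags plus time), then merge rows that share a key, keeping the latest write. This is a simplified Python illustration of a multi-column sort-and-deduplicate pass, not the actual DataFusion operators InfluxDB uses.&lt;/p&gt;

```python
from itertools import groupby

def sort_dedupe(rows, key_columns):
    """Sort rows on the key columns, then keep the last-written row per key."""
    row_key = lambda r: tuple(r[c] for c in key_columns)
    keyed = sorted(rows, key=row_key)  # stable sort preserves ingest order per key
    out = []
    for _, group in groupby(keyed, key=row_key):
        *_, last = group  # the later write wins for duplicate keys
        out.append(last)
    return out

rows = [
    {"city": "Boston", "time": 1, "temp": 50},
    {"city": "Boston", "time": 1, "temp": 52},  # duplicate key: same city and time
    {"city": "Denver", "time": 1, "temp": 40},
]
deduped = sort_dedupe(rows, ["city", "time"])
print(len(deduped))  # 2: the two Boston rows collapse, keeping temp 52
```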

&lt;p&gt;Even though the ingester performs many steps, InfluxDB 3 optimizes the write path, keeping write latency minimal, on the order of milliseconds. This may lead to a lot of small files in the system. However, we do not keep them around for long. The compactors, described in a later section, compact these files in the background.&lt;/p&gt;

&lt;p&gt;The ingesters also support fault tolerance, which is beyond the scope of this article. The detailed design and implementation of ingesters deserve their own blog posts.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/220V0dp3oe4CJtVz86VyEv/177b8e4a9b33f0df1c0c417fc2365cce/Figure_2_Data_Ingestion.png" alt="Figure 2 Data Ingestion" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 2: Data Ingestion&lt;/figcaption&gt;

&lt;h2 id="data-querying"&gt;Data querying&lt;/h2&gt;

&lt;p&gt;Figure 3 shows how InfluxDB 3 queries data. Users send a &lt;a href="https://www.influxdata.com/glossary/sql/"&gt;SQL&lt;/a&gt; or an &lt;a href="https://www.influxdata.com/products/influxql/"&gt;InfluxQL&lt;/a&gt; query to the &lt;strong&gt;Query Router&lt;/strong&gt;, which forwards it to a &lt;strong&gt;Querier&lt;/strong&gt;. The querier reads the needed data, builds a plan for the query, runs the plan, and returns the result to the users. The number of queriers can be scaled up and down depending on the query workload, using the same &lt;a href="https://www.infoworld.com/article/3656915/scaling-throughput-and-performance-in-a-sharding-database-system.html"&gt;scaling principles&lt;/a&gt; used in the design of the ingesters.&lt;/p&gt;

&lt;p&gt;Each querier performs these major tasks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Cache metadata&lt;/strong&gt;: To support a high query workload effectively, the querier keeps its metadata cache synchronized with the central catalog so that it always has up-to-date tables and their ingestion metadata.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Read and cache data&lt;/strong&gt;: When a query arrives, if its data is not available in the querier’s data cache, the querier reads the data into the cache first, because statistics show that the same files are read multiple times. The querier caches only the parts of a file needed to answer the query; the parts that the query does not need, based on the querier’s pruning strategy, are never cached.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Get not-yet-persisted data from ingesters&lt;/strong&gt;: Because there may be data in the ingesters not yet persisted into the Object Storage, the querier must communicate with the corresponding ingesters to get that data. From this communication, the querier also learns from the ingester whether there are newer tables and data to invalidate and update its caches to have an up-to-date view of the whole system.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Build and execute an optimal query plan&lt;/strong&gt;: Like many other databases, the InfluxDB 3 Querier contains a Query Optimizer. The querier builds the best-suited query plan (aka optimal plan) that executes on the data from the cache and ingesters, and finishes in the least amount of time. Similar to the design of the ingester, the querier uses &lt;a href="https://www.influxdata.com/glossary/apache-datafusion/"&gt;DataFusion&lt;/a&gt; and &lt;a href="https://www.influxdata.com/glossary/apache-arrow/"&gt;Arrow&lt;/a&gt; to build and execute custom query plans for SQL (and soon InfluxQL). The querier takes advantage of the &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;data partitioning&lt;/a&gt; done in the ingester to parallelize its query plan and prune unnecessary data before executing the plan. The querier also applies common techniques of &lt;a href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/"&gt;predicate and projection pushdown&lt;/a&gt; to further prune data as soon as possible.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
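
&lt;p&gt;As a concrete illustration of the pruning mentioned in the last step, the sketch below keeps only the files whose min/max time statistics overlap the query’s time range. This is a hypothetical simplification: real pruning in the querier also uses partition information and other column statistics.&lt;/p&gt;

```python
from collections import namedtuple

FileStats = namedtuple("FileStats", "name min_time max_time")

def prune(files, query_start, query_end):
    """Keep only files whose [min_time, max_time] range overlaps the query range."""
    kept = []
    for f in files:
        # Two ranges overlap when each one starts no later than the other ends.
        if min(f.max_time, query_end) >= max(f.min_time, query_start):
            kept.append(f)
    return kept

files = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]
print([f.name for f in prune(files, 150, 250)])  # ['f2.parquet', 'f3.parquet']
```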

&lt;p&gt;Even though the data within each file contains no duplicates, data in different files, as well as not-yet-persisted data sent from the ingesters, may include duplicates. Thus, the deduplication process is also necessary at query time. Similar to the ingester, the querier uses the same multi-column sort merge operators described above for the deduplication job. Unlike in the ingester, these operators are just one part of a larger and more complex query plan built to execute the query, which ensures the data streams through the rest of the plan after deduplication.&lt;/p&gt;

&lt;p&gt;It is worth noting that even with an advanced multi-column sort merge operator, its execution cost is not trivial. The querier further optimizes the plan to deduplicate only overlapped files, in which duplicates may occur. Furthermore, to provide high query performance, InfluxDB 3 avoids as much deduplication at query time as possible by compacting data beforehand. The next section describes the compaction process.&lt;/p&gt;

&lt;p&gt;The detailed design and implementation of the querier tasks described briefly above deserve their own blog posts.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/27K92eoGIxT5oEzrobtdWz/7244b1ac3940f9f6fe620300707c45a8/Figure_3_Data_Querying.png" alt="Figure 3 Data Querying" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 3: Data Querying&lt;/figcaption&gt;

&lt;h2 id="data-compaction"&gt;Data compaction&lt;/h2&gt;

&lt;p&gt;As described in the “Data ingestion” section, to reduce the ingest latency, the amount of data processed and persisted into each file by an ingester is very minimal. This leaves many small files stored in the Object Storage which in turn create significant I/O during query time and reduce the query performance. Furthermore, as discussed in the “Data querying” section, overlapped files may contain duplicates that need deduplication during query time, which reduces query performance. The job of data compaction is to compact many small files ingested by the ingesters to fewer, larger, and non-overlapped files to gain query performance.&lt;/p&gt;

&lt;p&gt;Figure 4 illustrates the architecture of the data compaction, which includes one or many &lt;strong&gt;Compactors&lt;/strong&gt;. Each compactor runs a background job that reads newly ingested files and compacts them together into fewer, larger, and non-overlapped files. The number of compactors can be scaled up and down depending on the compacting workload, which is a function of the number of tables with new data files, the number of new files per table, how large the files are, how many existing files the new files overlap with, and how wide a table is (aka how many columns are in a table).&lt;/p&gt;

&lt;p&gt;In the article, &lt;a href="https://www.infoworld.com/article/3685496/compactor-a-hidden-engine-of-database-performance.html"&gt;Compactor: A hidden engine of database performance&lt;/a&gt;, we described the detailed tasks of a compactor: how it builds an optimized deduplication plan that merges data files, the sort order of different-column files that helps with the deduplication, using compaction levels to achieve non-overlapped files while minimizing recompactions, and building an optimized deduplication plan on a mix of non-overlapped and overlapped files in the querier.&lt;/p&gt;

&lt;p&gt;Like the design of the ingester and querier, the compactor uses DataFusion and Arrow to build and execute custom query plans. Actually, all three components share the same compaction sub-plan that covers both data deduplication and merge.&lt;/p&gt;

&lt;p&gt;After small and/or overlapped files are compacted into larger, non-overlapped files, the original files must be deleted to reclaim space. To avoid deleting a file that is being read by a querier, the compactor never hard deletes any files. Instead, it marks the files as soft deleted in the catalog, and another background service named Garbage Collector eventually deletes the soft deleted files to reclaim storage.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/2ulVRsGGz2683r4VHRE4hT/a894b1cc777a0592f42748e15877c943/Figure_4_Data_Compaction.png" alt="Figure 4 Data Compaction" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 4: Data Compaction&lt;/figcaption&gt;

&lt;h2 id="garbage-collection"&gt;Garbage collection&lt;/h2&gt;

&lt;p&gt;Figure 5 illustrates the design of InfluxDB 3.0 garbage collection, which is responsible for data retention and space reclamation. The &lt;strong&gt;Garbage Collector&lt;/strong&gt; runs scheduled background jobs that soft and hard delete data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data retention:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;InfluxDB provides an option for users to define their data retention policy, which is saved in the catalog. A scheduled background job of the garbage collector scans the catalog for files whose data falls outside the retention period and marks them as soft deleted in the catalog. This signals the queriers and compactors that these files are no longer available for querying and compacting, respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Space reclamation:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another scheduled background job of the garbage collector reads the catalog for metadata of the files that were soft deleted a certain time ago. It then removes the corresponding data files from the Object Storage and also removes the metadata from the Catalog.&lt;/p&gt;
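
&lt;p&gt;The two background jobs above can be sketched together: one pass soft deletes files older than the retention period, and a later pass hard deletes files that have been soft deleted for long enough. The catalog here is just an in-memory list, and the retention and grace periods are made-up numbers; this is a hypothetical stand-in for the real Catalog and jobs.&lt;/p&gt;

```python
RETENTION = 30  # days of data to keep
GRACE = 7       # days between a soft delete and the hard delete

def soft_delete_expired(catalog, now):
    """Retention job: mark files whose data is past the retention period."""
    for f in catalog:
        if f["soft_deleted_at"] is None and now - f["max_time"] > RETENTION:
            f["soft_deleted_at"] = now  # queriers and compactors now skip this file

def hard_delete(catalog, now):
    """Reclamation job: drop files soft deleted at least GRACE days ago."""
    return [f for f in catalog
            if f["soft_deleted_at"] is None or GRACE > now - f["soft_deleted_at"]]

catalog = [
    {"path": "a.parquet", "max_time": 0,  "soft_deleted_at": None},  # old data
    {"path": "b.parquet", "max_time": 95, "soft_deleted_at": None},  # recent data
]
soft_delete_expired(catalog, now=100)    # day 100: 'a' is past the retention period
catalog = hard_delete(catalog, now=110)  # day 110: 'a' exceeds the grace period
print([f["path"] for f in catalog])      # ['b.parquet']
```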

&lt;p&gt;Note that the soft deleted files came from different sources: compacted files deleted by the compactors, files outside the retention period deleted by the garbage collector itself, and files deleted through a delete command that InfluxDB 3 plans to support in the future. The hard delete job does not need to know where the soft deletes come from and treats them all the same.&lt;/p&gt;

&lt;p&gt;Soft and hard deletes are another large topic that involves the work in the ingesters, queriers, compactors, and garbage collectors and deserve their own blog post.&lt;/p&gt;

&lt;p&gt;&lt;img style="padding: 20px 0px;" src="//images.ctfassets.net/o7xu9whrs0u9/4jDBsLqSdR0QwObS9qO54O/32079f75b992d52925b59b8092ee8a6c/Figure_5_Garbage_Collection.png" alt="Figure 5 Garbage Collection" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 5: Garbage Collection&lt;/figcaption&gt;

&lt;h2 id="influxdb-3-cluster-setup"&gt;InfluxDB 3 cluster setup&lt;/h2&gt;

&lt;p&gt;Other than the queriers making requests to their corresponding ingesters for not-yet-persisted data, the four components do not talk with each other directly. All communication is done via the Catalog and Object Storage. The ingesters and queriers do not even know of the existence of the compactors and garbage collector. However, as emphasized above, InfluxDB 3 is designed to have all four components co-exist to deliver a high performance database.&lt;/p&gt;

&lt;p&gt;In addition to those major components, InfluxDB also has other services such as &lt;strong&gt;Billing&lt;/strong&gt; to bill customers based on their usage.&lt;/p&gt;

&lt;h2 id="catalog-storage"&gt;Catalog Storage&lt;/h2&gt;

&lt;p&gt;The InfluxDB 3.0 Catalog includes metadata about the data, such as databases (aka namespaces), tables, columns, and file information (e.g., file location, size, and row count). InfluxDB uses a Postgres-compatible database to manage its catalog. For example, a local cluster setup can use PostgreSQL, while an AWS cloud setup can use Amazon RDS.&lt;/p&gt;

&lt;h2 id="object-storage"&gt;Object Storage&lt;/h2&gt;

&lt;p&gt;InfluxDB 3 data storage contains only Parquet files, which can be stored on local disk for a local setup or in Amazon S3 for an AWS cloud setup. The database also works with Azure Blob Storage and Google Cloud Storage.&lt;/p&gt;

&lt;h2 id="influxdb-3-cluster-operation"&gt;InfluxDB 3 cluster operation&lt;/h2&gt;

&lt;p&gt;InfluxDB 3 customers can set up multiple dedicated clusters, each operating independently to avoid “noisy neighbor” issues. Every cluster utilizes its own dedicated computational resources and can function on one or more Kubernetes clusters. This isolation also limits the blast radius of reliability issues that could emerge in one cluster due to activities in another.&lt;/p&gt;

&lt;p&gt;Our innovative approach to infrastructure upgrades utilizes in-place updates of entire Kubernetes clusters. The fact that most of the state in the InfluxDB 3 cluster is stored outside the Kubernetes clusters, such as in S3 and RDS, facilitates this process.&lt;/p&gt;

&lt;p&gt;Our platform engineering system allows us to orchestrate operations across hundreds of clusters and offers customers control over specific cluster parameters that govern performance and costs. Continuous monitoring of each cluster’s health is part of our operations, allowing a small team to manage numerous clusters effectively in a rapidly evolving software environment.&lt;/p&gt;
</description>
      <pubDate>Tue, 27 Jun 2023 07:35:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/influxdb-3-0-system-architecture/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/influxdb-3-0-system-architecture/</guid>
      <category>Product</category>
      <author>Nga Tran, Paul Dix, Andrew Lamb, Marko Mikulicic (InfluxData)</author>
    </item>
    <item>
      <title>Compactor: A Hidden Engine of Database Performance</title>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published in &lt;a href="https://www.infoworld.com/article/3685496/compactor-a-hidden-engine-of-database-performance.html"&gt;InfoWorld&lt;/a&gt; and is reposted here with permission.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The compactor handles critical post-ingestion and pre-query workloads in the background on a separate server, enabling low latency for data ingestion and high performance for queries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The demand for high volumes of data has increased the need for databases that can handle both data ingestion and querying with the lowest possible latency (aka high performance). To meet this demand, database designs have shifted to prioritize minimal work during ingestion and querying, with other tasks being performed in the background as post-ingestion and pre-query.&lt;/p&gt;

&lt;p&gt;This article describes those tasks and how to run them on a completely separate server so that they do not share resources (CPU and memory) with the servers that handle data loading and reading.&lt;/p&gt;

&lt;h2 id="tasks-of-post-ingestion-and-pre-query"&gt;Tasks of post-ingestion and pre-query&lt;/h2&gt;

&lt;p&gt;The tasks that can proceed after the completion of data ingestion and before the start of data reading will differ depending on the design and features of a database. In this post, we describe the three most common of these tasks: data file merging, delete application, and data deduplication.&lt;/p&gt;

&lt;h3 id="data-file-merging"&gt;Data file merging&lt;/h3&gt;

&lt;p&gt;Query performance is an important goal of most databases, and good query performance requires data to be well organized, such as sorted and encoded (aka compressed) or indexed. Because  &lt;a href="http://paperhub.s3.amazonaws.com/b96c89e3bc63ebb074728bb776b72e23.pdf"&gt;query processing can handle encoded data without decoding it&lt;/a&gt;, and the less I/O a query needs to read the faster it runs, encoding a large amount of data into a few large files is clearly beneficial. In a traditional database, the process that organizes data into large files is performed during load time by merging ingesting data with existing data. Sorting and encoding or indexing are also needed during this data organization. Hence, for the rest of this article, we’ll discuss the sort, encode, and index operations hand in hand with the file merge operation.&lt;/p&gt;

&lt;p&gt;Fast ingestion has become more and more critical to handling large and continuous flows of incoming data and near real-time queries. To support fast performance for both data ingesting and querying, newly ingested data is not merged with the existing data at load time but stored in a small file (or small chunk in memory in the case of a database that only supports in-memory data). The file merge is performed in the background as a post-ingestion and pre-query task.&lt;/p&gt;

&lt;p&gt;A variation of the &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree"&gt;LSM tree&lt;/a&gt; (log-structured merge-tree) technique is usually used to merge them. With this technique, the small file that stores the newly ingested data should be organized (e.g., sorted and encoded) the same way as the existing data files, but because it holds only a small set of data, sorting and encoding it is trivial. The reason to organize all files the same way is explained in the section on data compaction below.&lt;/p&gt;

&lt;p&gt;Refer to this  &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;article on data partitioning&lt;/a&gt;  for examples of data-merging benefits.&lt;/p&gt;

&lt;h3 id="delete-application"&gt;Delete application&lt;/h3&gt;

&lt;p&gt;Similarly, the process of data deletion and update needs the data to be reorganized and takes time, especially for large historical datasets. To avoid this cost, data is not actually deleted when a delete is issued but a tombstone is added into the system to ‘mark’ the data as ‘soft deleted’. The actual delete is called ‘hard delete’ and will be done in the background.&lt;/p&gt;

&lt;p&gt;Updating data is often implemented as a delete followed by an insert, and hence, its process and background tasks will be the ones of the data ingestion and deletion.&lt;/p&gt;

&lt;h3 id="data-deduplication"&gt;Data deduplication&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/time-series-database/"&gt;Time series databases&lt;/a&gt;  such as InfluxDB accept ingesting the same data more than once but then apply deduplication to return non-duplicate results. Specific examples of deduplication applications can be found in  &lt;a href="https://www.infoworld.com/article/3683915/using-deduplication-for-eventually-consistent-transactions.html"&gt;this article on deduplication&lt;/a&gt;. Like the process of data file merging and deletion, the deduplication will need to reorganize data and thus is an ideal task for performing in the background.&lt;/p&gt;

&lt;h2 id="data-compaction"&gt;Data compaction&lt;/h2&gt;

&lt;p&gt;The background tasks of post-ingestion and pre-query are commonly known as data compaction because the output of these tasks typically contains less data and is more compressed. Strictly speaking, compaction is a background loop that finds data suitable for compacting and then compacts it. However, because there are many related tasks as described above, and because these tasks usually touch the same data set, the compaction process performs all of them in the same query plan. This query plan scans the data, finds rows to delete and deduplicate, and then encodes and indexes the result as needed.&lt;/p&gt;

&lt;p&gt;Figure 1 shows a query plan that compacts two files. A query plan in the database is usually executed in a streaming/pipelining fashion from the bottom up, and each box in the figure represents an execution operator. First, the data of each file is scanned concurrently. Then tombstones are applied to filter deleted data. Next, the data is sorted on the primary key (aka deduplication key), a set of columns, before going through the deduplication step, which applies a merge algorithm to eliminate duplicates on the primary key. The output is then encoded and indexed if needed and stored back in one compacted file. When the compacted data is stored, the metadata of File 1 and File 2 stored in the database catalog can be updated to point to the newly compacted data file, and then File 1 and File 2 can be safely removed. The task of removing files after they are compacted is usually performed by the database’s garbage collector, which is beyond the scope of this article.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/24kV45RnvFxoqfd7L8vZLL/1f4c6efe3522d1d4dad165d50848b80b/Data_Compaction.png" alt="Data Compaction" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 1: The process of compacting two files&lt;/figcaption&gt;
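
&lt;p&gt;The pipeline in Figure 1 can be sketched as ordinary function composition: scan both files, filter tombstoned rows, sort on the primary key, then merge duplicates. This is a toy Python illustration of the plan’s shape, not the streaming operators a real engine uses; the row layout and tombstone ids are made up for the example.&lt;/p&gt;

```python
def compact(file1, file2, tombstones, primary_key):
    """Toy version of the Figure 1 plan: scan, filter deletes, sort, deduplicate."""
    scanned = file1 + file2                                     # Scan both files
    alive = [r for r in scanned if r["id"] not in tombstones]   # Filter Deleted Data
    alive.sort(key=lambda r: tuple(r[c] for c in primary_key))  # Sort on primary key
    merged = {}
    for r in alive:  # Deduplicate and Merge: the last row per key wins
        merged[tuple(r[c] for c in primary_key)] = r
    return list(merged.values())  # would be encoded and written as one file

file1 = [{"id": 1, "host": "a", "time": 1, "v": 10},
         {"id": 2, "host": "a", "time": 2, "v": 11}]
file2 = [{"id": 3, "host": "a", "time": 1, "v": 12},  # duplicates file1's (a, 1)
         {"id": 4, "host": "b", "time": 1, "v": 20}]
out = compact(file1, file2, tombstones={2}, primary_key=["host", "time"])
print(len(out))  # 2: id 2 is tombstoned, and (a, 1) is deduplicated to v=12
```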

&lt;p&gt;Even though the compaction plan in Figure 1 combines all three tasks in one scan of the data and avoids reading the same set of data three times, the plan operators such as filter and sort are still not cheap. Let us see whether we can avoid or optimize these operators further.&lt;/p&gt;

&lt;h3 id="optimized-compaction-plan"&gt;Optimized compaction plan&lt;/h3&gt;

&lt;p&gt;Figure 2 shows the optimized version of the plan in Figure 1. There are two major changes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The operator Filter Deleted Data is pushed into the Scan operator. This is an effective  &lt;a href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/"&gt;predicate-push-down&lt;/a&gt;  way to filter data while scanning.&lt;/li&gt;
  &lt;li&gt;We no longer need the Sort operator because the input data files are already sorted on the primary key during data ingestion. The Deduplicate &amp;amp; Merge operator is implemented to keep its output data sorted on the same key as its inputs. Thus, the compacting data is also sorted on the primary key for future compaction if needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5P8VJTs3YDJKJ6DbYABXwz/acbca08685f59a3cc16ad1c573599021/Optimized_process_of_compacting_two_sorted_files.png" alt="Optimized process of compacting two sorted files" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 2: Optimized process of compacting two sorted files&lt;/figcaption&gt;

&lt;p&gt;Note that, if the two input files contain data of different columns, which is common in some databases such as InfluxDB, we will need to keep their sort order compatible to avoid doing a re-sort. For example, let’s say the primary key contains columns a, b, c, d, but File 1 includes only columns a, c, d (as well as other columns that are not a part of the primary key) and is sorted on a, c, d. If the data of File 2 is ingested after File 1 and includes columns a, b, c, d, then its sort order must be compatible with File 1’s sort order a, c, d. This means column b could be placed anywhere in the sort order, but c must be placed after a and d must be placed after c. For implementation consistency, the new column, b, could always be added as the last column in the sort order. Thus the sort order of File 2 would be a, c, d, b.&lt;/p&gt;
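
&lt;p&gt;The rule described above, appending any new primary-key columns to the end of the existing sort order, can be written down in a few lines. A hypothetical sketch:&lt;/p&gt;

```python
def compatible_sort_order(existing_order, new_file_columns):
    """Extend an existing sort order with a new file's key columns.

    Columns already in the order keep their relative positions; columns the
    new file adds are appended at the end, as in the a, c, d plus b example.
    """
    order = list(existing_order)
    for col in new_file_columns:
        if col not in order:
            order.append(col)  # a new column always goes last, for consistency
    return order

# File 1 was sorted on a, c, d; File 2 adds column b to the primary key.
print(compatible_sort_order(["a", "c", "d"], ["a", "b", "c", "d"]))
# ['a', 'c', 'd', 'b']
```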

&lt;p&gt;Another reason to keep the data sorted is that, in a column-stored format such as Parquet and ORC, encoding works well with sorted data. For the common  &lt;a href="https://en.wikipedia.org/wiki/Run-length_encoding"&gt;RLE encoding&lt;/a&gt;, the lower the cardinality (i.e., the lower the number of distinct values), the better the encoding. Hence, putting the lower-cardinality columns first in the sort order of the primary key will not only help compress data more on disk but more importantly help the query plan to execute faster. This is because the data is kept encoded during execution, as described in  &lt;a href="http://paperhub.s3.amazonaws.com/b96c89e3bc63ebb074728bb776b72e23.pdf"&gt;this paper on materialization strategies&lt;/a&gt;.&lt;/p&gt;
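
&lt;p&gt;The effect of sorting on RLE is easy to demonstrate: a run-length encoder emits one (value, count) pair per run, so a sorted low-cardinality column collapses to very few pairs. A minimal sketch:&lt;/p&gt;

```python
from itertools import groupby

def rle(values):
    """Run-length encode a column into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

column = ["boston", "denver", "boston", "denver", "boston", "denver"]
print(len(rle(column)))          # 6 runs: unsorted data barely compresses
print(len(rle(sorted(column))))  # 2 runs: sorting first makes RLE effective
```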

&lt;h3 id="compaction-levels"&gt;Compaction levels&lt;/h3&gt;

&lt;p&gt;To avoid the expensive deduplication operation, we want to manage the data files in a way that we know whether they potentially share duplicate data with other files or not. This can be done by using the technique of data overlapping. To simplify the examples of the rest of this article, we will assume that the data sets are time series in which data overlapping means that their data overlap on time. However, the overlap technique could be defined on non-time series data, too.&lt;/p&gt;

&lt;p&gt;One of the strategies to avoid recompacting well-compacted files is to define levels for the files. Level 0 represents newly ingested small files and Level 1 represents compacted, non-overlapping files. Figure 3 shows an example of files and their levels before and after the first and second rounds of compaction. Before any compaction, all of the files are Level 0 and they potentially overlap in time in arbitrary ways. After the first compaction, many small Level 0 files have been compacted into two large, non-overlapped Level 1 files. In the meantime (remember this is a background process), more small Level 0 files have been loaded in, and these kick-start a second round of compaction that compacts the newly ingested Level 0 files into the second Level 1 file. Given our strategy to keep Level 1 files always non-overlapped, we do not need to recompact Level 1 files if they do not overlap with any newly ingested Level 0 files.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7KbaLRloQqOkBjV07oq9fL/e430a9990e30d76d4ba75be625db5296/Ingested_and_compacted_files_after_2_times_of_compaction.png" alt="Ingested and compacted files after 2 times of compaction" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 3: Ingested and compacted files after two rounds of compaction&lt;/figcaption&gt;

&lt;p&gt;If we want to support different file sizes, more compaction levels (2, 3, 4, etc.) could be added. Note that, while files of different levels may overlap, no file should overlap with another file of the same level.&lt;/p&gt;
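&lt;p&gt;With each file carrying min/max time statistics, deciding which Level 1 files must be recompacted reduces to an interval-overlap check. A minimal sketch, with hypothetical file metadata:&lt;/p&gt;

```python
def overlaps(f1, f2):
    """Two files overlap if their [min_time, max_time] ranges intersect."""
    return f1["min_time"] <= f2["max_time"] and f2["min_time"] <= f1["max_time"]

def files_to_recompact(level1_files, level0_files):
    """A Level 1 file needs recompaction only if a new Level 0 file overlaps it."""
    return [f1 for f1 in level1_files
            if any(overlaps(f1, f0) for f0 in level0_files)]

level1 = [{"name": "L1-a", "min_time": 0,  "max_time": 10},
          {"name": "L1-b", "min_time": 11, "max_time": 20}]
level0 = [{"name": "L0-x", "min_time": 15, "max_time": 25}]

print([f["name"] for f in files_to_recompact(level1, level0)])  # ['L1-b']
```

&lt;p&gt;Here "L1-a" is left alone: it overlaps no new data, so its contents are guaranteed duplicate-free and never need recompacting.&lt;/p&gt;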

&lt;p&gt;We should avoid deduplication as much as possible, because the deduplication operator is expensive. It is especially expensive when the primary key includes many columns that must be kept sorted. Building fast and memory-efficient multi-column sorts is critically important. Some common techniques for doing so are described &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/"&gt;here&lt;/a&gt; and &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-2/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="data-querying"&gt;Data querying&lt;/h2&gt;

&lt;p&gt;The system that supports data compaction needs to know how to handle a mixture of compacted and not-yet-compacted data. Figure 4 illustrates three files that a query needs to read. File 1 and File 2 are Level 1 files. File 3 is a Level 0 file that overlaps with File 2.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/K4cSW8QDKpd7Nn87HSiYP/cc99190735da51eca9b3b78d21d92584/Data_Querying.png" alt="Data Querying" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 4: Three files that a query needs to read&lt;/figcaption&gt;

&lt;p&gt;Figure 5 illustrates a query plan that scans those three files. Because File 2 and File 3 overlap, they need to go through the Deduplicate &amp;amp; Merge operator. File 1 does not overlap with any file and only needs to be unioned with the output of the deduplication. All unioned data then flows through the usual operators in the rest of the query plan. As we can see, the more non-overlapping files compaction can produce as pre-query processing, the less deduplication work the query has to perform.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5fzI63PhMJqJcNdC6ZL4GB/dffab82c240c1888607be1639889c66e/Query_plan.png" alt="Query plan" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 5: Query plan that reads two overlapped files and one non-overlapped one&lt;/figcaption&gt;
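&lt;p&gt;The shape of that plan can be sketched as follows. The row layout and the “later file wins” tie-breaking are assumptions for illustration, not the engine’s actual operators:&lt;/p&gt;

```python
def dedup_merge(*files):
    """Merge overlapping files, keeping one row per primary key (later file wins)."""
    merged = {}
    for rows in files:
        for row in rows:
            merged[row["key"]] = row
    return sorted(merged.values(), key=lambda r: r["key"])

file1 = [{"key": 1, "val": "x"}]                             # Level 1, no overlap
file2 = [{"key": 2, "val": "old"}, {"key": 3, "val": "a"}]   # Level 1
file3 = [{"key": 2, "val": "new"}]                           # Level 0, overlaps file2

# Only the overlapping files pay the deduplication cost;
# File 1 is simply unioned with the deduplicated output.
result = file1 + dedup_merge(file2, file3)
print([(r["key"], r["val"]) for r in result])  # [(1, 'x'), (2, 'new'), (3, 'a')]
```

&lt;p&gt;The key point the sketch shows: only File 2 and File 3 pass through the merge, so the fewer overlapping files exist at query time, the cheaper the scan.&lt;/p&gt;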

&lt;h2 id="isolated-and-hidden-compactors"&gt;Isolated and hidden compactors&lt;/h2&gt;

&lt;p&gt;Since data compaction consists only of post-ingestion and pre-query background tasks, we can perform them on a completely hidden and isolated server called a compactor. More specifically, data ingestion, querying, and compaction can be handled by three respective sets of servers: ingestors, queriers, and compactors, which do not share resources at all. They only need to connect to the same catalog and storage (often cloud-based object storage), and follow the same protocol to read, write, and organize data.&lt;/p&gt;

&lt;p&gt;Because a compactor does not share resources with other database servers, it can be implemented to handle compacting many tables (or even many  &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;partitions&lt;/a&gt;  of a table) concurrently. In addition, if there are many tables and data files to compact, several compactors can be provisioned to independently compact these different tables or partitions in parallel.&lt;/p&gt;

&lt;p&gt;Furthermore, if compaction requires significantly fewer resources than ingestion or querying, this separation of servers will improve the efficiency of the system. That is, the system could draw on many ingestors and queriers to handle large ingest workloads and queries in parallel, while needing only one compactor to handle all of the background post-ingestion and pre-query work. Similarly, if compaction needs far more resources, a system of many compactors, one ingestor, and one querier could be provisioned to meet the demand.&lt;/p&gt;

&lt;p&gt;A well-known challenge in databases is how to manage the resources of their servers — the ingestors, queriers, and compactors — to maximize their utilization of resources (CPU and memory) while never hitting out-of-memory incidents. It is a large topic and deserves its own blog post.&lt;/p&gt;

&lt;p&gt;Compaction is a critical background task that enables low latency for data ingestion and high performance for queries. The use of shared, cloud-based object storage has allowed database systems to leverage multiple servers to handle data ingestion, querying, and compacting workloads independently. For more information about the implementation of such a system, check out  &lt;a href="https://github.com/influxdata/influxdb"&gt;InfluxDB IOx&lt;/a&gt;. Other related techniques needed to design the system can be found in our companion articles on  &lt;a href="https://www.infoworld.com/article/3656915/scaling-throughput-and-performance-in-a-sharding-database-system.html"&gt;sharding&lt;/a&gt;  and  &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;partitioning&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Mon, 27 Mar 2023 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/compactor-hidden-engine-database-performance/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/compactor-hidden-engine-database-performance/</guid>
      <category>Developer</category>
      <author>Paul Dix, Nga Tran (InfluxData)</author>
    </item>
    <item>
      <title>Using Deduplication for Eventually Consistent Transactions</title>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published in &lt;a href="https://www.infoworld.com/article/3683915/using-deduplication-for-eventually-consistent-transactions.html"&gt;InfoWorld&lt;/a&gt; and is reposted here with permission.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deduplication is an effective alternative to transactions for eventually consistent use cases of a distributed database. Here’s why.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a distributed database is complicated and needs to consider many factors. Previously, I discussed two important techniques,  &lt;a href="https://www.infoworld.com/article/3656915/scaling-throughput-and-performance-in-a-sharding-database-system.html"&gt;sharding&lt;/a&gt;  and  &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;partitioning&lt;/a&gt;, for gaining greater throughput and performance from databases. In this post, I will discuss another important technique, deduplication, that can be used to replace transactions for eventually consistent use cases with defined primary keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/time-series-database/"&gt;Time series databases&lt;/a&gt;  such as InfluxDB provide ease of use for clients and accept ingesting the same data more than once. For example,  &lt;a href="https://www.influxdata.com/glossary/edge-computing/"&gt;edge devices&lt;/a&gt;  can just send their data on reconnection without having to remember which parts were successfully transmitted previously. To return correct results in such scenarios, time series databases often apply deduplication to arrive at an eventually consistent view of the  &lt;a href="https://www.influxdata.com/what-is-time-series-data/"&gt;data&lt;/a&gt;. For classic transactional systems, the deduplication technique may not be obviously applicable but it actually is. Let us step through some examples to understand how this works.&lt;/p&gt;

&lt;h2 id="understanding-transactions"&gt;Understanding transactions&lt;/h2&gt;

&lt;p&gt;Data inserts and updates are usually performed in an atomic commit, an operation that applies a set of distinct changes as a single operation. The changes either all succeed or are all aborted; there is no middle ground. An atomic commit in a database is called a transaction.&lt;/p&gt;

&lt;p&gt;Implementing a transaction needs to include recovery activities that redo and/or undo changes to ensure the transaction is either completed or completely aborted in case of incidents in the middle of the transaction. A typical example of a transaction is a money transfer between two accounts, in which either money is withdrawn from one account and deposited to another account successfully or no money changes hands at all.&lt;/p&gt;

&lt;p&gt;In a distributed database, implementing transactions is even more complicated due to the need to communicate between nodes and tolerate various communication problems.  &lt;a href="https://en.wikipedia.org/wiki/Paxos_(computer_science)"&gt;Paxos&lt;/a&gt;  and  &lt;a href="https://raft.github.io/raft.pdf"&gt;Raft&lt;/a&gt;  are common techniques used to implement transactions in distributed systems and are well known for their complexity.&lt;/p&gt;

&lt;p&gt;Figure 1 shows an example of a money transfer system that uses a transactional database. When a customer uses a bank system to transfer $100 from account A to account B, the bank initiates a transferring job that starts a transaction of two changes: withdraw $100 from A and deposit $100 to B. If the two changes both succeed, the process will finish and the job is done. If for some reason the withdrawal and/or deposit cannot be performed, all changes in the system will be aborted and a signal will be sent back to the job telling it to restart the transaction. A and B only see the withdrawal and deposit respectively if the process is completed successfully. Otherwise, there will be no changes to their accounts.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2M3g8oSy88t4Tgd1AsJvUw/b9dbb5c02bb6dd9113e475286d531563/Transactional_flow.png" alt="Transactional flow" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 1. Transactional flow.&lt;/figcaption&gt;

&lt;h2 id="non-transactional-process"&gt;Non-transactional process&lt;/h2&gt;

&lt;p&gt;Clearly, the transactional process is complicated to build and maintain. However, the system can be simplified as illustrated in Figure 2. Here, in the “non-transactional process,” the job also issues a withdrawal and a deposit. If the two changes succeed, the job completes. If neither or only one of the two changes succeeds, or if an error or timeout happens, the data will be in a “middle ground” state and the job will be asked to repeat the withdrawal and deposit.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4geTCVolz1bPyIMFQZxcs8/ef3ad7751418a7f79984ce961100d477/Non-transactional_flow.png" alt="Non-transactional flow" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 2. Non-transactional flow.&lt;/figcaption&gt;

&lt;p&gt;The data outcomes in the “middle ground” state can differ across restarts of the same transfer, but they are acceptable as long as the correct finished state eventually happens. Let us go over an example to show these outcomes and explain why they are acceptable. Table 1 shows the two expected changes if the transaction is successful. Each change includes four fields:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;AccountID&lt;/strong&gt;  that uniquely identifies an account.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Activity&lt;/strong&gt;  that is either a withdrawal or a deposit.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Amount&lt;/strong&gt;  that is the amount of money to withdraw or deposit.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;BankJobID&lt;/strong&gt;  that uniquely identifies a job in a system.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Table 1: Two changes of the money transfer transaction.&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;AccountID&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Activity&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Amount&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;BankJobID&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;

&lt;p&gt;At each repetition of issuing the withdrawal and deposit illustrated in Figure 2, there are four possible outcomes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;No changes.&lt;/li&gt;
  &lt;li&gt;Only A is withdrawn.&lt;/li&gt;
  &lt;li&gt;Only B is deposited.&lt;/li&gt;
  &lt;li&gt;Both A is withdrawn and B is deposited.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To continue our example, let us say it takes four tries before the job succeeds and an acknowledgement of success is sent. The first try produces “only B is deposited,” hence the system has only one change as shown in Table 2. The second try produces nothing. The third try produces “only A is withdrawn,” hence the system now has two rows as shown in Table 3. The fourth try produces “both A is withdrawn and B is deposited,” hence the data in the finished state looks like that shown in Table 4.&lt;/p&gt;

&lt;p&gt;Table 2: Data in the system after the first and second tries.&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;AccountID&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Activity&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Amount&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;BankJobID&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;

&lt;p&gt;Table 3: Data in the system after the third try.&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;b&gt;AccountID&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Activity&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;Amount&lt;/b&gt;&lt;/td&gt;
&lt;td&gt;&lt;b&gt;BankJobID&lt;/b&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;

&lt;p&gt;Table 4: Data in the system after the fourth try, now in the finished state.&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AccountID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Activity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amount&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BankJobID&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;
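&lt;p&gt;The four tries above can be simulated to show how redundant rows accumulate. This is hypothetical code mirroring Tables 2 through 4:&lt;/p&gt;

```python
changes = []  # the rows accumulated in the system

def apply_try(withdraw_ok, deposit_ok):
    """One repetition of the non-transactional job; each side may fail independently."""
    if withdraw_ok:
        changes.append(("A", "Withdrawal", 100, 543))
    if deposit_ok:
        changes.append(("B", "Deposit", 100, 543))

apply_try(False, True)   # first try: only B is deposited (Table 2)
apply_try(False, False)  # second try: no changes
apply_try(True, False)   # third try: only A is withdrawn (Table 3)
apply_try(True, True)    # fourth try: both succeed (Table 4)

print(len(changes))  # 4 redundant rows accumulate, as in Table 4
```

&lt;p&gt;The job stops retrying after the fourth try, but the earlier partial rows remain in the system; handling them is exactly what deduplication, discussed next, is for.&lt;/p&gt;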

&lt;h2 id="data-deduplication-for-eventual-consistency"&gt;Data deduplication for eventual consistency&lt;/h2&gt;

&lt;p&gt;The four-try example above creates three different data sets in the system, as shown in Tables 2, 3, and 4. Why do we say this is acceptable? The answer is that data in the system is allowed to be redundant as long as we can manage it effectively. If we can identify the redundant data and eliminate that data at read time, we will be able to produce the expected result.&lt;/p&gt;

&lt;p&gt;In this example, we say that the combination of AccountID, Activity, and BankJobID uniquely identifies a change and is called a key. If there are many changes associated with the same key, then only one of them is returned during read time. The process to eliminate redundant information is called  &lt;em&gt;deduplication&lt;/em&gt;. Therefore, when we read and deduplicate data from Tables 3 and 4, we will get the same returned values that comprise the expected outcome shown in Table 1.&lt;/p&gt;
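&lt;p&gt;A minimal sketch of read-time deduplication on the key (AccountID, Activity, BankJobID), using the rows of Table 4:&lt;/p&gt;

```python
def deduplicate(rows):
    """Return one row per (AccountID, Activity, BankJobID) key."""
    seen, result = set(), []
    for row in rows:
        key = (row["AccountID"], row["Activity"], row["BankJobID"])
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

table4 = [
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 100, "BankJobID": 543},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 100, "BankJobID": 543},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 100, "BankJobID": 543},
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 100, "BankJobID": 543},
]

print(len(deduplicate(table4)))  # 2 -- the expected outcome of Table 1
```

&lt;p&gt;Running the same function over the rows of Table 3 also yields the two expected changes, which is why the redundant state is harmless.&lt;/p&gt;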

&lt;p&gt;In the case of Table 2, which includes only one change, the returned value will be only a part of the expected outcome of Table 1. This means we do not get strong transactional guarantees, but if we are willing to wait to reconcile the accounts, we will eventually get the expected outcome. In real life, banks do not release transferred money for us to use immediately even if we see it in our account. In other words, the partial change represented by Table 2 is acceptable if the bank makes the transferred money available to use only after a day or two. Because the process of our transaction is repeated until it is successful, a day is more than enough time for the accounts to be reconciled.&lt;/p&gt;

&lt;p&gt;The combination of the non-transactional insert process shown in Figure 2 and data deduplication at read time does not provide the expected results immediately, but eventually the results will be as expected. This is called an &lt;em&gt;eventually consistent system&lt;/em&gt;. By contrast, the transactional system illustrated in Figure 1 always produces consistent results. However, due to the complicated communications required to guarantee that consistency, a transaction takes time to finish, and the number of transactions per second will consequently be limited.&lt;/p&gt;

&lt;h2 id="deduplication-in-practice"&gt;Deduplication in practice&lt;/h2&gt;

&lt;p&gt;Nowadays, most databases implement an update as a delete followed by an insert, to avoid expensive in-place data modification. However, if the system supports deduplication, an update can be done as a simple insert if we add a “Sequence” field to the table to identify the order in which the data entered the system.&lt;/p&gt;

&lt;p&gt;For example, after making the money transfer successfully as shown in Table 5, let’s say we found the amount should be $200 instead. This could be fixed by making a new transfer with the same BankJobID but a higher Sequence number as shown in Table 6. At read time, the deduplication would return only the rows with the highest sequence number. Thus, the rows with amount $100 would never be returned.&lt;/p&gt;

&lt;p&gt;Table 5: Data before the “update”&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AccountID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Activity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amount&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BankJobID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sequence&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;

&lt;p&gt;Table 6: Data after the “update”&lt;/p&gt;

&lt;div class="table-container is-v-centered"&gt;
&lt;table class="table is-bordered"&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AccountID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Activity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Amount&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;BankJobID&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Sequence&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Withdrawal&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Deposit&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;543&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
  &lt;/div&gt;
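&lt;p&gt;Extending the earlier sketch with a Sequence field, deduplication keeps only the row with the highest sequence number for each key. This hypothetical code mirrors Table 6:&lt;/p&gt;

```python
def deduplicate_latest(rows):
    """For each (AccountID, Activity, BankJobID) key, keep the highest-Sequence row."""
    latest = {}
    for row in rows:
        key = (row["AccountID"], row["Activity"], row["BankJobID"])
        if key not in latest or row["Sequence"] > latest[key]["Sequence"]:
            latest[key] = row
    return list(latest.values())

table6 = [
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 100, "BankJobID": 543, "Sequence": 1},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 100, "BankJobID": 543, "Sequence": 1},
    {"AccountID": "A", "Activity": "Withdrawal", "Amount": 200, "BankJobID": 543, "Sequence": 2},
    {"AccountID": "B", "Activity": "Deposit",    "Amount": 200, "BankJobID": 543, "Sequence": 2},
]

for row in deduplicate_latest(table6):
    print(row["AccountID"], row["Amount"])  # only the $200 rows survive
```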

&lt;p&gt;Because deduplication must compare rows to find those with the same key, organizing data properly and implementing the right deduplication algorithms are critical. The common technique is to sort inserted data on its key and use a merge algorithm to find and eliminate duplicates. The details of how data is organized and merged depend on the nature of the data, its size, and the available memory in the system. For example, Apache Arrow implements &lt;a href="https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/"&gt;a multi-column sort merge&lt;/a&gt; that is critical for effective deduplication.&lt;/p&gt;
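&lt;p&gt;A minimal sketch of the sort-then-merge approach, assuming the data fits in memory (real systems merge sorted streams instead):&lt;/p&gt;

```python
def sort_merge_dedup(rows, key_cols):
    """Sort rows on the key, then merge adjacent duplicates in one pass."""
    rows = sorted(rows, key=lambda r: [r[c] for c in key_cols])  # stable sort
    result = []
    for row in rows:
        key = [row[c] for c in key_cols]
        if result and [result[-1][c] for c in key_cols] == key:
            result[-1] = row  # same key as previous row: keep the later one
        else:
            result.append(row)
    return result

rows = [{"k": "b", "v": 1}, {"k": "a", "v": 1}, {"k": "a", "v": 2}]
print(sort_merge_dedup(rows, ["k"]))  # [{'k': 'a', 'v': 2}, {'k': 'b', 'v': 1}]
```

&lt;p&gt;Sorting first means duplicates are always adjacent, so the merge is a single linear scan rather than a pairwise comparison of every row.&lt;/p&gt;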

&lt;p&gt;Performing deduplication at read time increases query latency. To improve query performance, deduplication can instead be done as a background task that removes redundant data ahead of time. Most systems already run background jobs to reorganize data, such as removing data previously marked for deletion. Deduplication fits very well into that model: read data, remove redundant rows, and write the result back.&lt;/p&gt;

&lt;p&gt;In order to avoid sharing CPU and memory resources with data loading and reading, these background jobs are usually performed in a separate server called a compactor, which is another large topic that deserves its own post.&lt;/p&gt;
</description>
      <pubDate>Mon, 13 Mar 2023 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/using-deduplication-eventually-consistent-transactions/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/using-deduplication-eventually-consistent-transactions/</guid>
      <category>Use Cases</category>
      <category>Developer</category>
      <author>Nga Tran (InfluxData)</author>
    </item>
    <item>
      <title>Partitioning for Performance in a Sharding Database System</title>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;This article was originally published in &lt;a href="https://www.infoworld.com/article/3666513/partitioning-for-performance-in-a-sharding-database-system.html"&gt;InfoWorld&lt;/a&gt; and is reposted here with permission.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Partitioning can provide a number of benefits to a sharding system, including faster query execution. Let’s see how it works.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://www.influxdata.com/blog/scaling-throughput-performance-sharding-database-system/"&gt;previous post&lt;/a&gt;, I described a sharding system to scale throughput and performance for query and ingest workloads. In this post, I will introduce another common technique, partitioning, that provides further advantages in performance and management for a sharding database. I will also describe how to handle partitions efficiently for both query and ingest workloads, and how to manage cold (old) partitions where the read requirements are quite different from the hot (recent) partitions.&lt;/p&gt;

&lt;h2 id="sharding-vs-partitioning"&gt;Sharding vs. partitioning&lt;/h2&gt;

&lt;p&gt;Sharding is a way to split data in a distributed database system. Data in each shard does not have to share resources such as CPU or memory, and can be read or written in parallel.&lt;/p&gt;

&lt;p&gt;Figure 1 is an example of a sharding database. Sales data of 50 states of a country is split into four shards, each containing data of 12 or 13 states. By assigning a query node to each shard, a job that reads all 50 states can be split among these four nodes running in parallel, and will run roughly four times faster than a setup in which one node reads all 50 states. More information about shards and their scaling effects on ingest and query workloads can be found in &lt;a href="https://www.influxdata.com/blog/scaling-throughput-performance-sharding-database-system/"&gt;my previous post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5LWHiEKJKNVYJmPVoTnyZ5/0916805ba5981f7cad4ec4175fbfc750/partitioning-effects.jpg" alt="partitioning-effects" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 1: Sales Data is split into four shards, each assigned to a query node.&lt;/figcaption&gt;

&lt;p&gt;Partitioning is a way to split data within each shard into non-overlapping partitions for further parallel handling. This reduces the reading of unnecessary data, and allows for efficiently implementing data retention policies.&lt;/p&gt;

&lt;p&gt;In Figure 2, the data of each shard is partitioned by sales day. If we need to create a report on sales of one specific day such as May 1, 2022, the query nodes only need to read data of their corresponding partitions of 2022.05.01.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2NzA88mtk0aQZELhoQ754N/86bbe76e4274687536be1ea8378597e3/Figure_2-_Sales_data_of_each_shard_is_further_split_into_non-overlapped_day_partitions.jpg" alt="Figure 2- Sales data of each shard is further split into non-overlapped day partitions" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 2: Sales data of each shard is further split into non-overlapped day partitions.&lt;/figcaption&gt;

&lt;p&gt;The rest of this post will focus on the effects of partitioning. We’ll see how to manage partitions efficiently for both query and ingest workloads on both hot and cold data.&lt;/p&gt;

&lt;h2 id="partitioning-effects"&gt;Partitioning effects&lt;/h2&gt;

&lt;p&gt;The three most common benefits of data partitioning are data pruning, intra-node parallelism, and fast deletion.&lt;/p&gt;

&lt;h3 id="data-pruning"&gt;Data pruning&lt;/h3&gt;

&lt;p&gt;A database system may contain several years of data, but most queries need to read only recent data (e.g., “How many orders have been placed in the last three days?”). Partitioning data into non-overlapping partitions, as illustrated in Figure 2, makes it easy to skip entire out-of-bound partitions and read and process only relevant and very small sets of data to return results quickly.&lt;/p&gt;
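&lt;p&gt;Partition pruning can be as simple as filtering partition metadata before touching any files. A toy sketch with hypothetical file names:&lt;/p&gt;

```python
import datetime as dt

partitions = {
    dt.date(2022, 5, 1): "sales_2022.05.01.parquet",
    dt.date(2022, 5, 2): "sales_2022.05.02.parquet",
    dt.date(2022, 5, 3): "sales_2022.05.03.parquet",
}

def prune(partitions, start, end):
    """Return only the partition files whose day falls inside the query range."""
    return [f for day, f in partitions.items() if start <= day <= end]

# "Sales of the last two days" touches just the matching partitions;
# everything else is skipped without any I/O.
print(prune(partitions, dt.date(2022, 5, 2), dt.date(2022, 5, 3)))
```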

&lt;h3 id="intra-node-parallelism"&gt;Intra-node parallelism&lt;/h3&gt;

&lt;p&gt;Multithreaded processing and streaming data are critical in a database system to fully use available CPU and memory and obtain the best performance possible. Partitioning data into small partitions makes it easier to implement a multithreaded engine that executes one thread per partition. For each partition, more threads can be spawned to handle data within that partition. Knowing partition statistics such as size and row count will help allocate the optimal amount of CPU and memory for specific partitions.&lt;/p&gt;
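&lt;p&gt;A thread-per-partition engine can be sketched with a thread pool. This is a toy illustration, with a sum standing in for real per-partition work:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

partitions = {"2022.05.01": [3, 5], "2022.05.02": [7], "2022.05.03": [1, 4]}

def process(rows):
    """Stand-in for the per-partition work a real engine would do."""
    return sum(rows)

# One task per partition; partial results are combined afterwards.
with ThreadPoolExecutor() as pool:
    totals = list(pool.map(process, partitions.values()))

print(sum(totals))  # 20
```

&lt;p&gt;In a real engine each partition task could itself spawn more threads, sized by the partition’s row count and byte size statistics.&lt;/p&gt;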

&lt;h3 id="fast-data-deletion"&gt;Fast data deletion&lt;/h3&gt;

&lt;p&gt;Many organizations keep only recent data (e.g., data of the last three months) and want to remove old data ASAP. By partitioning data on non-overlapping time windows, removing old partitions becomes as simple as deleting files, without the need to reorganize data and interrupt other query or ingest activities. If all data must be kept, a section later in this post will describe how to manage recent and old data differently to ensure the systems provide great performance in all cases.&lt;/p&gt;

&lt;h2 id="storing-and-managing-partitions"&gt;Storing and managing partitions&lt;/h2&gt;

&lt;h3 id="optimizing-for-query-workloads"&gt;Optimizing for query workloads&lt;/h3&gt;

&lt;p&gt;A partition already contains a small set of data, so we do not want to store a partition in many smaller files (or chunks, in the case of an in-memory database). A partition should consist of just one or a few files.&lt;/p&gt;

&lt;p&gt;Minimizing the number of files in a partition has two important benefits: it reduces I/O operations when reading data to execute a query, and it improves data encoding and compression. Better encoding in turn lowers storage costs and, more importantly, speeds up query execution by reading less data.&lt;/p&gt;

&lt;h3 id="optimizing-for-ingest-workloads"&gt;Optimizing for ingest workloads&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Naive Ingestion.&lt;/strong&gt;  To keep the data of a partition in a file for the benefits of reading optimization noted above, every time a set of data is ingested, it must be parsed and split into the right partitions, then merged into the existing file of its corresponding partition, as illustrated in Figure 3.&lt;/p&gt;

&lt;p&gt;The process of merging new data with existing data often takes time because of expensive I/O and the cost of mixing and encoding the partition’s data. This leads to long latency both for the response telling the client that its data was successfully ingested, and for queries of the newly ingested data, which will not be immediately available in storage.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7syyxEEYP1WaWhaZynSPRC/378728a7d01f6e839897f488b3145ca6/Figure_3-_Naive_ingestion_in_which_new_data_is_merged_into_the_same_file_as_existing_data_immediately.jpg" alt="Figure 3- Naive ingestion in which new data is merged into the same file as existing data immediately" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 3: Naive ingestion in which new data is merged into the same file as existing data immediately.&lt;/figcaption&gt;

&lt;p&gt;&lt;strong&gt;Low latency ingestion.&lt;/strong&gt;  To keep the latency of each ingestion low, we can split the process into two steps: ingestion and compaction.&lt;/p&gt;

&lt;h2 id="ingestion"&gt;Ingestion&lt;/h2&gt;

&lt;p&gt;During the ingestion step, ingested data is split and written to its own file as shown in Figure 4. It is not merged with the existing data of the partition. As soon as the ingested data is durably persisted, the ingest client receives a success signal and the newly ingested file becomes available for querying.&lt;/p&gt;

&lt;p&gt;If the ingest rate is high, many small files will accumulate in the partition, as illustrated in Figure 5. At this stage, a query that needs data from a partition must read all of the files of that partition. This of course is not ideal for query performance. The compaction step, described below, keeps this accumulation of files to a minimum.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2xnnMznAGQBLiNR6UZo0lO/42a01ad86d59f7ab74e57e3b046a666e/Figure_4-_Newly_ingested_data_is_written_into_a_new_file.jpg" alt="Figure 4- Newly ingested data is written into a new file" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 4: Newly ingested data is written into a new file.&lt;/figcaption&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2L7kG6HJ04DIVCSSAHypIL/c79a9cf90eb9b7ca74b4afd5f8309f4a/Figure_5-_Under_a_high_ingest_workload_a_partition_will_accumulate_many_files.jpg" alt="Figure 5- Under a high ingest workload a partition will accumulate many files" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 5: Under a high ingest workload a partition will accumulate many files.&lt;/figcaption&gt;

&lt;h2 id="compaction"&gt;Compaction&lt;/h2&gt;

&lt;p&gt;Compaction is the process of merging the files of a partition into one or a few files for better query performance and compression. For example, Figure 6 shows all of the files in partition 2022.05.01 being merged into one file, and all of the files of partition 2022.05.02 being merged into two files, each smaller than 100MB.&lt;/p&gt;

&lt;p&gt;The decisions regarding how often to compact and the maximum size of compacted files will be different for different systems, but the common goal is to keep the query performance high by reducing I/Os (i.e., the number of files) and having the files large enough to effectively compress.&lt;/p&gt;
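&lt;p&gt;A compaction pass over one partition might look like the following sketch. This is hypothetical Python: real systems merge sorted, encoded files such as Parquet rather than in-memory row lists, and the size accounting here is a stand-in for encoded file size.&lt;/p&gt;

```python
def compact(files, max_bytes=100 * 1024 * 1024):
    """Merge a partition's many small files into one or a few compacted
    files, each no larger than max_bytes (100MB here, as in Figure 6)."""
    # Gather and sort all rows so each compacted file covers a contiguous range.
    rows = sorted((row for f in files for row in f), key=lambda r: r["time"])

    compacted, current, current_size = [], [], 0
    for row in rows:
        row_size = len(str(row))       # stand-in for the encoded row size
        if current and current_size + row_size > max_bytes:
            compacted.append(current)  # close out a full compacted file
            current, current_size = [], 0
        current.append(row)
        current_size += row_size
    if current:
        compacted.append(current)
    return compacted
```

&lt;p&gt;A query against the partition now reads one or two large files instead of dozens of small ones, which is exactly the I/O reduction the compaction step is after.&lt;/p&gt;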

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5IaIk1Y9nOp6me2QS64J1E/ad8b57ab7a78dfbfda1c6ccc0ac2aa59/Figure_6-_Compacting_several_files_of_a_partition_into_one_or_few_files.jpg" alt="Figure 6- Compacting several files of a partition into one or few files" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 6: Compacting several files of a partition into one or few files.&lt;/figcaption&gt;

&lt;h2 id="hot-vs-cold-partitions"&gt;Hot vs. cold partitions&lt;/h2&gt;

&lt;p&gt;Partitions that are queried frequently are considered hot partitions, while those that are rarely read are called cold partitions. In databases, hot partitions are usually the partitions containing recent data such as recent sales dates. Cold partitions often contain older data, which are less likely to be read.&lt;/p&gt;

&lt;p&gt;Moreover, as data ages it is usually queried in larger chunks, such as by month or even by year. Here are a few example categories of data ranging from hot to cold:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Hot: Data from the current week.&lt;/li&gt;
  &lt;li&gt;Less hot: Data from previous weeks but in the current month.&lt;/li&gt;
  &lt;li&gt;Cold: Data from previous months but in the current year.&lt;/li&gt;
  &lt;li&gt;Colder: Data from last year and older.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To reduce the ambiguity between hot and cold data, we need to answer two questions. First, how do we quantify hot, less hot, cold, and colder? Second, how can we achieve fewer I/Os when reading cold data? We do not want to read 365 files, each representing a one-day partition of data, just to get last year’s sales revenue.&lt;/p&gt;

&lt;h2 id="hierarchical-partitioning"&gt;Hierarchical partitioning&lt;/h2&gt;

&lt;p&gt;Hierarchical partitioning, illustrated in Figure 7, provides answers to the two questions above. Data for each day of the current week is stored in its own partition. Data from previous weeks of the current month are partitioned by week. Data from prior months in the current year are partitioned by month. Data that is even older is partitioned by year.&lt;/p&gt;

&lt;p&gt;This model can be relaxed by defining an active partition in place of the current date partition. All data arriving after the active partition will be partitioned by date, whereas data before the active partition will be partitioned by week, month, and year. This allows the system to keep as many small recent partitions as necessary. Even though all examples in this post partition data by time, non-time partitioning will work similarly as long as you can define expressions for a partition and their hierarchy.&lt;/p&gt;
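&lt;p&gt;As a sketch, the date-to-partition mapping described above might look like this. This is hypothetical Python, assuming the active partition is a date, weeks start on Monday, and ISO week numbers label the weekly partitions; the post does not prescribe any of these choices.&lt;/p&gt;

```python
from datetime import date, timedelta

def partition_key(d: date, active: date) -> str:
    """Map a date to its hierarchical partition relative to the active
    partition (Figure 7): day, then week, then month, then year."""
    week_start = active - timedelta(days=active.weekday())  # Monday of the active week
    if d >= week_start:
        return d.strftime("%Y.%m.%d")                # current week: daily partitions
    if (d.year, d.month) == (active.year, active.month):
        return f"{d.year}.week{d.isocalendar()[1]}"  # current month: weekly partitions
    if d.year == active.year:
        return d.strftime("%Y.%m")                   # current year: monthly partitions
    return str(d.year)                               # older: yearly partitions
```

&lt;p&gt;With this layout, a query for last year’s revenue prunes down to a single yearly partition instead of 365 daily ones.&lt;/p&gt;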

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5XL2h8Al5mmlLEHiOaNm9K/749ae422542f91b8f7ee7e58e8783914/Hierarchical-partitioning.jpg" alt="Hierarchical-partitioning" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 7: Hierarchical partitioning.&lt;/figcaption&gt;

&lt;p&gt;Hierarchical partitioning reduces the number of partitions in the system, which makes the system easier to manage and reduces the number of partitions that must be read when querying larger, older chunks of data.&lt;/p&gt;

&lt;p&gt;The query process for hierarchical partitioning is the same as for non-hierarchical partitioning, as it applies the same pruning strategy to read only the relevant partitions. The ingestion and compaction processes are a bit more complicated, as organizing the partitions into their defined hierarchy takes extra work.&lt;/p&gt;

&lt;h2 id="aggregate-partitioning"&gt;Aggregate partitioning&lt;/h2&gt;

&lt;p&gt;Many organizations do not want to keep old data, preferring instead to keep aggregations such as the number of orders and total sales of every product each month. This can be supported by aggregating data and partitioning the aggregates by month. However, because aggregate partitions store aggregated data, their schema differs from that of non-aggregated partitions, which leads to extra work for ingesting and querying. There are different ways to manage this cold, aggregated data, but they are large topics suitable for a future post.&lt;/p&gt;
</description>
      <pubDate>Fri, 18 Nov 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/partitioning-performance-sharding-database-system/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/partitioning-performance-sharding-database-system/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Nga Tran (InfluxData)</author>
    </item>
    <item>
      <title>Scaling Throughput and Performance in a Sharding Database System</title>
      <description>&lt;p&gt;&lt;strong&gt;This article was originally published in &lt;a href="https://www.infoworld.com/article/3656915/scaling-throughput-and-performance-in-a-sharding-database-system.html"&gt;InfoWorld&lt;/a&gt; and is reposted here with permission.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Understand the two dimensions of scaling for database query and ingest workloads, and how sharding can make scaling elastic — or not.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Scaling throughput and performance are critical design topics for all distributed databases, and sharding is usually a part of the solution. However, a design that increases throughput does not always help with performance and vice versa. Even when a design supports both, scaling them up and down at the same time is not always easy.&lt;/p&gt;

&lt;p&gt;This post will describe these two types of scaling for both query and ingest workloads, and discuss sharding techniques that make them elastic. Before we dive into the database world, let us first walk through an example of elastic throughput and performance scaling from daily life.&lt;/p&gt;

&lt;h2 id="scaling-effects-in-a-fast-food-restaurant"&gt;Scaling effects in a fast food restaurant&lt;/h2&gt;

&lt;p&gt;Nancy is opening a fast food restaurant and laying out the scenarios to optimize her operational costs on different days of the week. Figure 1 illustrates her business on a quiet day. For the restaurant to be open, two lines must be staffed: drive-thru and walk-in, each requiring one employee. On average, each employee needs six minutes to process an order, so the two employees should be able to cover the restaurant’s expected throughput of 20 customers per hour.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1Vzn0klstQ4weQzf7khspf/7e22c2cd0692c4c8af9cbc141db28969/Figure_1-_The_restaurant_operation_on_a_quiet_day.jpg" alt="Figure 1- The restaurant operation on a quiet day" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 1: The restaurant operation on a quiet day.&lt;/figcaption&gt;

&lt;p&gt;Let’s assume that an order can be processed in parallel by at most two people, one making drinks and the other making food. Nancy’s employees are trained to go and help with the other line if their line is empty. Doubling up on a single line reduces the order processing time to three minutes and helps keep the throughput steady when customers enter the lines at various intervals.&lt;/p&gt;

&lt;p&gt;Figure 2 shows a busier day with around 50% more customers. Adding an employee should cover the 50% &lt;em&gt;increase in throughput&lt;/em&gt;. Nancy requests her team to be flexible:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;If only one customer comes to a line at a time, one person should run between two lines to help reduce the processing time so they will be available to help new customers immediately.&lt;/li&gt;
  &lt;li&gt;If a few customers walk in at the same time, employees should open a new line to help at least two walk-in customers at once, because Nancy knows walk-in customers tend to be happier when their orders are taken immediately but are quite tolerant of the six-minute processing time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1HNDtitF8d0MDo8kwaIK9T/f0a2edfc166859f54383b7ef6c040baa/Figure_2_-_The_operation_that_covers_50_more_customers.jpg" alt="Figure 2 - The operation that covers 50 more customers" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 2: The operation that covers 50% more customers.&lt;/figcaption&gt;

&lt;p&gt;To smoothly handle the busiest days of the year, which draw some 80 customers per hour, Nancy builds a total of four counters: one drive-thru and three walk-ins, as shown in Figure 3. Since adding a third person to help with an order won’t help reduce the order time, she plans to staff up to two employees per counter. A few days a year, when the town holds a big event and closes the street (making the drive-thru inaccessible), Nancy accepts her &lt;em&gt;max throughput&lt;/em&gt; will be 60 customers per hour.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/7l6lgOdoViVmz29yoGRDto/891b0e62e1e354c08f30510b1b4626e5/Figure_3_-_The_operation_on_a_busy_day.jpg" alt="Figure 3 - The operation on a busy day" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 3: The operation on a busy day.&lt;/figcaption&gt;

&lt;p&gt;Nancy’s order handling strategy elastically scales customer &lt;em&gt;throughput&lt;/em&gt; (i.e., scales as needed) while also flexibly speeding up order processing time (i.e., &lt;em&gt;performance&lt;/em&gt;). Important points to notice:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The &lt;em&gt;max performance scaling factor&lt;/em&gt; (max number of employees to help with one order) is two. Nancy &lt;em&gt;cannot change this factor&lt;/em&gt; if she wants to stick with the same food offerings.&lt;/li&gt;
  &lt;li&gt;The &lt;em&gt;max throughput&lt;/em&gt; is 80 customers per hour due to the max number of counters being four. Nancy &lt;em&gt;could change this factor&lt;/em&gt; if she has room to add more counters to her restaurant.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="scaling-effects-in-a-sharding-database-system"&gt;Scaling effects in a sharding database system&lt;/h2&gt;

&lt;p&gt;Similar to the operation at a fast food restaurant, a database system should be built to support elastic scaling of throughput and performance for both query and ingest workloads.&lt;/p&gt;

&lt;h3 id="query-workload"&gt;Query workload&lt;/h3&gt;

&lt;p&gt;Term definition:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Query throughput scaling:&lt;/em&gt; the ability to scale up and down the number of queries executed in a defined amount of time such as a second or a minute.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Query performance scaling:&lt;/em&gt; the ability to make a query run faster or slower.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Elastic scaling:&lt;/em&gt; the ability to scale throughput or performance up and down easily based on traffic or other needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="examples"&gt;Examples&lt;/h3&gt;

&lt;p&gt;Let’s assume our sales data is stored in an accessible location such as a local disk, a remote disk, or cloud storage. Three teams in the company, Reporting, Marketing, and Sales, want to query this data frequently. Our first setup, illustrated in Figure 4, has one query node that receives all queries from all three teams, reads the data, and returns the query results.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2AE42bcwFBFNTDQXai1B4T/b8852dbbe736e6bd68bb787e97fc3b60/Figure_4_-_One_query_node_handles_all_requests.jpg" alt="Figure 4 - One query node handles all requests" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 4: One query node handles all requests.&lt;/figcaption&gt;

&lt;p&gt;At first this setup works well, but as more and more queries are added, the wait time for results becomes quite long. Worse, queries often get lost due to timeouts. To deal with the &lt;em&gt;increasing query throughput requests&lt;/em&gt;, a new setup shown in Figure 5 provides four query nodes. Each node works independently for a different business purpose: one for the Reporting team, one for the Marketing team, one for the Sales team focusing on small customers, and one for the Sales team focusing on large customers.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/6dxV2HvgbWkOwTBxSArggQ/7c2e447b19a3677cae83b437c22df45d/Figure_5_-_Add_more_query_nodes.jpg" alt="Figure 5 - Add more query nodes" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 5: Add more query nodes, one for each business purpose, to handle more throughput.&lt;/figcaption&gt;

&lt;p&gt;The new setup keeps up well with the high query volume, and no queries get lost. However, for some time-sensitive queries that the teams must react to immediately, waiting several minutes for results is not good enough. To solve this problem, the data is split equally into four shards, each containing the data of 12 or 13 states, as shown in Figure 6. Because the Reporting team runs the most latency-sensitive queries, a query cluster of four nodes is built for it to &lt;em&gt;perform queries four times faster&lt;/em&gt;. The Marketing team is still happy with its single-node setup, so data from all shards is directed to that one node.&lt;/p&gt;
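&lt;p&gt;The split into four shards can be sketched as a simple deterministic assignment. This is hypothetical Python; the post does not specify the actual sharding scheme, only that each of the four shards holds 12 or 13 states.&lt;/p&gt;

```python
def assign_shards(states, num_shards=4):
    """Split a list of states into num_shards balanced shards.
    With 50 states and 4 shards, each shard holds 12 or 13 states
    (Figure 6); sorting first keeps the assignment deterministic."""
    shards = [[] for _ in range(num_shards)]
    for i, state in enumerate(sorted(states)):
        shards[i % num_shards].append(state)  # round-robin over sorted names
    return shards
```

&lt;p&gt;Each query node in the four-node Reporting cluster then reads exactly one shard, which is what lets the cluster answer a query roughly four times faster.&lt;/p&gt;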

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4Wlrzj50k2P6nz23U80vLC/b37799385ad726ce3248cdb409d3d464/Figure_6_-_Shard_data_and_add_Query_Nodes_to_handle_sharded_data_in_parallel.jpg" alt="Figure 6 - Shard data and add Query Nodes to handle sharded data in parallel" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 6: Shard data and add Query Nodes to handle sharded data in parallel.&lt;/figcaption&gt;

&lt;p&gt;The Sales team does not deal with time-sensitive queries, but as the team grows, its number of query requests keeps increasing. Therefore, the Sales team should &lt;em&gt;take advantage of performance scaling to improve throughput&lt;/em&gt; and avoid reaching max throughput in the near future. This is done by replacing its two independent query nodes with two independent query clusters, one with four nodes and the other with two, based on their respective growth.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2q3LfAqNB9q9wRP52qEt9T/3b6f8ea16b8c168d0d2121205cb9ae83/Figure_7_-_Adjust_the_size_of_the_Reporting_cluster.jpg" alt="Figure 7 - Adjust the size of the Reporting cluster" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 7: Adjust the size of the Reporting cluster based on the Reporting team’s performance needs and shut down a Sales cluster based on the Sales team’s throughput needs.&lt;/figcaption&gt;

&lt;p&gt;During times of the year when the Reporting team does not need to handle time-sensitive queries, two query nodes of its cluster are temporarily removed to save resources, as shown in Figure 7. Similarly, when the Sales team does not need to handle high throughput workloads, it temporarily removes one of its clusters and directs all queries to the remaining one.&lt;/p&gt;

&lt;p&gt;The teams are happy with their elastic scaling setup. The current setup allows all teams to scale throughput up and down easily by adding or removing query clusters. However, the Reporting team notices that its query performance stops improving beyond four query nodes; because there are only four shards, adding nodes beyond that limit doesn’t help. Thus we can say that the Reporting team’s &lt;em&gt;query throughput scaling is fully elastic, but its query performance scaling is only elastic to the scale factor of four&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The only way the Reporting team can scale query performance further is to split data into more and smaller shards, which is not trivial. We’ll discuss this next.&lt;/p&gt;

&lt;h3 id="ingest-workload"&gt;Ingest workload&lt;/h3&gt;

&lt;p&gt;Term definition:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Ingest throughput scaling:&lt;/em&gt;  the ability to scale up and down the amount of ingested data in a defined amount of time such as a second or a minute.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Ingest performance scaling:&lt;/em&gt;  the ability to increase or decrease the speed of ingesting a set of data into the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id="examples-1"&gt;Examples&lt;/h3&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/14AMlEbOWWQanUwoiadFxN/a7aee0b87b273c0dd6268e4839c54e6d/Figure_8_-_One_ingest_node_handles_all_ingested_data.jpg" height="608" width="500" alt="Figure 8 - One ingest node handles all ingested data" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 8: One ingest node handles all ingested data.&lt;/figcaption&gt;

&lt;p&gt;In order to have four shards of sales data as described above, the ingest data must be sharded at load time. Figure 8 illustrates an ingest node that takes all ingest requests, shards them accordingly, handles pre-ingest work, and then saves the data to the right shard.&lt;/p&gt;

&lt;p&gt;However, as the ingest data increases, one ingest node can no longer keep up with the requests, and ingest data gets lost. Thus a new setup, shown in Figure 9, adds more ingest nodes, each handling data for a different set of write requests to support higher ingest throughput.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5OpPfBo9JfPo94G3JNxP9/25914eeafa60e18faccf83befa0a8185/Figure_9_-_Add_ingest_nodes.jpg" alt="Figure 9 - Add ingest nodes" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 9: Add ingest nodes, each handling a subset of write requests, to support more throughput.&lt;/figcaption&gt;

&lt;p&gt;Even though the new setup handles a higher ingest volume and no data gets lost, the increasing demand for lower ingest latency makes the teams change the setup further. The ingest nodes that need lower ingest latency are converted into ingest clusters, shown in Figure 10.&lt;/p&gt;

&lt;p&gt;Here each cluster includes a shard node responsible for sharding the incoming data, plus additional ingest nodes. Each ingest node processes the pre-ingest work for its assigned shards and sends the data to the right shard storage. The performance of Ingest Cluster 2 is twice that of Ingest Node 1, as its latency is now around half that of the previous setup. Ingest Cluster 3 is around four times as fast as Ingest Node 1.&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/5IguGmobKaoR8klufFlyy/6e63aa6db99fd2dce3a52d1b302026df/Figure_10_-_Convert_ingest_nodes_to_ingest_clusters_to_speed_up_data_ingest.jpg" alt="Figure 10 - Convert ingest nodes to ingest clusters to speed up data ingest" /&gt;&lt;/p&gt;
&lt;figcaption&gt;Figure 10: Convert ingest nodes to ingest clusters to speed up data ingest.&lt;/figcaption&gt;

&lt;p&gt;During times of the year when the latency is not critical, a couple of nodes are temporarily removed from Ingest Cluster 3 to save resources. When ingest throughput is minimal, Ingest Cluster 2 and Ingest Cluster 3 are even shut down and all write requests are directed to Ingest Node 1 for ingesting.&lt;/p&gt;

&lt;p&gt;As with their query workloads, the Reporting, Marketing, and Sales teams are very happy with the elastic scaling setup for their ingest workloads. However, they notice that even though ingest throughput scales up and down easily by adding and removing ingest clusters, when Ingest Cluster 3 has reached its scale factor of four, adding more ingest nodes to its cluster doesn’t improve performance. Thus we can say that its ingest throughput scaling is fully elastic, but its  &lt;em&gt;ingest performance scaling is only elastic to the scale factor of four&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id="preparing-for-future-elasticity"&gt;Preparing for future elasticity&lt;/h2&gt;

&lt;p&gt;As demonstrated in the examples, the query and ingest throughput scaling of the setups in Figure 6 and Figure 10 is fully elastic, but their performance scaling is only elastic to the scale factor of four. To support a higher &lt;em&gt;performance scaling factor&lt;/em&gt;, the data should be split into smaller shards, e.g., one shard per state. However, when we then run with a scale factor smaller than the number of shards, many shards must be mapped to one query node in the query cluster. Similarly, one ingest node must handle the data of many shards.&lt;/p&gt;
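&lt;p&gt;One way to keep small shards while running fewer nodes is a many-to-one shard-to-node mapping: resizing the cluster only changes the mapping, not the data layout. The sketch below is hypothetical Python under that assumption.&lt;/p&gt;

```python
def node_for_shard(shard_id: int, num_nodes: int) -> int:
    """Map a shard to a node. With 50 one-per-state shards, a 4-node
    cluster gives each node 12 or 13 shards; growing the cluster toward
    50 nodes raises the effective performance scale factor."""
    return shard_id % num_nodes

def cluster_assignment(num_shards: int, num_nodes: int):
    """Group shards by the node that serves them."""
    assignment = {n: [] for n in range(num_nodes)}
    for shard in range(num_shards):
        assignment[node_for_shard(shard, num_nodes)].append(shard)
    return assignment
```

&lt;p&gt;Scaling the cluster from 4 nodes to 50 takes each node from 12 or 13 shards down to one, at which point the performance scale factor is exhausted.&lt;/p&gt;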

&lt;p&gt;A limitation of performance scaling is that increasing the scale factor (i.e., splitting data into smaller shards) does not guarantee the system will scale as expected, due to the overhead or limitations of each use case. As we saw in Nancy’s fast food restaurant, the max performance scaling factor was two employees per order.&lt;/p&gt;

&lt;p&gt;The elastic throughput and performance scaling described in this post are simplified examples to help us understand their roles in a database system. Real designs that support them are much more complicated and must consider many more factors.&lt;/p&gt;
</description>
      <pubDate>Wed, 16 Nov 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/scaling-throughput-performance-sharding-database-system/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/scaling-throughput-performance-sharding-database-system/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Nga Tran (InfluxData)</author>
    </item>
  </channel>
</rss>
