Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet
By
Andrew Lamb /
Xiangpeng Hao /
Developer
Aug 22, 2024
Navigate to:
Editor’s Note: This is the first of a two part blog series. This link will be updated when the second part is posted.
This blog describes our experience implementing StringView in the Rust implementation of Apache Arrow, and integrating it into Apache DataFusion, significantly accelerating string-intensive queries in the ClickBench benchmark by 20%- 200% (Figure 11).
Getting significant end-to-end performance improvements was non-trivial. Implementing StringView itself was only a fraction of the effort required. Among other things, we had to optimize UTF-8 validation, implement unintuitive compiler optimizations, tune block sizes, and time GC to realize the FDAP ecosystem’s benefit. With other members of the open source community, we were able to overcome performance bottlenecks that could have killed the project. We would like to contribute by explaining the challenges and solutions in more detail so that more of the community can learn from our experience.
StringView is based on a simple idea: avoid some string copies and accelerate comparisons with inlined prefixes. Like most great ideas, it is “obvious” only after someone describes it clearly. Although simple, straightforward implementation actually slows down performance for almost every query. We must, therefore, apply astute observations and diligent engineering to realize the actual benefits from StringView.
Although this journey was successful, not all research ideas are as lucky. To accelerate the adoption of research into industry, it is valuable to integrate research prototypes with practical systems. Understanding the nuances of real-world systems makes it more likely that research designs2 will lead to practical system improvements.
StringView support was released as part of arrow-rs v52.2.0 and DataFusion v41.0.0. You can try it by setting the schema_force_string_view
DataFusion configuration option, and we are hard at work with the community to make it the default. We invite everyone to try it out, take advantage of the effort invested so far, and contribute to making it better.
Figure 1: StringView improves string-intensive ClickBench query performance by 20% - 200%
Section 1: What is StringView?
Figure 2: Use StringArray and StringViewArray to represent the same string content.
The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray3 type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.
A StringViewArray consists of three components:
- The view array
- The buffers
- The buffer pointers (IDs) that map buffer offsets to their physical locations
Each view is 16 bytes long, and its contents differ based on the string’s length:
- string length < 12 bytes: the first four bytes store the string length, and the remaining 12 bytes store the inlined string.
- string length > 12 bytes: the string is stored in a separate buffer. The length is again stored in the first 4 bytes, followed by the buffer id (4 bytes), the buffer offset (4 bytes), and the prefix (first 4 bytes) of the string.
Figure 2 shows an example of the same logical content (left) using StringArray (middle) and StringViewArray (right):
- The first string –
“Apache DataFusion”
– is 17 bytes long, and both StringArray and StringViewArray store the string’s bytes at the beginning of the buffer. The StringViewArray also inlines the first 4 bytes –“Apac”
– in the view. - The second string,
“InfluxDB”
is only 8 bytes long, so StringViewArray completely inlines the string content in the view struct while StringArray stores the string in the buffer as well. - The third string
“Arrow Rust Impl”
is 15 bytes long and cannot be fully inlined. StringViewArray stores this in the same form as the first string. - The last string
“Apache DataFusion”
has the same content as the first string. It’s possible to use StringViewArray to avoid this duplication and reuse the bytes by pointing the view to the previous location.
StringViewArray provides three opportunities for outperforming StringArray:
- Less copying via the offset + buffer format
- Faster comparisons using the inlined string prefix
- Reusing repeated string values with the flexible view layout
The rest of this blog post discusses how to apply these opportunities in real query scenarios to improve performance, what challenges we encountered along the way, and how we solved them.
Section 2: Faster Parquet Loading
ApacheParquet is the de facto format for storing large-scale analytical data commonly stored LakeHouse-style, such as Apache Iceberg and Delta Lake. Efficiently loading data from Parquet is thus critical to query performance in many important real-world workloads.
Parquet encodes strings (i.e., byte array) in a slightly different format than required for the original Arrow StringArray. The string length is encoded inline with the actual string data (as shown in Figure 4 left). As mentioned previously, StringArray requires the data buffer to be continuous and compact—the strings have to follow one after another. This requirement means that reading Parquet string data into an Arrow StringArray requires copying and consolidating the string bytes to a new buffer and tracking offsets in a separate array. Copying these strings is often wasteful. Typical queries filter out most data immediately after loading, so most of the copied data is quickly discarded.
On the other hand, reading Parquet data as a StringViewArray can re-use the same data buffer as storing the Parquet pages because StringViewArray does not require strings to be contiguous. For example, in Figure 4, the StringViewArray directly references the buffer with the decoded Parquet page. The string “Arrow Rust Impl”
is represented by a view
with offset 37 and length 15 into that buffer.
Figure 4: StringViewArray avoids copying by reusing decoded Parquet pages.
Mini benchmark
Reusing Parquet buffers is great in theory, but how much does saving a copy actually matter? We can run the following benchmark in arrow-rs to find out:
cargo bench --bench arrow_reader --features="arrow test_common experimental" "arrow_array_reader/Binary.*Array/plain encoded"
Our benchmarking machine shows that loading BinaryViewArray is almost 2x faster than loading BinaryArray (see next section about why this isn’t StringViewArray).
arrow_array_reader/BinaryArray/plain encoded time: [315.86 µs **317.47 µs** 319.00 µs]
arrow_array_reader/BinaryViewArray/plain encoded
time: [162.08 µs **162.20 µs** 162.32 µs]
You can read more on this arrow-rs issue: https://github.com/apache/arrow-rs/issues/5904
Section 2.1: From binary to strings
You may wonder why we reported performance for BinaryViewArray when this post is about StringViewArray. Surprisingly, initially, our implementation to read StringViewArray from Parquet was much slower than StringArray. Why? TLDR: Although reading StringViewArray copied less data, the initial implementation also spent much more time validating UTF-8 (as shown in Figure 5).
Strings are stored as byte sequences. When reading data from (potentially untrusted) Parquet files, a Parquet decoder must ensure those byte sequences are valid UTF-8 strings, and most programming languages, including Rust, include highly optimized routines for doing so.
Figure 5: Time to load strings from Parquet. The UTF-8 validation advantage initially eliminates the advantage of reduced copying for StringViewArray.
A StringArray can be validated in a single call to the UTF-8 validation function as it has a continuous string buffer. As long as the underlying buffer is UTF-84, all strings in the array must be UTF-8. The Rust parquet reader makes a single function call to validate the entire buffer.
However, validating an arbitrary StringViewArray requires validating each string with a separate call to the validation function, as the underlying buffer may also contain non-string data (for example, the lengths in Parquet pages).
UTF-8 validation in Rust is highly optimized and favors longer strings (as shown in Figure 6), likely because it leverages SIMD instructions to perform parallel validation. The benefit of a single function call to validate UTF-8 over a function call for each string more than eliminates the advantage of avoiding the copy for StringViewArray.
Figure 6: UTF-8 validation throughput vs string length—StringArray’s contiguous buffer can be validated much faster than StringViewArray’s buffer.
Does this mean we should only use StringArray? No! Thankfully, there’s a clever way out. The key observation is that in many real-world datasets, 99% of strings are shorter than 128 bytes, meaning the encoded length values are smaller than 128, in which case the length itself is also valid UTF-8 (in fact, it is ASCII).
This observation means we can optimize validating UTF-8 strings in Parquet pages by treating the length bytes as part of a single large string as long as the length value is less than 128. Put another way, prior to this optimization, the length bytes act as string boundaries, which require a UTF-8 validation on each string. After this optimization, only those strings with lengths larger than 128 bytes (less than 1% of the strings in the ClickBench dataset) are string boundaries, significantly increasing the UTF-8 validation chunk size and thus improving performance.
The actual implementation is only nine lines of Rust (with 30 lines of comments). You can find more details in the related arrow-rs issue: https://github.com/apache/arrow-rs/issues/5995. As expected, with this optimization, loading StringViewArray is almost 2x faster than loading StringArray.
Section 2.2: Be careful about implicit copies
After all the work to avoid copying strings when loading from Parquet, performance was still not as good as expected. We tracked the problem to a few implicit data copies that we weren’t aware of, as described in this issue.
The copies we eventually identified come from the following innocent-looking line of Rust code, where self.buf
is a reference counted pointer that should transform without copying into a buffer for use in StringViewArray.
let block_id = output.append_block(self.buf.clone().into());
However, Rust-type coercion rules favored a blanket implementation that did copy data. This implementation is shown in the following code block where the impl<T: AsRef<[u8]>>
will accept any type that implements AsRef<[u8]>
and copies the data to create a new buffer. To avoid copying, users need to explicitly call from_vec
, which consumes the Vec
and transforms it into a buffer.
impl<T: AsRef<[u8]>> From<T> for Buffer {
fn from(p: T) -> Self {
// copies data here
...
}
}
impl Buffer {
pub fn from_vec<T>(data: Vec<T>) -> Self {
// zero-copy transformation
...
}
}
Diagnosing this implicit copy was time-consuming as it relied on subtle Rust language semantics. We needed to track every step of the data flow to ensure every copy was necessary. To help other users and prevent future mistakes, we also removed the implicit API from arrow-rs in favor of an explicit API. Using this approach, we found and fixed several other unintentional copies in the code base—hopefully, the change will help other downstream users avoid unnecessary copies.
Section 2.3: Help the compiler by giving it more information
The Rust compiler’s automatic optimizations mostly work very well for a wide variety of use cases, but sometimes, it needs additional hints to generate the most efficient code. When profiling the performance of view
construction, we found, counterintuitively, that constructing long strings was 10x faster than constructing short strings, which made short strings slower on StringViewArray than on StringArray!
As described in Section 1, StringViewArray treats long and short strings differently. Short strings (<12 bytes) directly inline to the view struct, while long strings only inline the first 4 bytes. The code to construct a view
looks something like this:
if len <= 12 {
// Construct 16 byte view for short string
let mut view_buffer = [0; 16];
view_buffer[0..4].copy_from_slice(&len.to_le_bytes());
view_buffer[4..4 + data.len()].copy_from_slice(data);
...
} else {
// Construct 16 byte view for long string
ByteView {
length: len,
prefix: u32::from_le_bytes(data[0..4].try_into().unwrap()),
buffer_index: block_id,
offset,
}
}
It appears that both branches of the code should be fast: they both involve copying at most 16 bytes of data and some memory shift/store operations. How could the branch for short strings be 10x slower?
Looking at the assembly code using godbolt, we (with help from Ao Li) found the compiler used CPU load instructions to copy the fixed-sized 4 bytes to the view
for long strings, but it calls a function, ptr::copy_non_overlapping, to copy the inlined bytes to the view
for short strings. The difference is that long strings have a prefix size (4 bytes) known at compile time, so the compiler directly uses efficient CPU instructions. But, since the size of the short string is unknown to the compiler, it has to call the general-purpose function ptr::copy_non_coverlapping
. Making a function call is significant unnecessary overhead compared to a CPU copy instruction.
However, we know something the compiler doesn’t know: the short string size is not arbitrary—it must be between 0 and 12 bytes, and we can leverage this information to avoid the function call. Our solution generates 13 copies of the function using generics, one for each of the possible prefix lengths. The code looks as follows, and checking the assembly code, we confirmed there are no calls to ptr::copy_non_overlapping
, and only native CPU instructions are used. For more details, see the ticket.
fn make_inlined_view<const LEN: usize>(data: &[u8]) -> u128 {
let mut view_buffer = [0; 16];
view_buffer[0..4].copy_from_slice(&(LEN as u32).to_le_bytes());
view_buffer[4..4 + LEN].copy_from_slice(&data[..LEN]);
u128::from_le_bytes(view_buffer)
}
pub fn make_view(data: &[u8], block_id: u32, offset: u32) -> u128 {
let len = data.len();
// generate special code for each of the 13 possible lengths
match len {
0 => make\_inlined\_view::<0>(data),
1 => make\_inlined\_view::<1>(data),
2 => make\_inlined\_view::<2>(data),
3 => make\_inlined\_view::<3>(data),
4 => make\_inlined\_view::<4>(data),
5 => make\_inlined\_view::<5>(data),
6 => make\_inlined\_view::<6>(data),
7 => make\_inlined\_view::<7>(data),
8 => make\_inlined\_view::<8>(data),
9 => make\_inlined\_view::<9>(data),
10 => make\_inlined\_view::<10>(data),
11 => make\_inlined\_view::<11>(data),
12 => make\_inlined\_view::<12>(data),
_ => {
// handle long string
}}}
Section 2.4: End-to-end query performance
In the previous sections, we went out of our way to make sure loading StringViewArray is faster than StringArray. Before going further, we wanted to verify if obsessing about reducing copies and function calls has actually improved end-to-end performance in real-life queries. To do this, we evaluated a ClickBench query (Q20) in DataFusion that counts how many URLs contain the word "google"
:
SELECT COUNT(*) FROM hits WHERE "URL" LIKE '%google%';
This is a relatively simple query; most of the time is spent on loading the “URL” column to find matching rows. The query plan looks like this:
Projection: COUNT(*) [COUNT(*):Int64;N]
Aggregate: groupBy=[[]], aggr=[[COUNT(*)]] [COUNT(*):Int64;N]
Filter: hits.URL LIKE Utf8("%google%")
TableScan: hits
We ran the benchmark in the DataFusion repo like this:
cargo run --profile release-nonlto --bin dfbench -- clickbench --queries-path benchmarks/queries/clickbench/queries.sql --iterations 3 --query 20 --path benchmarks/data/hits.parquet --string-view
With StringViewArray we saw a 24% end-to-end performance improvement, as shown in Figure 7. With the –string-view argument, the end-to-end query time is 944.3 ms, 869.6 ms, 861.9 ms (three iterations). Without –string-view, the end-to-end query time is 1186.1 ms, 1126.1 ms, 1138.3 ms. Figure 7: StringView reduces end-to-end query time by 24% on ClickBench Q20.
We also double-checked with detailed profiling and verified that the time reduction is indeed due to faster Parquet loading.
Conclusion
In this first blog post, we have described what it took to improve the performance of simply reading strings from Parquet files using StringView. While this resulted in real end-to-end query performance improvements, in our next post, we explore additional optimizations enabled by StringView in DataFusion, along with some of the pitfalls we encountered while implementing them.
Note: Thanks to InfluxData for sponsoring this work as a summer intern project