InfluxData is Building a Fast Implementation of Apache Arrow in Go Using c2goasm and SIMD


InfluxData is pleased to announce our contribution to the Apache Arrow project. Essentially, we are contributing work we had already started: the development of a Go implementation of Apache Arrow. We believe in open source and are committed to participating in and contributing to the open source community in meaningful ways. We developed an interest in Apache Arrow for a number of reasons, which we describe in more detail below, and contributing our initial efforts to the Apache Software Foundation ensures that development stays focused within a single community-maintained repository.

Apache Arrow specifies a standardized, language-independent, columnar memory format for flat and hierarchical data that is organized for efficient, analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and inter-process communication. As we have been developing a new query processing engine and language for InfluxDB, currently known as Flux (formerly IFQL), Arrow provides an excellent way to exchange data between the database and the query processing engine, while also giving InfluxData a means to participate in a broader ecosystem of data processing and analysis tools.

Why Arrow?

One of the many goals for Flux is to enable new ways to efficiently query and analyze your data using industry-standard tools. One such example is pandas, an open source library that provides advanced features for data analytics and visualization. Another is Apache Spark, a scalable data processing engine. We discovered that these and many other open source projects, as well as commercial software offerings, are adopting Apache Arrow to address the challenge of sharing columnar data efficiently. The Apache Arrow mission statement defines a number of goals that resonated with the team at InfluxData:

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.

Specifically:

  • Standardized: Many projects in the data science and analytics space are adopting Arrow as it addresses a common set of design problems including how to efficiently exchange large data sets. Examples of early adopters include pandas and Spark, and the list continues to grow.
  • Performance: The specification is clear that performance is the raison d'être. Arrow data structures are designed to work efficiently on modern processors, enabling the use of features like single-instruction, multiple-data (SIMD).
  • Language-Independent: Mature libraries exist for C/C++, Python, Java, and JavaScript, with libraries for Ruby and Go in active development. More libraries mean more ways to work with your data.

We also recognize Apache Arrow as an opportunity to participate in and contribute to a community that will face similar challenges. A problem shared is a problem halved.

Apache Arrow at InfluxData

We have identified a few areas where InfluxDB will benefit from Apache Arrow:

  • representing in-memory TSM columnar data,
  • performing aggregations using SIMD math kernels, and
  • serving as the data communication protocol between InfluxDB and Flux.

For Flux:

  • representing block data structures,
  • performing aggregations using SIMD math kernels, and
  • serving as the primary communication protocol to clients.

In the future, we expect that a user could create a Jupyter Notebook, execute a Flux query in Python and manipulate the data efficiently in pandas, with little overhead.

Apache Arrow in Go

At the time of writing, the Go implementation has support for the following features:

Memory Management

  • Allocations are 64-byte aligned and padded to 8 bytes

Array and Builder Support

Primitive Types

  • Signed and unsigned 8-, 16-, 32- and 64-bit integers
  • 32- and 64-bit floats
  • Packed LSB booleans
  • Variable-length binary arrays

Parametric Types

  • Timestamp

Type Metadata

  • Data types

SIMD Math Kernels

  • SIMD-optimized Sum operations for 64-bit float, int and unsigned int arrays
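
As a quick illustration of the array and builder support listed above, here is a minimal sketch of building a float64 array. The import paths reflect the Apache repository layout as it stands today (the package has moved over time), so treat the paths and method names as indicative rather than exact:

package main

import (
	"fmt"

	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/memory"
)

func main() {
	pool := memory.NewGoAllocator()

	// The builder performs the aligned, padded allocations
	// described in the Memory Management section above.
	bld := array.NewFloat64Builder(pool)
	defer bld.Release()

	bld.AppendValues([]float64{1, 2, 3, 4}, nil)

	arr := bld.NewFloat64Array()
	defer arr.Release()

	fmt.Println(arr.Values()) // [1 2 3 4]
}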

SIMD Your Go with No Assembly Required, Using This One Weird Trick!

Before we share the magic, let's delve a little deeper into why SIMD (single-instruction, multiple-data) is relevant. It is no accident that most data structures in Apache Arrow occupy contiguous blocks of memory as arrays or vectors. Using special instructions, many of today's CPUs can process tightly packed data like this in parallel, improving the performance of specific algorithms and operations. Even better, compilers are built with a host of advanced optimizations, such as auto-vectorization, to take advantage of these features without the developer having to write any assembly. During compilation, the compiler may identify loops that process arrays as candidates for auto-vectorization and generate more efficient machine code utilizing SIMD instructions. Alas, the Go compiler lacks these optimizations, leaving us to fend for ourselves. We could write these routines in assembly, but that is hard enough without having to use Go's esoteric Plan 9 syntax. To make matters worse, writing optimal assembly for a specific architecture requires familiarity with other issues like instruction scheduling, data dependencies, AVX-SSE transition penalties and more.

clang + c2goasm = ??

c2goasm, developed by minio, is an awesome command-line tool that transforms the assembly output of functions written in C/C++ into something the Go Plan 9 assembler will understand. Unlike cgo, the resulting routines are just as efficient to call as any other Go function. A caveat of routines written in Go assembly is that they cannot be inlined, so it is important they do enough work to negate the overhead of the function call. The examples in the c2goasm release announcement make use of intrinsics, which are compiler extensions that provide access to processor-specific features like wide data types (__m256) and functions that map to processor instructions (_mm256_load_ps). Using intrinsics rather than writing pure assembly allows the developer to mix high-level C code with low-level processor features, whilst still allowing the compiler to perform a limited set of optimizations.
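
To make the mechanics concrete, here is a rough sketch of what the Go side of such a routine looks like: the function body lives in the c2goasm-generated _amd64.s file, while the Go file only declares the symbol and wraps it in a friendlier signature. All names and the exact argument types here are illustrative, not the actual Arrow source:

// sum_float64_amd64.go
package math

// _sum_float64_avx2 has no Go body; its implementation is the
// c2goasm-generated assembly in sum_float64_avx2_amd64.s.
//go:noescape
func _sum_float64_avx2(buf *float64, len int, res *float64)

// sumFloat64AVX2 wraps the assembly routine in a slice-based signature.
func sumFloat64AVX2(buf []float64) float64 {
	if len(buf) == 0 {
		return 0
	}
	var res float64
	_sum_float64_avx2(&buf[0], len(buf), &res)
	return res
}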

Our first experiment was to take a Go function that summed a slice of 64-bit floats and determine whether we could improve it with c2goasm. We benchmarked 1,000-element slices, as they match the maximum size of a TSM block in InfluxDB. The benchmarks were collected on an early 2017 MacBook Pro running at 2.9 GHz.

The reference implementation in Go ran at 1200 ns/op or 6,664 MB/s:

func SumFloat64(buf []float64) float64 {
	acc := float64(0)
	for i := range buf {
		acc += buf[i]
	}
	return acc
}
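
For reference, the ns/op and MB/s figures quoted throughout come from benchmarks of roughly this shape (a minimal sketch; the package name is hypothetical and the actual benchmarks in the repository differ):

package math

import "testing"

// Benchmarks a 1,000-element slice, matching the maximum TSM block size.
func BenchmarkSumFloat64_1000(b *testing.B) {
	buf := make([]float64, 1000)
	b.SetBytes(int64(len(buf) * 8)) // 8 bytes per element, so -bench reports MB/s
	for i := 0; i < b.N; i++ {
		SumFloat64(buf)
	}
}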

Following a similar approach to the c2goasm post, we used AVX2 intrinsics to produce this abridged implementation:

#include <immintrin.h>
#include <stddef.h>

// Assumes len is a multiple of 4 and buf is 32-byte aligned,
// as required by _mm256_load_pd.
void sum_float64_avx_intrinsics(double buf[], size_t len, double *res) {
    __m256d acc = _mm256_set1_pd(0);
    for (size_t i = 0; i < len; i += 4) {
        __m256d v = _mm256_load_pd(&buf[i]);
        acc = _mm256_add_pd(acc, v);
    }

    acc = _mm256_hadd_pd(acc, acc); // acc[0] = acc[0] + acc[1], acc[2] = acc[2] + acc[3]
    *res = _mm256_cvtsd_f64(acc) + _mm_cvtsd_f64(_mm256_extractf128_pd(acc, 1));
}

This version summed the 1,000 64-bit floats (double in C) at 255 ns/op, a rate of 31,369 MB/s – a handy 4.7× improvement. Intel x86 AVX2 intrinsics are a set of extensions for working with 256 bits of data, or 4×64-bit float values, using single instructions. There is a fair bit going on here, so let's summarize what the code does:

  • initialize the accumulator acc, a data type representing 4×64-bit float elements, to 0
  • for each iteration:
      • load the next 4 64-bit float elements from buf into v
      • add the corresponding elements of v to acc, i.e. acc[0] += v[0], acc[1] += v[1], acc[2] += v[2] and acc[3] += v[3]
      • if more elements remain in buf, increment i by 4 and restart the loop
  • sum acc[0]+acc[1]+acc[2]+acc[3]
  • convert to a double and return the value

It is worth noting that there are a couple of cons to using intrinsics:

  • the cognitive load required to understand this function is much higher than for the Go or plain C version
  • we'll need a separate implementation using SSE4 intrinsics, as calling this function on a machine that does not support AVX2 extensions will crash with an illegal instruction fault

There are situations where using intrinsics or writing assembly is the best option, but for a simple loop like this, we decided to explore an alternative. Earlier, we mentioned auto-vectorization, so let's see what an optimizing compiler can do with a plain C version, similar to the Go one:

void sum_float64(double buf[], int len, double *res) {
    double acc = 0.0;
    for(int i = 0; i < len; i++) {
        acc += buf[i];
    }
    *res = acc;
}

1,000 floats summed in just 58 ns, a rate of 137 GB/s (8,000 bytes in 58 ns). Not too shabby, when all we did was specify a few compiler flags to enable optimizations such as loop vectorization and loop unrolling, and to generate AVX2 instructions. By writing a portable C/C++ version, we can generate an SSE4 version or target a completely different architecture like ARM64 with only minor alterations to the compiler flags; a benefit that cannot be overstated.

According to the specs for the Intel Core i7 6920HQ, it has a maximum memory bandwidth of 34.1 GB/s. 137 GB/s is well above this number, so what is going on? Caching. We can attribute the blazing speed to the data residing in one of the processor caches. Therefore, the sooner you operate on data read from main memory, the more likely you will benefit from caching.

Automating the Code Generation

There are a few steps required to go from the C source to the final Go assembly:

  1. execute clang with the correct compiler flags to generate the base assembly, producing foo_ARCH.s
  2. execute c2goasm to transform foo_ARCH.s into Go assembly
  3. repeat steps 1 and 2 for each target architecture (e.g. SSE4, AVX2 or ARM64)

If "A" changes, build "B"; if "B" changes, build "C". Sounds like a job for make, and that is exactly what we did. Any time we update the C source, we simply run make generate to update the dependent files. We also check the generated assembly files into the repository to ensure the Go package remains go-gettable.

Using These Optimizations in Go

If the AVX2 version of the function is called on a processor that does not support these extensions, your program will crash, which isn't ideal. The solution is to determine which processor features are available at runtime and call the appropriate function, falling back to the pure Go version if necessary. The Go runtime does this in a number of places using the internal/cpu package, and we took a similar approach with some improvements. At startup, the most efficient functions are selected based on available processor features; however, if an environment variable named INTEL_DISABLE_EXT is present, the specified optimizations are disabled. If this is of interest to you, we've documented the feature in the repository. For example, to disable AVX2 and use the next best set of features for a hypothetical application, myapp:

$ INTEL_DISABLE_EXT=AVX2 myapp
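
As a rough sketch of the startup selection, the pattern looks something like the following. The function names are hypothetical (sumFloat64AVX2 is from the earlier sketch), the INTEL_DISABLE_EXT handling is omitted for brevity, and we use the public golang.org/x/sys/cpu package here in place of the runtime's internal/cpu:

package math

import "golang.org/x/sys/cpu"

// sumFloat64 points at the fastest implementation this CPU supports.
var sumFloat64 func([]float64) float64 = sumFloat64Go

// sumFloat64Go is the pure Go fallback.
func sumFloat64Go(buf []float64) float64 {
	var acc float64
	for _, v := range buf {
		acc += v
	}
	return acc
}

func init() {
	// Select the best implementation once, at startup, so the hot path
	// is a plain call through a package-level function variable.
	switch {
	case cpu.X86.HasAVX2:
		sumFloat64 = sumFloat64AVX2
	case cpu.X86.HasSSE42:
		sumFloat64 = sumFloat64SSE4 // hypothetical SSE4 variant
	}
}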

Conclusion

There is still plenty of work to be done to reach feature parity with the C++ implementation of Apache Arrow and we look forward to sharing our future contributions.