Write a Database in 50 Lines of Code

Writing your own database

Writing a database isn’t something that most people set out to do on a whim. There are a lot of components to build and decisions to make before you end up with something that is truly functional and performant.

But what if it wasn’t so difficult? What if you could use a set of tools like building blocks, interchanging them in a modular fashion, to write something that simply worked? There can be a lot of ‘reinventing the wheel’ when it comes to databases, so the less time you need to spend on table stakes, the more time you can spend on the features you really need.

This is where open source and the community of developers that contribute to open source projects come into play. I thought it would be an interesting exercise to see if I could build a database using open source tools and as little code as possible.

The FDAP stack

Underpinning my thinking on this is a great article my colleague Andrew Lamb wrote on the FDAP stack (Flight, DataFusion, Arrow, and Parquet). While he was working on that article, I was experimenting separately with the Flight Python libraries. In many cases, these libraries represent the best way for users to interact with InfluxDB 3.0.

After some learning, I managed to implement the simplest possible example of an OLAP database in just a few lines of Python. The result demonstrates the power and simplicity of the Apache Arrow project and highlights how well suited FDAP is for creating domain-specific analytic databases.

This is possible because the Apache Arrow project maintains Python bindings and libraries for pretty much everything. So, starting from Python, I can access the underlying, high-performance code maintained upstream. This is particularly notable in the case of DataFusion, a library written in Rust for executing SQL queries against Arrow tables. As you will see below, I wrote about eight lines of code, which allowed me to load a Parquet file and execute SQL queries against its contents in a feature-complete and high-performance way.
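
To make that concrete, here is a minimal, standalone sketch of that idea. It assumes the datafusion Python package is installed and that a local file named mytable.parquet already exists; both names are placeholders rather than anything from the example repo.

from datafusion import SessionContext

# Register a local Parquet file as a table and query it with plain SQL.
ctx = SessionContext()
ctx.register_parquet("mytable", "mytable.parquet")
print(ctx.sql("SELECT count(*) FROM mytable").to_pandas())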

This is a trivial example and the code is inefficient, but it proves the power of a healthy upstream project, and the benefits to the community of working upstream.

FlightServer

There is an upstream class called FlightServerBase that implements all of the basic functionality to serve the standard Flight gRPC endpoints. Apache Arrow Flight is a protocol that allows users to send commands to the database and receive Arrow data back. Using a Flight server has some important advantages: the data returns to the client as Arrow, and we can efficiently convert that Arrow data to Pandas or Polars, which means we can support high-performance analytics.
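
As a quick illustration of that last point, here is a sketch of the conversions, assuming the pandas and polars packages are installed; the table here is stand-in data rather than anything returned by the server.

import pyarrow as pa
import polars as pl

# Stand-in for an Arrow table returned by a Flight call.
table = pa.table({"col1": ["one", "two"], "col2": [1, 2]})

df_pandas = table.to_pandas()     # pandas DataFrame
df_polars = pl.from_arrow(table)  # Polars DataFrame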

Most popular programming languages have a Flight client library. As a result, users don’t need bespoke client tools, and you don’t need to build and support those tools either.

To implement such a simple Flight server, you only need to implement two of the myriad methods on FlightServerBase, as the skeleton after this list shows:

  1. do_put(): This function receives a FlightDescriptor object and a MetadataRecordBatchReader. While that is a mouthful of a class name, it is really just an object that allows the server to read the data that a user passes in. It also supports niceties like batching.
  2. do_get(): This is the function that processes a Flight Ticket, executes a query, and returns the results. As you will see, adding SQL support for this is dead easy.
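
Here is a minimal sketch of what such a server looks like, assuming pyarrow is installed. The class name, port, and entry point are illustrative rather than taken from the example repo.

import pyarrow.flight as flight

class TinyFlightServer(flight.FlightServerBase):
    """A stub Flight server: only do_put() and do_get() are filled in later."""

    def __init__(self, location="grpc://0.0.0.0:8081"):
        super().__init__(location)

    def do_put(self, context, descriptor, reader, writer):
        # Receive data from the client and persist it (see do_put() below).
        ...

    def do_get(self, context, ticket):
        # Run the query described by the ticket and stream Arrow data back.
        ...

if __name__ == "__main__":
    TinyFlightServer().serve()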

do_put()

The do_put function receives a FlightDescriptor and a Reader, and uses those two objects to add the data to the existing Parquet file.

While this write implementation is trivial – it is non-performant and lacks any sort of schema validation – its simplicity is also instructive.

To respond to a write, the code extracts the table name from the FlightDescriptor, and loads the data passed from the Reader into memory. It also calculates the file path for the Parquet file where the data is persisted.

table_name = descriptor.path[0].decode('utf-8')
data_table = reader.read_all()
file_path = f"{table_name}.parquet"

In the case where there is already data with that table name, the code loads the existing data into a table, then concatenates the new and old tables.

if os.path.exists(file_path):
    try:
        existing_table = pq.read_table(file_path)
        data_table = pa.concat_tables([data_table, existing_table])
    except Exception as e:
        print(e)

Next, it writes the table to a Parquet file:

try:
    pq.write_table(data_table, file_path)
except Exception as e:
    print(e)
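
Assembled, the whole method looks roughly like this. This is a sketch built from the snippets above; it assumes the server file imports os, pyarrow as pa, and pyarrow.parquet as pq, and that the method signature follows pyarrow’s FlightServerBase.

def do_put(self, context, descriptor, reader, writer):
    # Work out which table this write targets and read the incoming data.
    table_name = descriptor.path[0].decode('utf-8')
    data_table = reader.read_all()
    file_path = f"{table_name}.parquet"

    # If the table already exists on disk, append the new data to it.
    if os.path.exists(file_path):
        try:
            existing_table = pq.read_table(file_path)
            data_table = pa.concat_tables([data_table, existing_table])
        except Exception as e:
            print(e)

    # Persist the combined table back to the Parquet file.
    try:
        pq.write_table(data_table, file_path)
    except Exception as e:
        print(e)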

Here is some client code that creates some data, converts it to Arrow tables, creates a FlightClient, and calls do_put() to write the data:

from pyarrow.flight import FlightDescriptor, FlightClient
import pyarrow as pa

data = [{"col1":"one", "col2":1},
        {"col1":"two", "col2":2}]

table = pa.Table.from_pylist(data)

descriptor = FlightDescriptor.for_path("mytable")
client = FlightClient("grpc://localhost:8081")

writer, _ = client.do_put(descriptor, table.schema)
writer.write_table(table)
print(f"wrote: {table}")
writer.close()

Note that these are all upstream objects in the Apache Arrow community. We don’t need any custom libraries. If you run the example write.py file, you will see that it works fine.

% python3 write.py                
wrote: pyarrow.Table
col1: string
col2: int64
----
col1: [["one","two"]]
col2: [[1,2]]

do_get()

At this point, you might be thinking: “That’s cool and all, but writing files to disk is not hard.” That may be true, but what about writing a SQL implementation that can query that data? Have you ever written a parser, a planner, and everything else needed to make a SQL implementation work? That’s not so easy.

This section shows how, with just a few lines of code, you can use DataFusion to add a full-featured and fast query experience.

The do_get() function receives a ticket object, which has the information needed to execute the query. In this case, that includes the table name and the SQL query itself.

ticket_obj = json.loads(ticket.ticket.decode())
sql_query = ticket_obj["sql"]
table_name = ticket_obj["table"]

Then, it creates a DataFusion SessionContext and registers the table’s Parquet file with it.

ctx = SessionContext()
ctx.register_parquet(table_name, f"{table_name}.parquet")

Finally, it executes the query and returns the result.

result = ctx.sql(sql_query)
table = result.to_arrow_table()
return flight.RecordBatchStream(table)
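
Putting those pieces together, the whole method is roughly as follows. Again, this is a sketch assembled from the snippets above; it assumes the server file imports json, pyarrow.flight as flight, and SessionContext from the datafusion package.

def do_get(self, context, ticket):
    # The ticket carries a small JSON payload with the table name and query.
    ticket_obj = json.loads(ticket.ticket.decode())
    sql_query = ticket_obj["sql"]
    table_name = ticket_obj["table"]

    # Register the table's Parquet file with DataFusion and run the query.
    ctx = SessionContext()
    ctx.register_parquet(table_name, f"{table_name}.parquet")

    # Return the results to the client as a stream of Arrow record batches.
    result = ctx.sql(sql_query)
    table = result.to_arrow_table()
    return flight.RecordBatchStream(table)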

Here is some client code that uses the pyarrow library to execute a query and outputs the results as a Pandas DataFrame:

from pyarrow.flight import Ticket, FlightClient
import json

client = FlightClient("grpc://localhost:8081")
ticket_bytes = json.dumps({'sql':'select * from mytable', 'table':'mytable'})
ticket = Ticket(ticket_bytes)
reader = client.do_get(ticket)
print(reader.read_all().to_pandas())

% python3 read.py 
  col1  col2
0  one     1
1  two     2
2  one     1
3  two     2

It’s important to note that a mere eight lines of code add DataFusion support. In doing so, you get an incredibly fast and complete SQL engine; you can see all of the SQL statements DataFusion supports in its documentation.

Here is a slightly modified SQL query you can also try:

'select mean(col2) as mean from mytable'

% python3 read.py
   mean
0   1.5

We value open source

At InfluxData, one of our core company values is “We Value Open Source.” We demonstrate this value in different ways, for example through our support for the Telegraf project and the fact that we always offer a permissively licensed open source version of InfluxDB. With the introduction of InfluxDB 3.0, the way we value open source expanded and deepened.

I’m sure you can think of many companies that bill themselves as the “Company Behind” some open source technology. This model typically entails the company supporting an open source version that is free to use, permissively licensed, and with limited functionality. The company also offers one or more commercial versions of its software with more functionality. In these cases, the open source version largely functions as a lead generation tool for the company’s commercial products.

InfluxData, the company behind InfluxDB, is no different. For both versions 1 and 2 of InfluxDB, InfluxData also released paid commercial versions.

True, the InfluxDB source code may be interesting for anyone wishing to build their own database, but that code also has limited utility for such a project. Sure, it was useful for fixing bugs and extending the code. But if you wanted to create a new type of database, like a location database, the code for our time series database likely isn’t a useful starting point. In other words, the core code and components of the database weren’t inherently reusable.

This is where things really changed with InfluxDB 3.0. Yes, InfluxData is still the company behind InfluxDB. Yes, open source InfluxDB 3.0 is already in the works. Yes, we offer commercial versions.

However, a significant amount of the code that we write now goes into the upstream Apache Arrow project. The Apache Arrow project, and specifically the FDAP stack, is meant to be a starting point for anyone who wants to create a new type of database. This means that our contributions to the Apache Arrow project benefit community members seeking to create their own innovations in the database field.

Try it yourself

If you want to play around with the code and client examples, you can find them in this GitHub repo.

I hope that this basic database example demonstrates the power of the “We value open source” idea. Supporting open source doesn’t mean limiting our contributions to our own projects. Hopefully, our contributions to, and support of, the upstream Apache Arrow project reflect that commitment. We also hope that the availability of these high-quality and performant tools inspires others to innovate around specific domains in the database space.