Writing Data to InfluxDB with Python

One of the questions we see fairly often from Python developers who are using InfluxDB is how to improve the write performance of programs which use the InfluxDB client library. Usually they’re trying to import or transfer large amounts of data and want to make sure it can be inserted into the database quickly enough for their business needs. In this post, we’ll go over the common pitfalls that people encounter, as well as some performance best practices.

The script

Let’s look at an example of a Python script that writes data to InfluxDB. You can find it here. The first part of the script (up until Line 56) is used for creating the data we’ll use to populate our measurements; there is a list of locations, a list of fruits, and a function which generates 1000 UUIDs.
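
As a rough sketch of that setup (only the variable names come from the script; the concrete tag values here are assumptions), it might look something like this:

import random
import uuid

measurement_name = "m1"

# Tag values to choose from; one of each is picked at random per point
# (the specific locations and fruits are illustrative, not from the script)
location_tags = ["us-east", "us-west", "eu-central"]
fruit_tags = ["apple", "banana", "cherry"]

# A pool of 1,000 unique IDs generated up front
id_tags = [str(uuid.uuid4()) for _ in range(1000)]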

We create an array, and append the first value to it using our data as follows:

data = []
data.append("{measurement},location={location},fruit={fruit},id={id} x={x},y={y},z={z}i {timestamp}"
            .format(measurement=measurement_name,
                    location=random.choice(location_tags),
                    fruit=random.choice(fruit_tags),
                    id=random.choice(id_tags),
                    x=round(random.random(),4),
                    y=round(random.random(),4),
                    z=random.randint(0,50),
                    timestamp=data_start_time))

This will append a string that looks like:

m1,location=location,fruit=fruit,id=id x=10,y=1,z=42i 1562458785618
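
In the full script, that append statement sits inside a loop that builds the entire dataset up front. A minimal sketch of that loop, assuming a point_count variable for the total number of points and a one-millisecond spacing between timestamps (both assumptions):

data = []
for i in range(point_count):
    # Same line-protocol template as above, with the timestamp advanced per point
    data.append("{m},location={l},fruit={f},id={id} x={x},y={y},z={z}i {t}".format(
        m=measurement_name,
        l=random.choice(location_tags),
        f=random.choice(fruit_tags),
        id=random.choice(id_tags),
        x=round(random.random(), 4),
        y=round(random.random(), 4),
        z=random.randint(0, 50),
        t=data_start_time + i))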

At the beginning of the script we defined a variable which contains the number of points we’d like to write: in this case, 250,000. Once that data has been generated and loaded into the array, we can write it to the database as follows:

client.write_points(data, database='python-throughput', time_precision='ms', batch_size=10000, protocol='line')
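
The client in that call is an instance of InfluxDBClient from the influxdb package; a minimal setup, assuming a local InfluxDB 1.x instance on the default port, might look like this:

from influxdb import InfluxDBClient

# Connect to a local InfluxDB instance (host and port are assumptions)
client = InfluxDBClient(host='localhost', port=8086)

# Create the target database if it doesn't already exist
client.create_database('python-throughput')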

Measuring performance

Most Unix systems have some version of the time command, which measures the amount of time it takes for a process to execute and then prints it to the command line.

$ time echo "hello world"
hello world

real	0m0.000s
user	0m0.000s
sys	0m0.000s

It’s a tool that many Linux users are familiar with, but the issue with measuring write performance this way is that you end up timing the entirety of the program. If you’re generating random data as we are, or parsing a CSV, this can be a meaningful chunk of time. In order to measure individual parts of the program, we’ll make use of the time module’s perf_counter() function.

time.perf_counter() accesses the CPU's hardware performance counter and returns a value in fractional seconds. The performance counter is the highest-resolution timer available on your system. It starts when the CPU is powered on and runs continuously, which means that to measure the elapsed time of something, you get the value of the counter before and after the event occurs and then subtract the start value from the end value.

Since we want to understand the performance of writing data, we use time.perf_counter() to time the write_points() function.

client_write_start_time = time.perf_counter()

client.write_points(data, database='python-throughput', time_precision='ms', batch_size=10000, protocol='line')

client_write_end_time = time.perf_counter()

print("Client Library Write: {time}s".format(time=client_write_end_time - client_write_start_time))

This is perfect for our use case, but if your needs are different, Python also offers a module called timeit, which “provides a simple way to time small bits of Python code”. For larger, more complex applications, you might want to look into using a profiler, which will give you detailed information about the entirety of your program’s execution.
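
For example, a quick timeit measurement of a small expression might look like this (the snippet being timed is just an illustration):

import timeit

# timeit runs the statement repeatedly and returns the total elapsed time in seconds
elapsed = timeit.timeit("round(random.random(), 4)", setup="import random", number=1000000)
print("1,000,000 calls: {time}s".format(time=elapsed))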

Let’s run the script! We’ll go ahead and use the time command to time the execution of the entire script so we can compare that to the execution of the write_points() function.

$ time python3 write_test.py 
Client Library Write: 3.4734167899999995s

real	0m6.060s
user	0m2.387s
sys	0m0.114s

I ran the script 5 times, dropping the database and starting fresh between each run: the average write time for 250,000 points was 3.81 seconds on my 2018 dual-core MacBook Pro (which has an admittedly fast SSD), while the average elapsed time for the entire script (including generating the data and the time required to initialize and exit the program) was 6.35 seconds.

Optimization 1: Favor Line Protocol over JSON

The Python client library’s write_points function can take data in either JSON or InfluxDB Line Protocol. Looking at the code, though, we discover that any data provided in JSON format gets converted to Line Protocol anyway before being sent to InfluxDB:

if protocol == 'json':
  data = make_lines(data, precision).encode('utf-8')

The make_lines() function can be found here, and does the work of converting from JSON to Line Protocol.
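
For reference, the same point expressed in the JSON format accepted by write_points would look roughly like this (the tag and field values are illustrative):

json_point = {
    "measurement": "m1",
    "tags": {
        "location": "us-east",
        "fruit": "apple",
        "id": "0fd6c3a2-6a94-4a44-a25d-4e17015cfb25"
    },
    "time": 1562458785618,
    "fields": {
        "x": 0.2048,
        "y": 0.3291,
        "z": 42
    }
}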

If we can avoid that work by providing data in Line Protocol to begin with, we’ll save some time. Let’s run a few tests and see what the data looks like. You’ll find a JSON version of the script here. I ran it 5 times and averaged the results again. With the data in JSON instead of Line Protocol, it took an average of 7.65 seconds to insert all the data, about two times slower than using Line Protocol directly.

Optimization 2: Batch Your Points

InfluxDB performs best when data is written to the database in batches. This helps minimize the network overhead of opening and closing HTTP connections by transmitting more data at once. The ideal batch size for InfluxDB is 5,000-10,000 points.

By default, the write_points() function will transmit all of the points in the array passed to the function, but you can also provide a parameter which defines the batch size. In the example script, we generate 250,000 points up front, and then write those points in batches of 10,000:

client.write_points(data, database='python-throughput', time_precision='ms', batch_size=10000, protocol='line')

For comparison, I ran the script with a batch size of 50; with that setting, the write_points portion of our code averaged about 29 seconds, nearly an order of magnitude slower.
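
If your data doesn’t fit comfortably in memory, you can also apply the same batching idea yourself by writing chunks as they are generated. A rough sketch, assuming a hypothetical generate_points() generator that yields line-protocol strings:

BATCH_SIZE = 10000  # within the 5,000-10,000 range recommended above
batch = []
for point in generate_points():  # hypothetical generator yielding line-protocol strings
    batch.append(point)
    if len(batch) >= BATCH_SIZE:
        client.write_points(batch, database='python-throughput', time_precision='ms', protocol='line')
        batch = []
# Write whatever is left over in the final, partial batch
if batch:
    client.write_points(batch, database='python-throughput', time_precision='ms', protocol='line')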

Next steps: Measure more things!

If your program is still executing too slowly for you, it’s always worthwhile to examine other properties of the system where your code is running. How do your resources look? RAM, CPU, disk space? What is the throughput of your network connection, or disk I/O? There are a lot of external factors which can influence the execution of your code: be methodical about testing them and collecting data so you can rule them out as causes.

If you have more ideas for optimizing write performance with the Python library, we’d love to hear them! Open a pull request, comment on our community forums, or reach out directly to me at [email protected].