loading...
Cover image for uWSGI stats monitoring from scratch using Telegraf InfluxDB and Grafana

uWSGI stats monitoring from scratch using Telegraf InfluxDB and Grafana

cheviana profile image Jane Radetska Updated on ・16 min read

At the end of this tutorial, you'll end up with dashboard like this. Each panel shows some uWSGI metric as time series.

uWSGI stats, changing in time

Watching over uWSGI-reported statistics like worker busyness is super helpful for investigating uWSGI configurations. Also useful in production environment - it's possible to extend described approach to monitor real-life uWSGI web applications, but this post doesn't aim to cover that.

In future, I plan to add posts about how different uWSGI options influence behavior of web server, and how to monitor Python web app aspects, such as networking, using same tools.

Tutorial roadmap

Here's what will be described:

  • setup and run simple uWSGI webserver
  • load that webserver using wrk2
  • install and run InfluxDB
  • install, configure and run Telegraf
  • install and run Grafana
  • create dashboard in Grafana to show uWSGI metrics
  • each monitored uWSGI metric dive-in

Code and configs mentioned in here can be found in the repo.

The diagram

Diagram of uWSGI webserver and monitoring tools

uWSGI stats

uWSGI can expose stats on a separate socket.

The simplest way to see these metrics is to use uwsgitop - https://github.com/xrmx/uwsgitop. However, uwsgitop only shows current metrics readings, not history data, just like Linux top command.

Some time ago I created a fork of uwsgitop because of an encoding-related bug in its output. I had fun a time trying to figure out why uwsgitop won't work. That bug seems to be fixed now though.

I think it's nicer to be able to see uWSGI stats metrics over a continuous period of time as that provides more information than just this moment readings.

uwsgitop is like a speedometer readings - change every second. ⌚

Monitoring dashboard is like a cardiogram - recorded readings over time. πŸ“ˆ

Choosing monitoring tools

One option for continuous monitoring is to use Prometheus with exporter for uWSGI stats; here, I'll describe other option - TIG stack.

TIG stack differs from Prometheus mainly in the idea of relying on some other tool to push metrics to time series DB, instead of pulling model which Prometheus uses. Actually, both of the stacks can do push and pull, and there's discussion going on about which method works better for which type of thing you want to watch over.

For the kind of local comparative experiments I'm planning to run on uWSGI web servers in following posts, I don't think there's much difference which monitoring solution to use.

I picked TIG for this post because I have some experience with it and Prometheus approach is described elsewhere.

One little pebble thrown in the direction of Prometheus: its uwsgi-exporter seems to not support reporting of uWSGI worker status which is indeed very useful metric. Probably support for that will be added in time.

List of what Telegraf uWSGI plugin can monitor is rather impressive.

Simple uWSGI web server

Here's code of "Hello World" uWSGI app in uwsgi-hello-world.py:

import time


def application(env, start_response):
    start_response('200 OK', [('Content-Type','text/html')])
    time.sleep(0.25)  # sleep 250 msec
    return [b'Hello World']

And basic uWSGI configs in uwsgi-hello-world-configs.ini:

[uwsgi]
http-socket = 127.0.0.1:9090
wsgi-file = uwsgi-hello-world.py
master = true
processes = 4
threads = 2 
stats = 127.0.0.1:9191
stats-http = true

This code doesn't do anything useful. Worker sleeps for 250 msec and returns a string.
Can use wsgi.py of more interesting Django or Flask server or any other Python web app (change wsgi = path/to/Django-app/wsgi.py in options).

uWSGI can be installed as Python package. Let's assume you have Python installed, know how to create and activate Python virtual environment (if not, please consult virtualenv docs and virtualenvwrapper is handy).

To install uWSGI Python package:

> mkvirtualenv --python=python3 uwsgi-playground
> pip install uWSGI

To run uWSGI server:

> uwsgi --ini uwsgi-hello-world-configs.ini

That last command should produce output similar to:

...
uwsgi socket 0 bound to TCP address 127.0.0.1:9090
*** Operational MODE: preforking+threaded ***
WSGI app 0 ... ready in 0 seconds ...
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: ...)
spawned uWSGI worker 1 (pid: ..., cores: 2)
spawned uWSGI worker 2 (pid: ..., cores: 2)
spawned uWSGI worker 3 (pid: ..., cores: 2)
spawned uWSGI worker 4 (pid: ..., cores: 2)
*** Stats server enabled on 127.0.0.1:9191 fd: ... ***

This means that the web server has launched and it's listening on port 9090.

Can visit hello-world server in local web browser - on http://127.0.0.1:9090/:
hello world web server response

uWSGI process outputs to console that it processed requests after we visited hello-world page:

[pid: ...|app: 0|req: 1/1] 127.0.0.1 () {42 vars in 783 bytes} [Sun Jul 19 20:21:38 2020] GET / => generated 11 bytes in 250 msecs (HTTP/1.1 200) 1 headers in 44 bytes (2 switches on core 0)
[pid: ...|app: 0|req: 1/2] 127.0.0.1 () {38 vars in 670 bytes} [Sun Jul 19 20:21:38 2020] GET /favicon.ico => generated 11 bytes in 143 msecs (HTTP/1.1 200) 1 headers in 44 bytes (2 switches on core 0)

uWSGI can also be configured to log those lines to a file (see uWSGI logging docs).

Can also visit stats endpoint in local web browser - on http://127.0.0.1:9191/:

...
"workers": [
    {
        "id":2,
        "pid":...,
        "accepting":1,
        "requests":1,
        "delta_requests":1,
        "avg_rt":71928
        ....
]
...

That JSON contains "speedometer" readings of what's going on inside uWSGI webserver.

Loading web server with benchmark tool wrk2

To keep uWSGI web server busy (so that stats are not just all zero), let's load uWSGI hello-world server using wrk2 benchmarking tool. Can use any other artificial load generator. I picked wrk2 because I don't need complicated test scenario for this post. Some load testing tools can be easily configured to write to InfluxDB, which makes observing results of load test real handy, right next to web server stats and stats reported from web app code.

To install wrk2:

> git clone https://github.com/giltene/wrk2
> cd wrk2
> make

If make doesn't work for you, can use wrk tool - which provides install wiki pages for all platforms.

To run wrk2 or wrk:

> ./wrk -t2 -c2 -d1200s -R1 http://127.0.0.1:9090/

This command creates 2 threads that try to load webserver with 1 RPS, keeping 2 HTTP connections open.
It will create load for 1200 sec which is 20 min. Feel free to adjust.

Tools for real-time monitoring: Graphana, Telegraf, InfluxDB (also known as TIG stack)

To install all tools locally on MacOS:

> brew install influxdb  <-- Database for metrics
> brew install telegraf  <-- agent-collector of metrics
> brew install graphana  <-- UI for metrics exploration and plotting

To download all tools binaries locally on Linux:

> wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.2_linux_amd64.tar.gz
> tar xvfz influxdb-1.8.2_linux_amd64.tar.gz
> wget https://dl.influxdata.com/telegraf/releases/telegraf-1.15.2_linux_amd64.tar.gz
> tar xf telegraf-1.15.2_linux_amd64.tar.gz
> wget https://dl.grafana.com/oss/release/grafana-7.1.4.linux-amd64.tar.gz
> tar -zxvf grafana-7.1.4.linux-amd64.tar.gz

For other platforms:

Docker containers are available in this post, or:

> docker pull influxdb
> docker pull telegraf
> docker run -d --name=grafana -p 3000:3000 grafana/grafana

Run time series database InfluxDB

Launch InfluxDB:

> influxd -config /usr/local/etc/influxdb.conf

This starts up DB process.
Create database for uWSGI metrics (in other shell tab):

> influx -precision rfc3339  <-- opens CLI
Connected to http://localhost:8086 version v1.8.1
InfluxDB shell version: v1.8.1
> CREATE DATABASE localmetrics  <-- creates our DB
> SHOW DATABASES  <-- shows DB list
name: databases
name
----
_internal
localmetrics
>

It is now possible to run "INSERT ..." command using CLI, which will add metric reading to the database.

Run Telegraf - stats pull/push agent

We want to have uWSGI stats sent to database automatically every N seconds, and in a format that InfluxDB will understand.
uWSGI passively exposes stats data on socket 9191.
Need something that will query uWSGI stats endpoint and send metrics data to InfluxDB database.

This is where Telegraf comes into light. Telegraf has ability to retrieve metrics data, transform it to format InfluxDB understands using plugins and send it over to InfluxDB database.
Telegraf has a bunch of input plugins: to watch over CPU consumption levels, to read and parse log file tail, to listen to messages on socket and various web servers integrations, including uWSGI.

It's a matter of adding a few lines to Telegraf config to enable uWSGI stats reporting. Resulting telegraf.conf is available in the repo.
To understand Telegraf tool better, let's use telegraf sample-config utility to compose configs with which telegraf can consume uWSGI metrics:

> telegraf -sample-config --input-filter uwsgi --output-filter influxdb > telegraf.conf
> cat telegraf.conf
...
# Read uWSGI metrics.
[[inputs.uwsgi]]
## List with urls of uWSGI Stats servers. URL must match pattern:
## scheme://address[:port]
##
## For example:
## servers = ["tcp://localhost:5050", "http://localhost:1717", "unix:///tmp/statsock"]
servers = ["tcp://127.0.0.1:1717"]

## General connection timeout
# timeout = "5s"

tcp://127.0.0.1:1717 part doesn't match where our uWSGI exposes stats on. It's http://127.0.0.1:9191. Need to update that in telegraf.conf file.

Another thing that needs to be updated is the address of where to send stats:

# Configuration for sending metrics to InfluxDB
[[outputs.influxdb]]
## The full HTTP or UDP URL for your InfluxDB instance.
##
## Multiple URLs can be specified for a single cluster, only ONE of the
## urls will be written to each interval.
# urls = ["unix:///var/run/influxdb.sock"]
# urls = ["udp://127.0.0.1:8089"]
urls = ["http://127.0.0.1:8086"]

## The target database for metrics; will be created as needed.
## For UDP url endpoint database needs to be configured on server side.
database = "uWSGI"

I uncommented urls = ["http://127.0.0.1:8086"] line and added database = "uWSGI" so metrics from uWSGI stats server will flow in separate database.

Resulting telegraf.conf is in the repo.

To run telegraf:

> telegraf -config telegraf.conf

Looking in InfluxDB console output, can see:

2020-07-20T01:18:26.727302Z info    Executing query {"log_id": "0O6K1AQG000", "service": "query", "query": "CREATE DATABASE uWSGI"}
[httpd] 127.0.0.1 - - [19/Jul/2020:21:18:26 -0400] "POST /query HTTP/1.1" 200 57 "-" "Telegraf/1.14.5" e9bd1368-ca26-11ea-8005-88e9fe853b3a 428
[httpd] 127.0.0.1 - - [19/Jul/2020:21:18:40 -0400] "POST /write?db=uWSGI HTTP/1.1" 204 0 "-" "Telegraf/1.14.5" f1a77b04-ca26-11ea-8006-88e9fe853b3a 137748
[httpd] 127.0.0.1 - - [19/Jul/2020:21:18:50 -0400] "POST /write?db=uWSGI HTTP/1.1" 204 0 "-" "Telegraf/1.14.5" f79d5380-ca26-11ea-8007-88e9fe853b3a 7695
[httpd] 127.0.0.1 - - [19/Jul/2020:21:19:00 -0400] "POST /write?db=uWSGI HTTP/1.1" 204 0 "-" "Telegraf/1.14.5" fd928134-ca26-11ea-8008-88e9fe853b3a 8534
[httpd] 127.0.0.1 - - [19/Jul/2020:21:19:10 -0400] "POST /write?db=uWSGI HTTP/1.1" 204 0 "-" "Telegraf/1.14.5" 0388c512-ca27-11ea-8009-88e9fe853b3a 9345

This means Telegraf is already busy writing uWSGI metrics to database, but we can't see them yet.

Run Grafana - UI for metrics monitoring

Launch grafana web server:

> cd path/to/dir/with/installed/graphana
> bin/grafana-server

Navigating to http://localhost:3000/ in browser opens window with Grafana UI. Login using "admin" "admin" and create new pass as it asks to.

I'm providing screenshots of how to deal with Grafana UI as it was confusing to me. Here Grafana version v7.1.0 is featured on screenshots, UI might look different for other version.

What's left to setup:

  • create data source in Grafana for InfluxDB
  • add panels that will show metrics readings over time - can do "import it all" way, can do manually

Create InfluxDB data source in Grafana

Pick "Configuration -> Data Sources -> Add data source". Select InfluxDB.
Grafana configuration menu

Put "http://127.0.0.1:8086" in HTTP/URL input and database name "uWSGI" in "InfluxDB Details/Database" input.
New datasource screen in Grafana

Click on "Save and Test" button in the bottom of the screen, green noty reading "Data source is working" should appear.
Success new datasource in Grafana

Visualizing uWSGI metrics in Grafana

Now Grafana can read metrics from database, let's visualize them - add graph panel for each metric.

Grafana dashboard consists of panels, panel can show how particular metric changed in time. There are lots of types of panels but we will only deal with Time Series panel in this tutorial.

Here's dashboard snapshot with measurements of hello-world app with 4 workers being loaded by artificial users. That dashboard is monitoring following uwsgi metrics:

  • harakiri count
  • worker status (busy/idle/cheap/total)
  • listen queue size
  • workers amount
  • worker requests
  • in-request sum
  • respawn count
  • worker avg request time
  • worker running time
  • load sum

There are more things that can be monitored: amount of transmitted data for worker, amount of exceptions for worker, etc. I didn't need these metrics or prefer to monitor them in other ways.

Can also configure uWSGI memory reporting - how much memory uWSGI consumes. I prefer to watch memory consumption from system monitoring though, along with CPU consumption.

uWSGI even allows to expose your own metrics.

Cheaper subsystem plugin, and other plugins, have their own metrics too.

I am providing instructions on how to setup all panels at once, using dashboard JSON, or how to manually add panels one by one, and learn a bit more in the process.

Manual dashboard setup: add panels one by one

Might be beneficial to read this first How to add a panel in Grafana.

How to create new panel in new dashboard:

  • Go to "Dashboards -> Manage dashboards", click on "New dashboard"
  • click "New panel" or "Add panel" in top right corner
  • on dropdown next to new panel title pick "Edit"

How to populate panel with avg worker request time data:
In panel edit mode, select Query source - InfluxDB.
Modify query builder inputs:

  • From default uwsgi_workers
  • Select field(avg_rt) mean() Edit panel query

It is important to tell the panel that the metric it displays is measured in microseconds: in the right column, expand "Axes", "Left Y", "Unit" - "microseconds".
Edit panel series units

Also, it's nice to configure points thickness so that you can see them clearly ("Display" - "Line width" in the right column).
Give a meaningful name to that panel (in top of right column), save panel (click "Apply" in top right corner).

Congrats, now you have a panel that shows avg request time of all uwsgi workers in real time. Make sure that the refresh frequency selector is set to like 10 sec (tiny dropdown in top right corner of dashboard page) and that the webserver, wrk2, telegraf, and InfluxDB are all still running.

Similar process should be repeat for rest of panel - find queries to use below in this post, and can paste queries in panel "edit query" input.

Automatic dashboard setup: import uWSGI stats dashboard JSON

To setup uWSGI stats monitoring dashboard, can use JSON in uwsgi-dashboard-model.json, and export that to new dashboard:

  • Go to "Dashboards -> Manage dashboards", click on "New dashboard"
  • Go to Dashboard settings ("wheel" button at the top right of the page with new dashboard)
  • In the left menu, pick "JSON Model"
  • In JSON, find the "panels" field (it should be empty) and paste there contents of field "panels" from uwsgi-dashboard-model.json. Pasting all JSON doesn't work for me for some reason.
  • Click "Save changes" at the bottom of Dashboard settings page. Export dashboard 1 Export dashboard 2

Congrats, now you have dashboard that shows uWSGI stats in real time. Make sure refresh frequency selector is set to like 10 sec, tiny dropdown in top right corner of dashboard page, and that the webserver, wrk2, telegraf, and InfluxDB are all still running.

Grafana specifics: time filter and time interval

$timeFilter, seen in panel query builder, stands for time range picked in Grafana UI, the period of time for which you want to see metrics.

$__timeInterval, seen in panel query builder, stands for time interval. Time interval depends on what's the time range you're looking at in dashboard, meaning the time interval between the two nearest dots on series. Looking at a 1 hour range in dashboard (put from=now-1h&to=now in URL or use the time range picker in the upper right corner of dashboard), I see a dot at 11:01:02. The next dot is 11:01:06, so the time interval is 4 seconds.

The time interval is important to use so that the dashboard for a large range (e.g. 30 days) loads in a reasonable amount of time.

Telegraf is configured to send data from uWSGI stats server to InfluxDB once every 10 seconds. If the time interval is bigger than 10 seconds, one dot in the series corresponds to avg/sum/... (set in query which one) of multiple uWSGI stats readings (those that got into time interval).

uWSGI stats - by metric

Harakiri count

uWSGI has a useful feature, to kill worker if worker executes request for longer than defined time (e.g. 10 sec). It's called "harakiri". To configure it, set uwsgi option harakiri=10 (in uwsgi.ini), where 10 is the time of the longest allowed request in seconds.
This option can be used to protect from DDoS attacks that exploit long-executing requests. Such attacks flood a website with long-executing requests to such an extend that the website doesn't have the capacity (free workers) to serve regular user traffic, since all the workers are busy executing attacker's requests.
Setting low harakiri can bite you if the web server is expected to serve long-executing requests in some rare cases. One needs to analyze what's the longest valid request time. Consider refactoring code to avoid long-executing requests in worker processes. There are lots of options depending on the specifics of the problem: uWSGI spooler to send emails, uWSGI mules, or some kind of task queue for long-executing requests that's separate from the webserver.

Harakiri panel query is as follows:

SELECT sum("harakiri_count") FROM "uwsgi_workers" WHERE $timeFilter GROUP BY time($__interval) fill(null)

This will shows sum of harakiri event for time interval.

Worker status

uWSGI worker can be in few states:

  • idle (not working on a request)
  • busy (working on requests)
  • cheap (see uWSGI cheaper subsystem docs)

I configured Worker status panel to show idle, busy, cheap and total amounts of workers.

Query:

SELECT count("status") FROM "uwsgi_workers" WHERE  $timeFilter and "status"='busy' GROUP BY time($__interval) fill(null)

for "busy" series, for other series replace "busy" with "idle" or "cheap", for total - omit status clause (... WHERE $timeFilter GROUP BY ...).

Can configure this panel in different way - to show worker busyness in percent for previous period of time. I thought this approach more useful for uwsgitop and more confusing when metrics are monitored in time by separate monitoring stack.

Listen queue size

If all uWSGI workers are busy working on requests while new requests are arriving, those new ones are first put in queue (socket listen queue) to wait for next free worker.
The size of the listen queue is configurable using the option listen=64 but max allowed value depends on system max socket listen queue size, so you might need to increase system value first.

Panel query is:

SELECT sum("listen_queue") FROM "uwsgi_overview" WHERE $timeFilter GROUP BY time($interval)

The number of workers

The number of workers uWSGI is currently running. Makes more sense when cheaper subsystem is in use, or when cluster has multiple web server instances with scaling (all reporting to same DB) - then this count changes.

The query to get the number of workers with IDs 1 - 4:

SELECT count("avg_rt") FROM "uwsgi_workers" WHERE $timeFilter AND ("worker_id"='1' OR "worker_id"='2' OR "worker_id"='3' OR "worker_id"='4') GROUP BY time($__interval) fill(null)

This is an indirect metric that counts how many times avg_rt metric was reported for "worker_id"='1' during the time interval. It the time interval is bigger than 10 sec (telegraf queries uWSGI stats once every 10 sec) - e.g. when time range you're looking at is 6 hours, this actually shows incorrect data.
A question to figure out if you're curious and don't mind digging into Grafana docs: how does one make it show correct data regardless of chosen time range? Comment below if you find out.

Worker requests

How much requests one worker has executed since worker process was started. When webserver serves requests, this measurement rises smoothly. When a worker restarts, it falls to zero.

Query is:

SELECT mean("requests") FROM "uwsgi_workers" WHERE $timeFilter GROUP BY time($interval), "worker_id"

In-request sum

How many requests uWSGI is working on right now, across all workers, sum for time interval.

Query:

SELECT sum("in_request") FROM "uwsgi_cores" WHERE $timeFilter GROUP BY time($interval)

Respawn count

Query:

SELECT sum("respawn_count") FROM "uwsgi_workers" WHERE $timeFilter GROUP BY time($interval), "uwsgi_host"

How many times workers were respawned since uWSGI start.

For production system it's recommended to gracefully respawn uwsgi workers after some time, to avoid excessive memory consumption of long-living processes, etc.

To limit amount of executed requests after which to respawn, add in uWSGI options:

max-requests=10000
max-requests-delta=1000

This will respawn worker after 10000+-1000 requests. Delta is a random value added to/substracted from max-requests. Its purpose is not to respawn all workers at the same time.

To limit the amount of time passed after which to respawn, add in options:

max-worker-lifetime=36000
max-worker-lifetime-delta=3600

Lifetime values are in sec. This will respawn a worker if 10h+-1h of time has passed since last respawn. Delta is a random value added to/substracted from max-lifetime used in order to not respawn all workers at the same time.

One can use both limits together; whichever one occurs first for particular worker will take effect.

Worker avg request time

Query:

SELECT mean("avg_rt") FROM "uwsgi_workers" WHERE $timeFilter GROUP BY time($__interval), "worker_id" fill(null)

Amount of time worker spends on request, on average, per each worker.

Worker running time

Query:

SELECT mean("running_time") FROM "uwsgi_workers" WHERE  $timeFilter GROUP BY time($interval), "worker_id"

How long is worker running; time since the last respawn.

Load sum

Query:

SELECT sum("load") FROM "uwsgi_overview" WHERE $timeFilter GROUP BY time($interval)

I am 100% sure but this is something like amount of incoming requests minus outgoing responses. It rises dramatically when web server is overloading, and usually is zero.

Conclusions

At this point, we have a basic uWSGI web server and a monitored playground to watch over it while we experiment with uWSGI configurations.

I encourage you to try out the effects of changing uWSGI options on the web server. Start by setting max-requests=10 in uwsgi.ini and compare how that changes stats for web server under load. You can also check how it affects the results seen in load tool (wrk2) summary.

Posted on by:

cheviana profile

Jane Radetska

@cheviana

Web Developer from Kiev, Ukraine, currently living in Washington, DC. Book worm. Formula 1. Swing dancing. Harmony. @cheviana on github

Discussion

pic
Editor guide