Adding monitoring to your stack is one of the quickest ways to get visibility into your application, letting you catch issues more quickly and begin to make data-driven decisions about where to invest your engineering efforts. One of the most straightforward things to start monitoring is your processes themselves—it can be hard for a web server to serve requests if no processes are running, and on the flip side you can quickly deplete your available resources by running more copies of a program than you intended.
On most *nix systems, process information can be gathered in multiple ways. Depending on which specific OS you’re running, you might be able to look at the proc filesystem, which contains a number of files with information about running processes and the system in general, or you could use a tool like ps, which outputs information about running processes to the command line.
In this example we’ll use Python and Ubuntu Linux, but many of the concepts will carry over to other languages or applications and operating systems.
Getting Info about Processes on Linux
One great place to get more information about your system is the proc filesystem, which, according to the man page, “is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc.” If you visit that directory on a Linux machine, you might see something like this (output from a fresh install of Ubuntu 16.04.3):
$ cd /proc
$ ls
1 1182 14 18 25 3 399 437 53 60 80 839 bus driver kallsyms locks partitions sys version_signature
10 12 141 19 26 30 4 47 54 61 81 840 cgroups execdomains kcore mdstat sched_debug sysrq-trigger vmallocinfo
1017 1200 15 2 266 31 413 48 544 62 817 890 cmdline fb keys meminfo schedstat sysvipc vmstat
1032 1230 152 20 267 320 414 480 55 66 818 9 consoles filesystems key-users misc scsi thread-self zoneinfo
1033 1231 16 21 277 321 420 49 56 671 820 919 cpuinfo fs kmsg modules self timer_list
1095 1243 164 22 278 369 421 5 57 7 828 925 crypto interrupts kpagecgroup mounts slabinfo timer_stats
11 126 165 23 28 373 423 50 58 701 831 acpi devices iomem kpagecount mtrr softirqs tty
1174 128 166 24 29 381 425 51 59 79 837 asound diskstats ioports kpageflags net stat uptime
1176 13 17 241 295 397 426 52 6 8 838 buddyinfo dma irq loadavg pagetypeinfo swaps version
There’s a lot there! The first thing you’ll notice is a series of numbered directories; these correspond to running processes, and each directory is named with the “Process ID”, or PID, of that process. You’ll also see a number of other directories and files with information about everything from kernel parameters and loaded modules to CPU info, network statistics and system uptime. Inside the directories for each process you’ll find almost as much information about each individual process—too much for our use case. After all, we just want to monitor whether or not the process is running, and maybe how many copies are running.
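To make the "numbered directories are PIDs" idea concrete, here is a minimal Python sketch that picks the process IDs out of a directory listing. The helper names pids_from_entries and list_pids are made up for this illustration, and list_pids assumes a Linux system with the proc filesystem mounted at /proc.

```python
import os


def pids_from_entries(entries):
    """Return the sorted PIDs found among directory entry names."""
    # Process directories are the entries whose names consist only of digits.
    return sorted(int(name) for name in entries if name.isdigit())


def list_pids(proc_root="/proc"):
    """List the PIDs of running processes (assumes procfs at proc_root)."""
    return pids_from_entries(os.listdir(proc_root))


# A few names taken from the listing above.
sample = ["1", "1182", "14", "bus", "self", "uptime"]
print(pids_from_entries(sample))  # [1, 14, 1182]
```

On a Linux machine, calling list_pids() should return one number for every process directory currently present in /proc.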
When a system administrator logs into a server to verify that a process is running, it’s unlikely that /proc would be the first place they turn. Instead, they’d probably use a tool like ps, which also provides information about running processes. There are a few different versions of ps that you might use, but for the version on Ubuntu you can use the following command to get information about all running processes on your server:
$ ps aux
We’re going to use Python to create a few processes for testing purposes. Since we don’t really need these to be doing any work, we’ll write a simple program with an infinite loop and a call to the sleep function in Python’s time module, in order to avoid using unnecessary CPU cycles. Make sure you have Python installed by entering the following command at the command line:
$ python --version
Python 2.7.12
Since our program is so simple, it will work with either Python 2 or Python 3. Next, use a text editor to create a file called loop.py with the following contents:
#!/usr/bin/env python

import time

while True:
    time.sleep(5)
The first line tells the operating system which interpreter to use to execute the script. If this program were more complex, or used functionality that differed between Python 2 and Python 3, we’d want to specify which version of Python we were using instead of just saying python.
Run this command from the same directory where the file is located to make the script executable:
$ chmod 744 loop.py
Then run the program, appending the & character to the end of the command (which tells the shell to run the process in the background) so we still have access to the shell:
$ ./loop.py &
1886
After running a command using the & character, the PID of the running process is listed in the output. If you run ps aux again, you should now see a Python process with PID 1886 in the list of results.
On the Ubuntu server I am using, this command returned just over 100 results, and searching through that list manually is inefficient. We can use another command, grep, as well as some of the shell’s built-in functionality, to narrow down the results. The grep command acts like a search filter, and we can use a pipe, the | character, to send the output of the ps aux command to the input of the grep command. Let’s try it out:
$ ps aux | grep python
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
noah 1960 0.0 0.0 14224 1084 pts/0 S+ 20:56 0:00 grep --color=auto python
First we’re getting information about all running processes, then we’re “piping” that data into the grep command, which is searching for any lines that contain the string python. Sure enough, there is our Python process, 1886, in the first line of the results. But what about that second line?
The output of the ps command includes the full command line of each process, arguments included. For the grep process, that command line is grep --color=auto python: the --color=auto argument is there because Ubuntu has an alias that runs grep --color=auto when you type grep, and python is the string we were searching for. Because the search string is part of grep’s own command line, grep will always match itself.
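A short Python sketch can show why the grep line qualifies as a match. It splits one line of ps aux output into columns, assuming the default eleven-column layout seen above; the names PS_COLUMNS and parse_ps_line are invented for this example.

```python
PS_COLUMNS = ["user", "pid", "cpu", "mem", "vsz", "rss",
              "tty", "stat", "start", "time", "command"]


def parse_ps_line(line):
    """Split one line of `ps aux` output into a dict keyed by column."""
    # The first ten columns are whitespace-separated; everything after them
    # is the command line, which may itself contain spaces.
    fields = line.split(None, len(PS_COLUMNS) - 1)
    return dict(zip(PS_COLUMNS, fields))


grep_line = ("noah 1960 0.0 0.0 14224 1084 pts/0 S+ 20:56 0:00 "
             "grep --color=auto python")
row = parse_ps_line(grep_line)
print(row["pid"])                  # 1960
print(row["command"])              # grep --color=auto python
print("python" in row["command"])  # True
```

The command column carries the search string itself, which is exactly what the filter ends up matching.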
A common workaround to this issue is to search for the regular expression “[p]ython” instead of the string “python”. The brackets form a bracket expression, which matches any single one of the characters listed inside them, in our case only a “p”, so the whole pattern matches a “p” followed by the letters “ython”. When we do this, grep still matches the word “python”, but it no longer matches itself, because its own command line now contains the literal text “[p]ython”, which does not contain the string “python”. We put the pattern in single quotes so that the shell passes it to grep untouched instead of trying to treat the brackets as a filename wildcard. Give it a shot:
$ ps aux | grep '[p]ython'
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
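The same behavior can be reproduced with Python’s re module, which treats this simple bracket expression the same way grep does. This is only an illustration of the matching rule, not of how grep is implemented; the two command-line strings are copied from the ps output style shown earlier.

```python
import re

pattern = re.compile(r"[p]ython")

# Command lines as ps would report them.
loop_cmd = "python ./loop.py"
grep_cmd = "grep --color=auto [p]ython"

print(bool(pattern.search(loop_cmd)))  # True
print(bool(pattern.search(grep_cmd)))  # False
```

In the second string every “p” is followed by something other than “ython” (a space or a closing bracket), so the pattern finds nothing.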
Let’s start up another Python process and see what we get:
$ ./loop.py &
1978
$ ps aux | grep '[p]ython'
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
noah 1978 0.0 0.2 24512 6004 pts/0 S 21:13 0:00 python ./loop.py
Two Python processes, two results. If we wanted to verify that a certain number of processes were running, we should just be able to count the lines output by our command. Fortunately, providing the -c argument to grep does exactly that:
$ ps aux | grep -c '[p]ython'
2
Let’s bring the most recent of the two Python scripts into the foreground by using the fg command, kill it using <Ctrl+C>, and then count the number of Python processes again:
$ fg
./loop.py
^CTraceback (most recent call last):
  File "./loop.py", line 6, in <module>
    time.sleep(5)
KeyboardInterrupt
$ ps aux | grep -c '[p]ython'
1
Perfect! One is the number we were looking for.
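If you would rather do the counting in Python, the line-counting behavior of grep -c can be approximated as below. This is a rough sketch: count_matching_lines is a name invented here, the first two sample lines come from the output above, and the third (the grep process, with PID 1990) is hypothetical.

```python
import re


def count_matching_lines(text, pattern):
    """Count the lines of text that contain a match, like `grep -c`."""
    regex = re.compile(pattern)
    return sum(1 for line in text.splitlines() if regex.search(line))


ps_output = """\
noah 1886 0.0 0.2 24512 6000 pts/0 S 20:14 0:00 python ./loop.py
noah 1978 0.0 0.2 24512 6004 pts/0 S 21:13 0:00 python ./loop.py
noah 1990 0.0 0.0 14224 1084 pts/0 S+ 21:14 0:00 grep --color=auto -c [p]ython
"""
print(count_matching_lines(ps_output, r"[p]ython"))  # 2
```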
There’s another command, pgrep, which also fulfills all the requirements of our use case, but it’s not as generally useful. It allows you to search for all processes which match a string, and returns their PIDs by default. It also accepts a -c argument, which outputs a count of the number of matches instead:
$ pgrep -c python
1
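For comparison, here is a rough Python approximation of that kind of name-based count. It reads the comm file that Linux exposes in each process directory under /proc and does a plain substring check, whereas pgrep itself does regular expression matching, so treat it as a sketch rather than a reimplementation. The function names are invented, and the demo runs against a throwaway directory tree instead of the real /proc so that the result is predictable.

```python
import os
import tempfile


def count_by_name(name, proc_root="/proc"):
    """Count processes whose name contains `name`, reading <pid>/comm files."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                comm = handle.read().strip()
        except (IOError, OSError):
            continue  # the process exited while we were scanning
        if name in comm:
            count += 1
    return count


def make_fake_proc(names_by_pid):
    """Build a throwaway directory tree that mimics /proc/<pid>/comm."""
    root = tempfile.mkdtemp()
    for pid, comm in names_by_pid.items():
        os.mkdir(os.path.join(root, str(pid)))
        with open(os.path.join(root, str(pid), "comm"), "w") as handle:
            handle.write(comm + "\n")
    return root


fake_root = make_fake_proc({1: "systemd", 1886: "python", 1960: "bash"})
print(count_by_name("python", proc_root=fake_root))  # 1
```

On a real Linux host, count_by_name("python") with the default proc_root would scan the live process list instead.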
Gathering Process Counts with Telegraf
Now that we know how to count the number of processes running on a server, we need to start collecting that data at regular intervals. Telegraf’s exec input plugin gives us a way to run the same commands that we’ve been using at the shell in order to collect data. The exec plugin will run once during each of Telegraf’s collection intervals, executing the commands from your configuration file and collecting their output. The output can be in a variety of formats, including any of the supported Input Formats, which means that if you already have scripts that output some kind of metrics data in JSON or another of the supported formats, you can use the exec plugin to quickly start collecting those metrics using Telegraf.
If you don’t already have Telegraf installed, you can refer to the installation documentation here. After following the instructions for Ubuntu, you should find a config file located at
For the purpose of this example, we’re going to write the output to a file, so we want to edit the [[outputs.file]] section of the config, like so:
# # Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["/tmp/metrics.out"]

  ## Data format to output.
  data_format = "influx"
We’ll apply those changes by restarting Telegraf, then check that metrics are being written to /tmp/metrics.out. When Telegraf is installed from the package manager, the system input plugin is enabled by default, so we should start seeing metrics immediately:
$ sudo systemctl restart telegraf
$ tail -n 2 /tmp/metrics.out
diskio,name=dm-0,host=demo writes=7513i,read_bytes=422806528i,write_bytes=335978496i,write_time=23128i,io_time=9828i,iops_in_progress=0i,reads=9111i,read_time=23216i,weighted_io_time=46344i 1519701100000000000
diskio,name=dm-1,host=demo write_time=0i,io_time=108i,weighted_io_time=116i,read_time=116i,writes=0i,read_bytes=3342336i,write_bytes=0i,iops_in_progress=0i,reads=137i 1519701100000000000
The exec plugin doesn’t know what to do with a pipeline of several commands, like the one we have above, so we need to put it into a simple shell script. First, create a file called pyprocess_count in your home directory, with the following text:
#!/bin/sh

# Quote the pattern so the shell never expands it as a filename wildcard.
count=$(ps aux | grep -c '[p]ython')

echo "$count"
This script serves a second purpose besides letting us execute a piped command using the exec plugin: if grep -c finds no matching lines, it exits with a status code of 1, indicating an error. This causes Telegraf to ignore the output of the command and emit its own error. By storing the results of the command in the count variable, and then outputting it using echo, we can make sure that the script exits with a status code of 0. Be careful not to include “python” in the filename, or grep will match that string when the script is run. Once you’ve created the file, set its permissions so that anyone can execute it and test it out:
$ chmod 755 pyprocess_count
$ ./pyprocess_count
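The exit-status behavior described above is easy to check for yourself. The following Python sketch feeds some text to grep -c on standard input and reports both the count it printed and its exit status. It assumes a grep binary is available on the PATH, and grep_count is a helper name made up for this example.

```python
import subprocess


def grep_count(text, pattern):
    """Run `grep -c pattern` on text; return (count, exit_status)."""
    proc = subprocess.Popen(
        ["grep", "-c", pattern],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    stdout, _ = proc.communicate(text)
    return int(stdout.strip()), proc.returncode


print(grep_count("bash\nsshd\n", "[p]ython"))        # (0, 1)
print(grep_count("python ./loop.py\n", "[p]ython"))  # (1, 0)
```

A count of zero is still printed, but it comes with a non-zero exit status, which is why the wrapper script echoes the stored value instead of letting grep’s status become the script’s status.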
Then move it to /usr/local/bin:
$ sudo mv pyprocess_count /usr/local/bin
Next, we need to configure the exec input plugin to execute the script. Edit the [[inputs.exec]] section of the config so it looks like this:
# # Read metrics from one or more commands that can output to stdout
[[inputs.exec]]
  ## Commands array
  commands = [
    "/usr/local/bin/pyprocess_count"
  ]

  ## Timeout for each command to complete.
  timeout = "5s"

  name_override = "python_processes"

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "value"
We’ve added the command directly to the commands array, so it will be executed by Telegraf once per collection interval. We’ve also set the data_format to "value", because the command will output a single number, and we use name_override to give the metric a name.
Restart Telegraf again and then look at the metrics.out file to see if our new metrics are showing up. Instead of searching through the file by eye, we can use grep again to search for any lines with “python” in them:
$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000
Here the < character redirects the contents of the metrics file to the input of the grep command, another feature of the shell. In return we get a few lines of metrics in InfluxDB line protocol, each with the name of the metric, a tag for the host added by Telegraf, the value (with an “i” to indicate that it is an integer), and a timestamp.
If we bring up another Python process, we should see the value change in our output:
$ ./loop.py &
2468
$ grep python < /tmp/metrics.out
python_processes,host=demo value=1i 1519703250000000000
python_processes,host=demo value=1i 1519703260000000000
python_processes,host=demo value=1i 1519703270000000000
python_processes,host=demo value=1i 1519703280000000000
python_processes,host=demo value=1i 1519703290000000000
python_processes,host=demo value=2i 1519703300000000000
And there we go! The final metric shows two Python processes running.
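To make the structure of those line protocol records concrete, here is a small Python sketch that pulls one apart into its measurement name, tags, fields, and timestamp. It is deliberately simplified: parse_line is a name invented here, and it assumes numeric fields only, with no escaped spaces or commas, which holds for the metrics above but not for line protocol in general.

```python
def parse_line(line):
    """Parse a simple line-protocol line into its four parts."""
    # Assumes no escaped spaces or commas and no string fields.
    head, field_part, timestamp = line.split(" ")
    measurement, _, tag_part = head.partition(",")
    tags = dict(pair.split("=", 1) for pair in tag_part.split(",") if pair)
    fields = {}
    for pair in field_part.split(","):
        key, raw = pair.split("=", 1)
        # A trailing "i" marks an integer; other numbers are floats.
        fields[key] = int(raw[:-1]) if raw.endswith("i") else float(raw)
    return measurement, tags, fields, int(timestamp)


print(parse_line("python_processes,host=demo value=2i 1519703300000000000"))
# ('python_processes', {'host': 'demo'}, {'value': 2}, 1519703300000000000)
```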
Writing metrics to disk isn’t a very useful practice, but it’s good for making sure your setup is collecting the data you expected. In order to make it actionable, you’ll need to send the data you collect to a central store somewhere so that you can visualize and alert on it.
The visualizations for these metrics would be minimal; we probably don’t need a full graph, since there shouldn’t be much variation in the data we’re getting that we need to look at historically. Instead, displaying a single number (for example, the Single Stat panel in Chronograf) should be enough to give you some confidence that things are working as expected.
How you alert on these metrics will depend on what exactly you’re monitoring. Perhaps you always want to have one copy of a process running. You could create an alert that sends an email every time your process count drops below 1. After the first few alerts, though, your team will probably want to automate bringing up a new process if yours crashes, so you’ll need to tweak the alert so that some time needs to elapse between seeing the metric go to 0 and sending the first alert; if your automated system can bring up the process quickly, then a human doesn’t need to be contacted.
Or maybe you have a system that is regularly spawning new processes and killing old ones, but which should never have more than X processes running at a given time. You’d probably want to set up a similar alert to the one above, except instead of alerting when the metric drops from 1 to 0, you’d alert if the metric was greater than or less than X. You might want to give yourself a time window for this alert as well; maybe it’s OK if your system runs X+1 or X-1 processes for a short time as it is killing and bringing up new ones.
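Both of these alerting ideas boil down to the same rule: only alert once the metric has been out of bounds for several samples in a row. The Python sketch below illustrates that rule and nothing more; it is not how Kapacitor or any other alerting tool is configured. The function name should_alert, the grace parameter, and the limit of 4 processes are all assumptions made for the example, and the samples are taken to arrive every 10 seconds, as in the metrics above.

```python
def should_alert(samples, is_healthy, grace):
    """Return True once the last `grace` samples are all unhealthy."""
    if grace < 1 or len(samples) < grace:
        return False
    return all(not is_healthy(count) for count in samples[-grace:])


def at_least_one(count):
    """Healthy when one or more copies of the process are running."""
    return count >= 1


def at_most_four(count):
    """Healthy when no more than X processes are running (X = 4 here)."""
    return count <= 4


# Case 1: tolerate up to three bad samples (about 30s) before alerting.
print(should_alert([1, 1, 0, 0], at_least_one, grace=3))  # False
print(should_alert([1, 0, 0, 0], at_least_one, grace=3))  # True

# Case 2: a brief spike above the limit is fine, a sustained one is not.
print(should_alert([4, 5, 4, 5], at_most_four, grace=2))  # False
print(should_alert([4, 4, 5, 5], at_most_four, grace=2))  # True
```

Choosing the grace period is the real design decision: it should be a little longer than the time your automation normally needs to bring a process back.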
If you decide to send your data to InfluxDB, you can use Chronograf and Kapacitor to visualize and alert on your metrics. You can read more about creating a Chronograf Dashboard or setting up a Kapacitor alert on their respective documentation pages.