Metrics are Dead? Thoughts after Monitorama
Last week I attended Monitorama in Portland, Oregon. It’s an annual conference that focuses on… you guessed it, monitoring. Much of the content and sponsor focus is on DevOps Monitoring, but it’s not exclusively devoted to that. Many of the talks were applicable to other monitoring applications like user analytics or, increasingly, sensor data. The big theme from this year is that “metrics are dead” and “tracing and events are the now/future.” I wanted to take some time to look at why I think metrics will never die and to reflect on some of the other parts of Monitorama.
Logs, Metrics, and Tracing
Over the last year or so the monitoring community has settled on what I like to call “the holy trinity of monitoring”: logs, metrics, and tracing. Logs for the unstructured, deeper root cause analysis, metrics for summarization and time series data, and tracing for tracking performance and problems in distributed systems and microservices architectures. I’ve seen APM put into either the metrics or tracing bucket depending on who you’re talking to or what level of detail you’re looking at.
Many talks at the conference mentioned these three areas of monitoring focus. Tracing seemed to be the hottest subject with multiple talks devoted to it. There was also a focus on tracking discrete events to get to better root cause analysis. The idea being that you need high fidelity data to track down problems in distributed systems.
This is where the “metrics are dead” argument came in. Although I know the speakers were mostly being cheeky, there was an underlying point: metrics are lossy, and when you lose precision you’re unable to determine the cause of a problem and, potentially, unable to effectively monitor for when problems occur. However, the fact that metrics are lossy is exactly why they’ll always be a useful tool in any monitoring or analysis toolbox. Let me explain.
Metrics (what I call regular time series) are essentially a summarization of some underlying distribution or irregular time series. A regular time series is one that has samples at regular time intervals, like once every 10 seconds for example. Tracking CPU utilization is a great example from the DevOps space. An irregular time series is simply a collection of events with associated timestamps and values. For example, individual requests to an API and their response times. Regular time series can be induced from an irregular one using a sampling or aggregation function like min, mean, or percentile applied at regular intervals or windows of time.
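To make that concrete, here is a minimal sketch of inducing a regular series from an irregular one. It is only an illustration, not how any particular database does it: the function name `to_regular_series`, the 10-second window, and the sample request data are all made up for this example.

```python
from collections import defaultdict
from statistics import mean


def to_regular_series(events, window_seconds, aggregate):
    """Bucket (timestamp, value) events into fixed windows and aggregate each one.

    `events` is an irregular series: timestamps can fall anywhere.
    The result has at most one point per window, keyed by the window start.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Align each event to the start of the window it falls in.
        window_start = (timestamp // window_seconds) * window_seconds
        buckets[window_start].append(value)
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# Hypothetical API requests: (unix timestamp in seconds, response time in ms).
requests = [(0, 120), (3, 80), (9, 100), (12, 400), (17, 200), (25, 90)]

mean_series = to_regular_series(requests, 10, mean)
max_series = to_regular_series(requests, 10, max)
```

The same events produce different regular series depending on the aggregation function, which is exactly the lossiness discussed above: the mean hides the slow request that the max exposes. Note that this sketch simply omits windows with no events; a strictly regular series would need to fill those gaps with some chosen value.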
If you’re dealing with low data volumes or drilling into something very specific, the full event log is very useful. However, when looking at larger volumes of data, summarization becomes vitally important. Take an API that gets millions of requests per day. As an operator, you’re completely unable to evaluate that number of requests. You’re also unlikely to even be able to evaluate the worst performing requests. At that volume you’re likely to have hundreds or thousands of requests per day that fall outside some norm.
Metrics and regular time series give you a method for summarizing data so you can effectively visualize or analyze it at scale. Further, at certain scales it becomes prohibitively expensive to attempt to store a raw event stream. At the very least, using summarization to create a baseline will allow you to sample your event stream more intelligently for outliers. Metrics and summarization are vitally important when dealing with even moderately sized data.
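As a rough sketch of what baseline-driven sampling could look like: keep every event that falls well above a summarized baseline, plus a thin slice of everything else. The function name `sample_with_baseline`, the two-sigma cutoff, and the keep-every-100th rule are assumptions chosen for illustration. In practice the baseline would more likely come from metrics you already store than from the batch being sampled.

```python
from statistics import mean, pstdev


def sample_with_baseline(values, threshold_sigmas=2.0, keep_every=100):
    """Keep outliers relative to a summarized baseline, plus every Nth event.

    Returns a list of (index, value, is_outlier) tuples for the kept events.
    """
    # The summary (mean and standard deviation) stands in for stored metrics.
    baseline = mean(values)
    spread = pstdev(values)
    cutoff = baseline + threshold_sigmas * spread

    kept = []
    for index, value in enumerate(values):
        is_outlier = value > cutoff
        # Always keep outliers; keep only a sparse sample of normal events.
        if is_outlier or index % keep_every == 0:
            kept.append((index, value, is_outlier))
    return kept
```

With 99 requests at 100 ms and one at 1000 ms, this keeps just two events: the first request as a representative of normal traffic, and the slow one flagged as an outlier.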
Monitoring Inspiration from Unexpected Places
My favorite talk of the conference was the very first one given: John Rauser on “The Tidyverse and the Future of the Monitoring Toolchain.” John is an excellent speaker, but that isn’t the only reason I found his talk engaging. He talked about how ideas from the Tidyverse, specifically the ggplot2 and dplyr packages, could be brought to bear in the monitoring space. John showed how ggplot2 creates a vocabulary and API for specifying graph properties, which is exactly what we’re thinking about for future work on creating an open source graphing library based on our work in Chronograf. The examples using dplyr to transform and shape data are top of mind for me as I think about the future evolution of the InfluxDB query language. I can’t possibly do the talk justice, so I recommend watching it yourself.
I should mention that in the evening after Day 1, sections of downtown Portland had a 24-hour power outage because of an underground fire. This took out power for the venue, making it impossible to hold the Day 2 events in the theater as planned. Somehow, through heroic effort, Jason Dixon, the other organizers, volunteers, and the Gerding Theater employees managed to relocate the conference for Day 2 to a new venue in less than 12 hours. I’m still amazed that they managed to pull it off.
Back to what I picked up as the major theme of the conference: tracing may be the new hotness in the monitoring space, but I think its audience is ultimately limited. Tracing is invaluable in a distributed/microservices environment, but the vast majority of applications built will continue to be monoliths. And in monoliths, metrics, events, and logs are your best tools. Most people don’t need microservices, but that’s a blog post for another time.
Overall the conference was a fantastic event and I enjoyed meeting and talking to users, other builders in the monitoring space, and even our competitors. The only ask I have for organizers going forward is to build in more break time so that we can have even more conversations between attendees. Either way, I’ll definitely be attending future Monitorama events.