Reducing MTTR for DevOps and SREs with PagerDuty Process Automation and InfluxDB

Mean time to resolution (MTTR) is a metric that transcends industry and technology. It’s a measure of how quickly, on average, support teams identify, act on, and resolve IT issues and incidents. Because MTTR directly relates to service quality, maintaining a low MTTR is a critical goal for DevOps and SRE teams. These teams have a vested interest in resolving issues quickly because escalating incidents to higher levels of the support team increases response and resolution times. Resolving issues fast, and doing so consistently, creates better end-user experiences, reduces errors, and makes organizations more efficient overall.
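To make the "on average" part concrete, here is a minimal sketch of how MTTR could be computed from a list of incidents. The incident timestamps and the function name are invented for illustration; they are not data from the case study.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the (resolved - opened) duration over all incidents.

    `incidents` is a list of (opened, resolved) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: three incidents lasting 30, 90, and 60 minutes.
incidents = [
    (datetime(2023, 1, 1, 9, 0), datetime(2023, 1, 1, 9, 30)),
    (datetime(2023, 1, 2, 14, 0), datetime(2023, 1, 2, 15, 30)),
    (datetime(2023, 1, 3, 22, 0), datetime(2023, 1, 3, 23, 0)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00, i.e. (30 + 90 + 60) / 3 = 60 minutes
```

Because the metric is an average, a single long-running incident can raise it noticeably, which is one reason consistency matters as much as speed.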

Identifying the process

Having tools to quickly identify, assess, and fix issues reduces the impact of incidents and outages on end users. To exercise better control over incident management, the PagerDuty team started by breaking the process down into three stages: monitoring, incident management, and runbook execution. They then sought a best-in-breed solution for each one.

They chose InfluxDB to handle monitoring, and they use the platform in several different ways. They deploy Telegraf throughout their infrastructure to monitor all their systems. Solution Consultant Craig Hobbs built an InfluxDB template for these deployments, because templates include everything needed to set up data collection, including Telegraf and associated plugins, in a matter of minutes.

Building with best-in-breed solutions

The Telegraf instances send data to InfluxDB, which processes that high-volume time series data and dispatches triggers and alerts based on it. InfluxDB’s ability to handle high-volume, high-velocity time series data allows it to cut through noisy data and identify the incidents that actually need attention.
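One common way to cut through noisy data is to alert on a sustained condition rather than on a single sample. The sketch below illustrates that general idea in plain Python. It is not InfluxDB's alerting logic or the configuration PagerDuty used; the window size, threshold, and sample values are assumptions chosen for the example.

```python
from statistics import mean


def should_alert(samples, threshold, window=5):
    """Return True when the mean of the last `window` samples exceeds `threshold`.

    Averaging over a window keeps a single noisy reading from raising an alert.
    """
    if len(samples) < window:
        return False  # not enough data to judge yet
    return mean(samples[-window:]) > threshold


# Hypothetical CPU usage readings (percent).
cpu_spike = [35, 38, 97, 36, 34, 37]      # one noisy sample
cpu_sustained = [40, 88, 91, 93, 95, 96]  # sustained saturation

print(should_alert(cpu_spike, threshold=85))      # False
print(should_alert(cpu_sustained, threshold=85))  # True
```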

These triggered alerts go to PagerDuty, which powers the incident management stage. PagerDuty orchestrates and aggregates all the information about the issue and determines which runbook is needed to resolve it. Then PagerDuty Process Automation executes the proper runbook.
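As an illustration of what a triggered alert might look like when it reaches PagerDuty, the sketch below builds a trigger event in the shape used by the PagerDuty Events API v2. The article does not say which integration path was used, so treat this as an assumption. The routing key, host name, and alert details are placeholders, and the code only builds the JSON body without sending it.

```python
import json

# Severity values accepted by the Events API v2 payload.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity,
                        dedup_key=None, details=None):
    """Build a trigger event body in the Events API v2 format."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Repeated alerts with the same dedup_key are grouped into one incident.
        event["dedup_key"] = dedup_key
    if details:
        event["payload"]["custom_details"] = details
    return event


# Placeholder values, not taken from the case study.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="CPU usage above 85% for 5 minutes on web-01",
    source="web-01",
    severity="critical",
    dedup_key="cpu-web-01",
    details={"mean_cpu_percent": 92.6},
)

# This JSON body would be POSTed to https://events.pagerduty.com/v2/enqueue.
body = json.dumps(event, indent=2)
print(body)
```

Attaching details such as the measured value gives responders, or an automated runbook, the context they need without another query.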

PagerDuty Process Automation not only executes the runbook, but it also sends the data generated from that event back to InfluxDB and PagerDuty so that those tools can refine their application logic and perform better in the future.
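To picture the feedback step, the sketch below formats a runbook result as a point in InfluxDB line protocol, the text format InfluxDB accepts for writes. The measurement name, tags, and fields are hypothetical; the article does not describe the actual schema or how Process Automation writes the data.

```python
def escape_key(text):
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def escape_measurement(text):
    """Escape commas and spaces in a measurement name."""
    for ch in (",", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value using line protocol type conventions."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an `i` suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tags fields timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in fields.items()
    )
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical result of one automated runbook execution.
line = to_line_protocol(
    "runbook_execution",
    {"runbook": "restart service", "host": "web-01"},
    {"duration_s": 42.5, "succeeded": True, "attempts": 1},
    1700000000000000000,
)
print(line)
# runbook_execution,host=web-01,runbook=restart\ service duration_s=42.5,succeeded=true,attempts=1i 1700000000000000000
```

Storing results this way puts remediation outcomes in the same time series store as the metrics that triggered them, so both can be charted together.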

Benefits and efficiencies

This whole process increases the level of automation in incident management, making it more efficient and effective. InfluxDB provides visualization for the data coming from both Telegraf and PagerDuty, so users get a system-wide view in a single location. This setup is extremely adaptable and extendable, due in large part to the ability of InfluxDB and Telegraf to integrate with virtually any data source. These tools provide developers with the flexibility they need and the control they want. And the use of templates makes the solution quick and easy to deploy.

By bringing together best-in-breed solutions, both open source and proprietary, PagerDuty created a solution that’s flexible for developers and meets the demands of growing data volumes, various stakeholders and escalations, and the complexities involved in infrastructure monitoring and auto-remediation. It also helped reduce the number of incidents and overall MTTR.

For more information, read the full case study.