Moving Towards AIOps with Real-Time Full Stack Visibility, Intelligent Alerting and Response Automation
By Daniella Pontes / Nov 19, 2019 / InfluxData, Community, Developer
InfluxData’s integration with PagerDuty sets you on the road to excellence in managing your Kubernetes environments.
No single playbook is guaranteed to cover every type of incident. It is therefore necessary to stay flexible and to keep learning about the environment and the human actions that address issues promptly and effectively. Kubernetes’ ephemeral pods and self-healing nature make the need for intelligent alerting even more pressing: because Kubernetes is in constant change by design, an isolated failure or error is not enough to justify triggering an alarm. This does not mean overcomplicating things with a myriad of static alerting rules that try to adjust to patterns. It is also important not to depart from “simple” when simple can do the job, as with well-understood risk thresholds in resource-constrained situations. Rather, it means embracing the analysis of historical data to be adaptive and to handle multidimensional risk scenarios and event alerting where static, isolated thresholds alone would be insufficient. Relying solely on such thresholds leaves you with two extremes: an excess of alerts, which leads to alert fatigue, or a dangerous frugality, which leads to ineffective trigger rules.
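The point that an isolated failure should not trigger an alarm can be illustrated with a minimal sketch. It assumes a hypothetical rule that fires only when a minimum number of the most recent health checks have failed; the class name, window size and failure count are illustrative choices, not part of the InfluxData or PagerDuty products.

```python
from collections import deque


class SustainedFailureRule:
    """Fire only when at least `min_failures` of the last `window` checks failed.

    Hypothetical helper for illustration: a single failed check in a
    self-healing environment is treated as noise, not as an incident.
    """

    def __init__(self, window: int = 5, min_failures: int = 3) -> None:
        if not 0 < min_failures <= window:
            raise ValueError("min_failures must be between 1 and window")
        self.min_failures = min_failures
        # deque with maxlen drops the oldest observation automatically
        self.recent: deque = deque(maxlen=window)

    def observe(self, failed: bool) -> bool:
        """Record one check result and report whether the rule should fire."""
        self.recent.append(failed)
        return sum(self.recent) >= self.min_failures


rule = SustainedFailureRule(window=5, min_failures=3)
checks = [True, False, False, True, False, True, True]
decisions = [rule.observe(failed) for failed in checks]
# Only the last observation fires: three of the five most recent checks failed.
```

A sliding window like this keeps the rule simple while still ignoring the one-off pod restarts that Kubernetes recovers from on its own.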
In modern IT operations, incidents can quickly escalate into a devastating cascade with a large business impact on both your company’s finances and its brand. Response and repair therefore cannot be delayed: every second can cost thousands of dollars. For mature workflows, actions should be automated and planned with a preemptive, prescriptive and predictive approach. That is where the integration of the InfluxData real-time, full-stack monitoring and data analytics platform and the PagerDuty incident response platform becomes powerful. This integration provides real-time visibility into technical and business insights, enabling stakeholders to act in time.
Kubernetes can benefit from high-fidelity monitoring data applied to alerting. Although Kubernetes orchestration continually works towards the declared desired state, problems could be developing in the dark. It is therefore necessary to monitor and preemptively alert on problematic workloads or environmental issues such as network load, storage and long service response times.
Real-time monitoring, intelligent alerting and orchestration of the appropriate response for each event is the road that leads to excellence in operations and user experience. Having the right solutions in place is important, and having them integrated lets you take full advantage of data-driven workflows at both the operations and business levels. Effective trigger rules that avoid false positives should draw on both kinds of goals: operations goals related to the infrastructure and application environment, and business goals addressing user experience, critical transactions and request failures. Leveraging a time series platform that provides full-stack historical data to dynamically define performance baselines is fundamental to reducing alarm noise and fatigue.
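As a rough illustration of a dynamically defined baseline, the sketch below derives a threshold from recent history instead of hard-coding it. It assumes a simple mean-plus-k-standard-deviations rule, which stands in for whatever analytics the platform actually applies; the function names and sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def dynamic_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Return mean + k * sample standard deviation of the historical values."""
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    return mean(history) + k * stdev(history)


def is_anomalous(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Flag a new value only if it exceeds the baseline derived from history."""
    return value > dynamic_threshold(history, k)


# Hypothetical response times in milliseconds from a recent window.
recent_latency_ms = [100, 102, 98, 101, 99]
threshold = dynamic_threshold(recent_latency_ms)  # roughly 104.7 ms
quiet = is_anomalous(104, recent_latency_ms)      # False: within normal variation
noisy = is_anomalous(110, recent_latency_ms)      # True: well above the baseline
```

Because the threshold is recomputed from the data, the same rule adapts as the workload's normal behavior shifts, which is what keeps alert noise down compared with a fixed limit.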
The integration of InfluxData with PagerDuty enables alert thresholds and triggers to be set dynamically using data analytics and correlation of data from multiple monitoring measurements. Indeed, advanced data analytics (such as Holt-Winters forecasting) is natively supported by the InfluxData platform. This enables detection of trends and seasonality and, moreover, shows their impact on high-level business KPIs, all of which can be used to define static and dynamic alerts. It also provides visibility into the metrics, logs, performance indicators and metadata, leading the way to a quick diagnosis and a shorter mean time to resolution (MTTR).
Simple static thresholds (for infrastructure and application metrics) must be complemented with dynamic thresholds (suitable for the fluctuations found in ephemeral and highly adaptive environments like Kubernetes, but also in business seasonality and trends). Insights derived from statistical data analytics or machine learning frameworks can provide smarter alerting triggers. Such triggers, when integrated with a modern event handling solution, can direct calls to action as well as perform automatic escalation and optimization of digital operations. That means not only metrics but also high-fidelity data must feed data analytics engines for intelligent alerting, and must be kept “fluid” to serve multiple frameworks, all the way to end-to-end response automation as the ultimate goal. InfluxData’s integration with PagerDuty empowers organizations to move towards AIOps by applying data-driven action flows from alert triggering through the entire incident lifecycle.
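To make the hand-off from an alert trigger to an event handling solution concrete, here is a minimal sketch that builds a trigger event in the shape of the PagerDuty Events API v2 and posts it with the Python standard library. It is not the InfluxData integration itself; the helper names, the routing key placeholder and the example service are assumptions for illustration.

```python
import json
import urllib.request
from typing import Optional

# Public Events API v2 endpoint; the routing key below is a placeholder.
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_event(routing_key: str, summary: str, source: str,
                severity: str = "warning",
                dedup_key: Optional[str] = None,
                details: Optional[dict] = None) -> dict:
    """Build a 'trigger' event body (hypothetical helper, not an official client)."""
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {"summary": summary, "source": source, "severity": severity},
    }
    if dedup_key:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    if details:
        # Attach the metrics that caused the trigger to speed up diagnosis.
        event["payload"]["custom_details"] = details
    return event


def send_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


event = build_event(
    routing_key="YOUR_INTEGRATION_KEY",  # placeholder, not a real key
    summary="p99 latency above dynamic baseline",
    source="checkout-service",           # hypothetical service name
    severity="critical",
    dedup_key="checkout-latency",
    details={"p99_ms": 950},
)
# send_event(event) would submit it; left uncalled so the sketch runs offline.
```

Carrying the triggering metrics in `custom_details` is one way to keep the data "fluid" across tools, so responders see the evidence alongside the page.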
About InfluxData
InfluxData is the creator of InfluxDB, the open source time series database. The technology is purpose-built to handle the massive volumes of time-stamped data produced by IoT devices, applications, networks, containers and computers. The company has more than 600 customers and is on a mission to help developers and organizations, such as Cisco, IBM, PayPal, and Tesla, store and analyze real-time data, empowering them to build transformative monitoring, analytics, and IoT applications quicker and to scale. InfluxData is headquartered in San Francisco with a workforce distributed throughout the U.S. and across Europe. For more information, visit www.influxdata.com and follow us @InfluxDB.
About PagerDuty
In an always-on world, teams trust PagerDuty to help them deliver a perfect digital experience to their customers, every time. PagerDuty is the central nervous system for a company’s digital operations. PagerDuty identifies issues and opportunities in real time and brings together the right people to respond to problems faster and prevent them in the future. From digital disruptors to Fortune 500 companies, over 12,000 businesses rely on PagerDuty to help them continually improve their digital operations so their teams can spend less time reacting to incidents and more time building for the future.