Building a Metrics & Alerts as a Service (MaaS) Monitoring Solution Using the InfluxDB Stack
By Chris Churilo / Jun 25, 2020 / InfluxDB, Community, Developer
The larger an enterprise becomes, the more systems and applications it has to monitor, and the more scalable its monitoring system must be to keep up with business growth. This is the challenge that RingCentral, which provides cloud-based communications and collaboration solutions for businesses, faced and solved.
Using InfluxDB Enterprise, Kapacitor, and Telegraf, RingCentral built a monitoring solution that supports visibility, integrated configuration and alerting for operations efficiency, and quick DevOps cycles for the four pillars of its product (Cloud PBX, contact center, video and meetings, and team messaging) as well as the functionalities built on top of these pillars. The monitoring solution that RingCentral built also provides Metrics & Alerts as a Service (MaaS) to its developers and operations engineers through a DIY framework so that they can self-service their monitoring needs.
Here are some highlights from RingCentral’s solution development journey, beginning with the nature of their business and the problem they set out to solve.
Gaining visibility through Metrics and Alerts as a Service
RingCentral, a global California-based IP telecom company, works with its customers to reimagine business communications and collaboration. In a demanding industry where customers have zero tolerance for downtime, RingCentral is providing reliable solutions, with collaborative communications at the heart of everything it does.
As RingCentral grew and as its IT infrastructure became more complex, tracking and understanding the information within the IT environment became increasingly vital. “Our monitoring team, currently, is very small versus the engineering team which is growing because we have to address all our business needs and all new features,” says Yuri Ardulov, Principal System Architect at RingCentral.
Despite the small size of the monitoring team, the company’s monitoring infrastructure had to cope with its growing operations and telecom services business.
Additionally, RingCentral set a goal of streamlining its processes to manage development, configuration changes, and metrics and events collection more effectively across its ever-growing application environment (which consists of 400+ different “homemade” components continuously developed by a team of 1,500 developers).
RingCentral decided to provide a programmable way for developers and operations engineers to self-service their monitoring needs: monitoring of their “homemade” systems and their operational layer. The monitoring team set out to provide a platform and tool set that other teams could use to send metrics and set up the alerts they care about.
Monitoring solution with HA and metrics granularity
To achieve their do-it-yourself (DIY) framework, certain technical requirements had to be met:
- Alerting and dashboarding as code
- Send application metrics without structure requirements
- Horizontally scalable infrastructure with high-availability clusters
- Sandbox to test new code before it gets released
- No hard limitations on cardinality or type of metrics
- Fully automated service integrating with deployment systems
- No single point of failure
At the time, they were using a Zabbix monitoring tool set but had outgrown its capacity and needed to replace it with a solution that provides high availability (HA) and metrics granularity. As a first step, they migrated to the open source InfluxDB platform. After initial evaluation, they deployed:
- InfluxDB Enterprise to handle their metrics and event volume growth (the Enterprise edition of InfluxDB provided the high availability, scalability, and metrics granularity that Zabbix lacked)
- Telegraf as the agent installed in every host (physical or virtual) to collect monitoring data
- A Kapacitor pool to meet their alerting requirements without downtime (so no trigger event would pass unnoticed)
- An in-house built Kapacitor Manager to manage their pool of Kapacitor instances
InfluxDB as RingCentral's Metrics & Alerts as a Service platform
To meet the DIY technical requirements above, RingCentral introduced what they called a “Service Manifest” to their entire engineering organization. Through the manifest, developers and operations engineers can express their metrics and alert requirements as code.
The manifest is compiled by RingCentral’s system and propagates along with the code itself. It passes through each stage in turn and then enters the production system automatically, without any separate installation or configuration.
<figcaption> Trigger Manifest example</figcaption>
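The article does not reproduce the manifest format itself. As a rough illustration of the alerts-as-code idea only, the sketch below assumes a hypothetical `TriggerManifest` record (its field names are invented for this example, not RingCentral's schema) and renders it into a Kapacitor TICKscript stream alert, which is the kind of artifact a manifest compiler could plausibly emit.

```python
from dataclasses import dataclass

# Hypothetical manifest entry: the field names below are assumptions made
# for illustration and are not taken from RingCentral's Service Manifest.
@dataclass(frozen=True)
class TriggerManifest:
    service: str
    measurement: str
    field: str
    operator: str      # one of "<", ">", "<=", ">="
    critical: float


ALLOWED_OPERATORS = {"<", ">", "<=", ">="}


def compile_to_tickscript(m: TriggerManifest) -> str:
    """Render one manifest entry as a TICKscript stream alert."""
    if m.operator not in ALLOWED_OPERATORS:
        raise ValueError(f"unsupported operator: {m.operator!r}")
    return "\n".join([
        "stream",
        "    |from()",
        f"        .measurement('{m.measurement}')",
        f"        .where(lambda: \"service\" == '{m.service}')",
        "    |alert()",
        f"        .id('{m.service}/{m.measurement}/{m.field}')",
        f"        .crit(lambda: \"{m.field}\" {m.operator} {m.critical})",
    ])


if __name__ == "__main__":
    manifest = TriggerManifest("billing", "http_requests", "error_rate", ">", 5.0)
    print(compile_to_tickscript(manifest))
```

Keeping the manifest as plain data means it can be reviewed, versioned, and promoted through environments alongside the service code, which matches the release flow described above.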
RingCentral's monitoring solution architecture
RingCentral has separate development, performance testing, and staging environments. The chart below shows the design of the system, which they plan to install in each of these environments:
<figcaption> Design of RingCentral’s monitoring solution</figcaption>
- The InfluxDB Enterprise cluster sits behind HAProxy, which balances all incoming metrics across it.
- Telegraf is installed on each of the hosts (virtual or physical) to collect all metrics.
- Through the deployment system, the Service Manifest is compiled during the release cycle into the appropriate configuration, which is delivered to the host.
- All data needed for alerting is delivered to the Kapacitor instances.
Because one of the requirements was no single point of failure, each task runs on two different Kapacitor instances, so the alert is duplicated if necessary. As a result, they have a Kapacitor pool that handles 50K+ triggers per environment.
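The article does not say how tasks are placed on the pool. One plausible way to pick two different Kapacitor instances per task is rendezvous (highest-random-weight) hashing, sketched below; the function name and the `kap-N` node names are assumptions made for this example.

```python
import hashlib


def assign_replicas(task_id: str, nodes: list[str], replicas: int = 2) -> list[str]:
    """Pick `replicas` distinct nodes for a task using rendezvous hashing.

    Illustrative placement scheme only; the source does not describe how
    RingCentral actually distributes tasks across its Kapacitor pool.
    """
    unique_nodes = set(nodes)
    if len(unique_nodes) < replicas:
        raise ValueError("not enough distinct nodes for the requested replicas")

    def score(node: str) -> str:
        # Each (node, task) pair gets its own pseudo-random weight.
        return hashlib.sha256(f"{node}|{task_id}".encode()).hexdigest()

    # The highest-scoring nodes win; the result does not depend on list order.
    return sorted(unique_nodes, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    pool = ["kap-1", "kap-2", "kap-3", "kap-4"]
    print(assign_replicas("cpu_high", pool))
```

With this scheme, removing a node that was not chosen for a task leaves that task's placement unchanged, so only tasks on a failed node need to be rescheduled.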
Building Kapacitor Manager in-house
To meet their alerting requirements for the Kapacitor pool, RingCentral designed Kapacitor Manager.
<figcaption> Kapacitor Manager (KM): functional description</figcaption>
Kapacitor Manager consists of a few instances built around a persistent queue. It provides three types of APIs, as shown above: a task management API, a Kapacitor nodes management API, and a Jobs API.
Several queue workers pick up tasks from the queue and execute them, and the queue itself is persistent. RingCentral is planning to run Kapacitor Manager in each of their locations.
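To make the persistent-queue-plus-workers pattern concrete, here is a minimal sketch that uses SQLite as a stand-in for the queue storage. The article does not describe Kapacitor Manager's internals, so the `PersistentQueue` class, its table layout, and the job payload strings are all assumptions made for illustration.

```python
import sqlite3


class PersistentQueue:
    """Minimal SQLite-backed job queue (a stand-in, not Kapacitor Manager's design)."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " payload TEXT NOT NULL,"
            " state TEXT NOT NULL DEFAULT 'pending')"
        )
        self.db.commit()

    def enqueue(self, payload: str) -> int:
        """Store a job on disk and return its id."""
        cur = self.db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
        self.db.commit()
        return cur.lastrowid

    def claim(self):
        """Mark the oldest pending job as running and return (id, payload), or None."""
        with self.db:  # commits on success, rolls back on error
            row = self.db.execute(
                "SELECT id, payload FROM jobs WHERE state = 'pending' ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            # The state guard ensures a job already claimed elsewhere is not claimed twice.
            cur = self.db.execute(
                "UPDATE jobs SET state = 'running' WHERE id = ? AND state = 'pending'",
                (row[0],),
            )
            return row if cur.rowcount == 1 else None

    def complete(self, job_id: int) -> None:
        """Mark a claimed job as done."""
        with self.db:
            self.db.execute("UPDATE jobs SET state = 'done' WHERE id = ?", (job_id,))
```

Because jobs live on disk rather than in process memory, a worker or manager instance can restart and pending work is still there for another worker to pick up.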
Detecting and generating the change event
RingCentral’s OCP system is a single-core processing system with very low performance and poor scalability. This required them to build, around their Kapacitor pool, a mechanism for detecting the event itself (that is, the change in the event status). They also needed to reduce the load on the event processor and therefore decided to keep the status of events inside InfluxDB.
So they deployed two databases inside their InfluxDB cluster: the metrics DB, where all the metrics are collected, and the events DB. Their Kapacitor nodes, as a result, split into two categories. In this throttling solution, InfluxDB and Kapacitor regulate the event flow so that it does not choke the complex event processing (CEP) system, and they provide a safety net for event recording.
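The core of this throttling idea is to forward an event only when an alert's status actually changes, instead of on every evaluation. The sketch below shows that logic with an in-memory dictionary standing in for the events DB; the function name, the status strings, and the assumption that an unseen alert starts as `"OK"` are illustrative and not taken from the source.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (alert_id, status), for example ("db-latency", "CRITICAL")
Reading = Tuple[str, str]


def changed_events(
    readings: Iterable[Reading],
    last_status: Dict[str, str],
) -> Iterator[Tuple[str, str, str]]:
    """Yield (alert_id, previous, current) only when the status changes.

    `last_status` plays the role of the events DB here: it remembers the
    most recent status per alert. Unknown alerts are assumed to start "OK".
    """
    for alert_id, status in readings:
        previous = last_status.get(alert_id, "OK")
        if status != previous:
            last_status[alert_id] = status
            yield alert_id, previous, status


if __name__ == "__main__":
    store: Dict[str, str] = {}
    stream = [
        ("db-latency", "OK"),
        ("db-latency", "CRITICAL"),
        ("db-latency", "CRITICAL"),  # repeated status, not forwarded
        ("db-latency", "OK"),
    ]
    for event in changed_events(stream, store):
        print(event)
```

Repeated evaluations with the same status are dropped before they reach the downstream event processor, so its load scales with the number of state transitions rather than the number of checks.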
A monitoring infrastructure to keep pace with business growth
Since migrating to InfluxDB, RingCentral’s monitoring system has come a long way. Their scalability model needs to match the dynamic nature of their application environment, and their choice of the InfluxDB platform has served them well, since its components can be combined with other modern systems such as Kubernetes.
Choosing InfluxDB Enterprise has freed RingCentral’s monitoring infrastructure to keep pace with its business growth. This choice, in combination with deploying Telegraf for metrics collection and Kapacitor for alerting, enabled their small monitoring team to build Kapacitor Manager and achieve an automated real-time alerting system for their complex operations. To learn more about this compelling DevOps monitoring use case, read the full case study.
If you’re interested in sharing your InfluxDB story, click here.