Building a Scalable End-to-End Host Monitoring Solution with InfluxDB Enterprise
By Chris Churilo / Jan 10, 2020 / Community, Developer, InfluxDB Enterprise
“If you can’t measure something to get results, you can’t possibly get better at it. Worse yet, you won’t know what you should be focusing on,” says Dennis Brazil, Sr. Engineering Manager, SRE Monitoring at PayPal. Brazil and his team needed a scalable end-to-end host monitoring solution to keep pace with the company’s infrastructure modernization to a container-based architecture.
The new monitoring solution, which would replace the company’s antiquated monitoring systems, needed to work with containers and to provide metrics collection, storage, alerting and visualization all at once, since the team preferred a single-vendor platform. They realized that a time series database had to be central to this new solution because “time series data helps us make educated, data-driven decisions quickly. It is what keeps us in business,” says Brazil.
PayPal chose InfluxData’s InfluxDB Enterprise and leveraged all components of InfluxData’s platform to build a solution using Telegraf aggregators, message queues, and publishers in order to control data payload size, manage message flow, and avoid single points of failure (SPOF).
Here’s an overview of why PayPal chose InfluxDB Enterprise and how they used it in their host monitoring solution.
In search of a host monitoring solution
PayPal, whose platform enables digital and mobile payments on behalf of consumers and merchants in more than 200 markets worldwide, sought a scalable host monitoring solution that would keep pace with the company’s dynamic and ever-expanding infrastructure.
PayPal has nine data centers, with 30,000 instances each, and each data center has its own clusters. The company was migrating all of its old applications, some of them 20 years old and compiled in C++, into containers and onto more modern operating systems, with many of its Docker hypervisors hosting 50 to 100 containers at once.
Host monitoring solution requirements
For their new host monitoring solution, PayPal set four technical requirements and wanted to meet them through one vendor:
- A reliable and extensible agent sitting on all systems to monitor basic OS metrics (such as CPU, disk and memory) as well as third-party applications and databases
- Time series database backend for reporting history
- Ability to monitor multiple Docker containers with a single agent (critical to keep the agent's overhead down across the whole system)
- Smart alerting based on time series data
Building a host monitoring solution using InfluxDB Enterprise
InfluxData’s InfluxDB Enterprise, which turns any InfluxDB instance into a production-ready cluster that can run anywhere, provided an end-to-end solution from one vendor since it includes all the components of InfluxData’s platform. By selecting InfluxDB Enterprise, PayPal gained all-in-one metrics collection, storage, alerting and visualization functionality.
- Telegraf provides an extensible plugin-based architecture for monitoring all operating systems, applications, and Docker containers.
- InfluxDB provides a fast, scalable time series database.
- Chronograf has a user interface with an intuitive data explorer and query builder.
- Kapacitor provides smart alerting capabilities.
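As a rough illustration of what alerting on time series data involves, the following Python sketch raises an alert when the mean of a metric over a sliding time window crosses a threshold, and clears it when the mean drops back. This is not Kapacitor's API or TICKscript; the class `WindowedThresholdAlert` and its parameters are hypothetical and only show the idea of evaluating a window rather than a single sample.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class WindowedThresholdAlert:
    """Fire when the windowed mean exceeds a threshold (illustrative only)."""

    def __init__(self, window_seconds: float, threshold: float) -> None:
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.firing = False
        self._points: Deque[Tuple[float, float]] = deque()

    def observe(self, timestamp: float, value: float) -> Optional[str]:
        """Add a point; return 'CRITICAL' or 'OK' on a state change, else None."""
        self._points.append((timestamp, value))
        # Drop points that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._points[0][0] <= cutoff:
            self._points.popleft()
        mean = sum(v for _, v in self._points) / len(self._points)
        was_firing = self.firing
        self.firing = mean > self.threshold
        if self.firing and not was_firing:
            return "CRITICAL"
        if was_firing and not self.firing:
            return "OK"
        return None


# CPU usage sampled every 30 seconds, alerting on a 60-second mean above 90%.
alert = WindowedThresholdAlert(window_seconds=60, threshold=90.0)
events = [alert.observe(t, v)
          for t, v in [(0, 50.0), (30, 100.0), (60, 100.0), (90, 100.0), (120, 40.0)]]
```

Averaging over a window means a single spike does not page anyone, while a sustained rise does, and the alert only emits an event when its state changes.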
PayPal also valued the deployment simplicity, scalability and customization that InfluxData’s architecture allows as well as InfluxData’s technical support.
In their new host monitoring solution, PayPal used Telegraf aggregators, message queues, and publishers. Their journey to scalability involved three iterations to reach their current solution, which is shown below.
Figure: Technical architecture using InfluxDB Enterprise and Telegraf agents
In PayPal’s new host monitoring solution:
- Message queues (MQs) prevent data loss when the database is unavailable by retaining data until the consumers have consumed it.
- Smart publishers watch for back-pressure and back off until the cluster is available.
- The new setup avoids an immediate single point of failure (SPOF) by using a replication factor of 3 across more data nodes (the publishers do not consume or publish messages until the database is available again).
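The queue-and-publisher pattern described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not PayPal's implementation: the function `publish_with_backoff`, the `DatabaseUnavailable` exception and the `write_batch` callable are hypothetical stand-ins for a real message queue consumer and a real database client.

```python
import queue
import time
from typing import Callable, List


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) writer when the cluster cannot accept writes."""


def publish_with_backoff(
    mq: "queue.Queue[str]",
    write_batch: Callable[[List[str]], None],
    batch_size: int = 500,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Drain the queue in bounded batches, backing off while writes fail.

    Returns the number of messages written once the queue is empty.
    """
    written = 0
    delay = base_delay
    batch: List[str] = []
    while True:
        # Pull new messages only when no failed batch is waiting for a retry,
        # so nothing is consumed while the database is down.
        if not batch:
            while len(batch) < batch_size:
                try:
                    batch.append(mq.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return written
        try:
            write_batch(batch)
        except DatabaseUnavailable:
            sleep(delay)  # back off, then retry the same batch
            delay = min(delay * 2, max_delay)  # exponential, capped
            continue
        written += len(batch)
        batch = []
        delay = base_delay  # reset after a successful write
```

Capping `batch_size` bounds the payload sent per write, and holding the failed batch while sleeping means messages stay in the queue rather than being lost while the database is unavailable.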
Using InfluxDB Enterprise, PayPal built a resilient monitoring solution that works at scale, and in the process, they derived several conclusions regarding best practices for scaling InfluxDB Enterprise clusters.
Learn more by reading the full case study.