Thinking about monitoring your container environment?
When to consider Prometheus or InfluxData
Prometheus, originally built at SoundCloud in 2012, is an open-source monitoring and alerting system that includes a time-series database. The project is maintained by a small team of resourceful engineers and is used primarily in the context of modern cloud-based and containerized architectures.
In March 2016, the Cloud Native Computing Foundation was formed with Kubernetes as its anchor project, and Prometheus was added to the foundation in May of the same year. For organizations considering or actively using Kubernetes, evaluating Prometheus for monitoring makes a lot of sense given the close collaboration between projects within the foundation.
InfluxData is the commercial company behind the open source TICK stack. The TICK stack is an integrated platform made up of:
- Telegraf for metrics collection,
- InfluxDB for high-performance time-series data storage and analysis,
- Chronograf for visualization, including pre-built dashboards,
- Kapacitor for monitoring, anomaly detection, and alerting.
The TICK stack was built from the ground up to be a complete real-time monitoring and analytics platform. Developers use it to deliver an always-on, consolidated view of their data: visibility and insight into metrics, events, and log information emitted from IoT devices and sensors, legacy and modern compute infrastructure (including cloud and containers), and applications.
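As a sketch of how the components fit together, a minimal Telegraf configuration that gathers host CPU and memory metrics and writes them to a local InfluxDB 1.x instance might look like the following. The URL and database name are placeholders for your own setup, not a prescribed layout:

```toml
# Collect basic host metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

# Write to a local InfluxDB instance (placeholder URL and database name)
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"
```

From there, Chronograf can visualize the collected series and Kapacitor can subscribe to them for alerting.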
InfluxData offers two commercial products powered by the TICK stack, InfluxEnterprise and InfluxCloud, which include enterprise-grade features such as high availability and scale-out. InfluxEnterprise is deployed by customers on the platform of their choice, while InfluxCloud is our managed service offering.
In terms of our experience storing data, InfluxDB 0.8 used an underlying storage engine based on LevelDB. Operating LevelDB at scale showed that the key-value store was the cause of numerous bottlenecks and performance issues, which led to the development of a custom log-structured merge tree, the Time Structured Merge Tree (TSM), introduced with InfluxDB 0.10. The current Prometheus storage engine uses LevelDB to store the indices, while each time series is stored in an individual file. The published benchmarks for Prometheus indicate it requires a 32 CPU/64 GB RAM node to write 500K metrics/s across 1.4M series; the same workload on InfluxDB 1.x can be accomplished with an 8 CPU/16 GB RAM node.
One criticism of many time-series databases is that series keys must be kept in memory. Monitoring ephemeral sources such as per-container or per-process metrics can generate a very large number of series keys, a situation known as high-cardinality time series. Prometheus and InfluxDB both currently use this in-memory design. As a result, even when using Recording Rules in Prometheus or the equivalent capability in InfluxDB (Continuous Queries, or CQs) to aggregate and/or downsample data, gathering, storing, and analyzing a large number of metrics may force the user to define aggressively short data retention policies. Where the data has high cardinality, the series keys can exceed the memory available to the database unless the amount of stored data is regularly reduced.
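To see how quickly cardinality grows in a containerized environment, consider the arithmetic: the number of series keys grows multiplicatively with tag values, and container churn keeps minting new keys that remain in the index. All figures below are hypothetical, purely for illustration:

```python
# Sketch: series cardinality grows multiplicatively with tag values.
# All numbers here are made up for illustration only.

hosts = 200                 # nodes in the cluster
containers_per_host = 50    # ephemeral containers per node
metrics_per_container = 20  # distinct measurements per container

# Live series at any moment
series = hosts * containers_per_host * metrics_per_container
print(series)  # 200000

# Churn makes it worse: if every container is replaced hourly, each new
# container_id mints a fresh set of series keys that stays in the index.
new_containers_per_day = hosts * containers_per_host * 24
total_series_per_day = series + new_containers_per_day * metrics_per_container
print(total_series_per_day)  # 5000000
```

Five million series keys per day of index growth is exactly the pressure that drives the short retention policies described above.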
InfluxData has been developing a time-series index that addresses the high-cardinality issue. Nothing similar appears on the Prometheus roadmap; however, InfluxDB does appear to be under consideration as a potential solution to this challenge.
Choosing a monitoring solution
When selecting a monitoring solution, it is critical to understand both your near-term and longer-term requirements to avoid unnecessary rework and frustration.
While there are many factors that go into choosing a solution, here are five important ones:
- Availability and Ease of Scale Out
- Value of Historical Analysis
- Regularity of Data
- Extensibility
- Support Options
Availability and Ease of Scale Out
Both InfluxDB and Prometheus can be set up with multiple instances of the underlying database to ensure high availability, populating two separate database instances with the same metric content. But running multiple instances increases the complexity of the overall monitoring solution, introduces the possibility of data inconsistency between the instances, and adds to the list of things you need to monitor.
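The consistency problem with dual-writing can be sketched in a few lines. Here two in-memory stores stand in for two independent database instances; the classes are hypothetical, not a real client API. A single dropped write during a brief outage is enough to make the copies diverge:

```python
# Sketch: dual-writing to two independent instances can silently diverge.
# FakeStore stands in for a database instance; not a real client library.

class FakeStore:
    def __init__(self):
        self.points = []
        self.healthy = True

    def write(self, point):
        if not self.healthy:
            raise ConnectionError("instance unreachable")
        self.points.append(point)

def dual_write(stores, point):
    # Best-effort fan-out: a failure on one instance is skipped,
    # which is exactly how the two copies drift apart.
    for store in stores:
        try:
            store.write(point)
        except ConnectionError:
            pass

a, b = FakeStore(), FakeStore()
dual_write([a, b], {"cpu": 0.42, "ts": 1})
b.healthy = False                      # brief outage on instance B
dual_write([a, b], {"cpu": 0.97, "ts": 2})
b.healthy = True
dual_write([a, b], {"cpu": 0.55, "ts": 3})

print(len(a.points), len(b.points))  # 3 2 -- the instances now disagree
```

Reconciling the two copies after the fact is the operational burden the paragraph above describes, and the gap clustering is meant to close.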
InfluxData believes that clustering is a better solution to address this challenge and this functionality is included in both InfluxEnterprise and InfluxCloud. Clustering the database delivers:
- high availability – eliminating single points of failure,
- reduced complexity in setup, administration, data consistency, and the ongoing management and maintenance of multiple disconnected database targets,
- easy scale out capability.
Clustering provides automated replication, rebalancing, and failover, while delivering data consistency across the cluster and zero-downtime upgrades. For enterprises that fundamentally require 24x7 visibility into their infrastructure (such as SaaS providers), clustering is essential: it can absorb disk failures, network card outages, SSD failures, and infrastructure upgrades with zero downtime.
If you need additional compute and/or storage capacity, it is trivial to add a new node to the existing cluster and continue to scale over time, with no configuration changes.
With Prometheus, if the volume of metrics being collected increases and creates performance bottlenecks within the database (for either writes or queries), the recommended approach is to segment metric collection across additional, separate instances of the underlying database, further increasing the complexity of configuration and maintenance. This complexity makes the data harder to leverage, as users may not know which instance to query for the metrics they need.
Value of Historical Analysis
Monitoring data is valuable in both production and non-production environments. While the value of storing monitoring data generally decreases over time, it is critical to determine what your requirements are and how this data needs to be leveraged. Three of the most popular use cases associated with leveraging historical monitoring data are:
- Capacity Planning: While the precision of the monitoring data can be reduced over time, there is still tremendous value in leveraging historical information to determine whether the capacity you have provisioned is sufficient for your needs.
- Performance Regressions: As new versions of systems roll out into production, it is important to be able to spot performance regressions relative to previous versions.
- Root-Cause Analysis: The ability to browse a larger data set and visually correlate repeating or potentially interrelated issues across a series of metrics can be extraordinarily valuable in isolating a root cause and addressing it in a timely manner.
Regularity of Data
When dealing with time-series data, regularity is about whether a set of metrics arrives at regularly or irregularly spaced time periods. Gathering a set of metrics every 3 seconds is an example of collection at a regular interval. Metrics generated at irregularly spaced times include the startup of a container or the HTTP requests between microservices. Irregular time series are event-driven, while regular series are samples.
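A quick way to see the distinction is to compare the gaps between timestamps: a sampled series has evenly spaced points, while an event-driven series does not. A minimal sketch, with made-up sample data:

```python
# Sketch: classify a series as regular (sampled) or irregular (event-driven)
# by checking whether consecutive timestamps are evenly spaced.

def is_regular(timestamps, tolerance=0.0):
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return max(gaps) - min(gaps) <= tolerance

# CPU sampled every 3 seconds -> regular
cpu_samples = [0, 3, 6, 9, 12]
print(is_regular(cpu_samples))       # True

# Container start events arrive whenever they happen -> irregular
container_starts = [0.0, 1.4, 1.5, 42.8, 43.0]
print(is_regular(container_starts))  # False
```

A sampled series can be compressed and downsampled aggressively because its spacing is predictable; event series must keep every point, which is why storing them well matters.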
Both Prometheus and InfluxDB excel at storing metrics collected at regularly spaced intervals, but InfluxDB is designed to also store arbitrary time-stamped events along with a rich set of metadata. In a growing number of monitoring use cases, including microservices, ephemeral containers, cloud-based infrastructures, and serverless architectures such as AWS Lambda, events are the critical link in understanding what is happening within and across these interrelated systems. InfluxDB provides a clear advantage in this regard.
Extensibility
Monitoring solutions also need to be extensible for your unique environment: for example, the ability to quickly create custom collectors or plugins for bespoke systems, or to set up integrations with other operational and legacy infrastructure.
One particular area of extensibility that continues to receive a lot of attention is alerting. Prometheus's Alertmanager is a capable alerting system with much of the same functionality as InfluxData's Kapacitor. However, Kapacitor can load custom functions to manipulate the metrics themselves, allowing more sophisticated business logic to be injected into the alerting process. InfluxData also has deeper integration with Kubernetes and is capable of much more than simple alerting: for example, capturing event-based metrics enables Kubernetes auto-scaling, something Prometheus isn't capable of delivering. Kapacitor also delivers predictive alerting via Holt-Winters forecasting, which can help head off potential failure scenarios before they occur in your environment. InfluxData supports machine learning systems like TensorFlow, as described in this video.
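The idea behind predictive alerting can be sketched with Holt's linear method (double exponential smoothing), the trend component underlying Holt-Winters. This is an illustration of the concept, not Kapacitor's implementation, and the smoothing constants and disk-usage figures are arbitrary:

```python
# Sketch: Holt's linear method (level + trend smoothing), the core idea
# behind Holt-Winters-style predictive alerting. Parameters are arbitrary.

def holt_forecast(series, horizon, alpha=0.5, beta=0.5):
    level = series[1]
    trend = series[1] - series[0]
    for x in series[2:]:
        prev_level = level
        # Blend the new observation with the previous level-plus-trend
        level = alpha * x + (1 - alpha) * (level + trend)
        # Update the trend estimate from the change in level
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

# Disk usage (GB) growing ~2 GB per interval; forecast 5 intervals ahead
disk_gb = [100, 102, 104, 106, 108]
predicted = holt_forecast(disk_gb, horizon=5)
print(round(predicted, 1))  # 118.0

if predicted > 115:
    print("predictive alert: disk projected to exceed 115 GB")
```

The alert fires on the projected value rather than the current one, which is what lets an operator act before the disk actually fills.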
Support Options
InfluxData, being a commercial entity, offers a variety of support options, from free and open collaboration within the community to paid support offerings and a fully managed, cloud-based instance (InfluxCloud) providing real-time monitoring and analytics. We have a vested interest in ensuring the community and our customers are successful with the TICK stack.
Prometheus, as a CNCF project, manages issues on a case-by-case basis via GitHub. Troubleshooting and feature requests are left to the foundation members to address, typically on a best-effort basis. Commercial support for Prometheus is currently limited to a couple of consulting firms and independent contractors.
While Prometheus and InfluxData can both be used to effectively monitor modern cloud-based and containerized architectures, InfluxData's integrated platform of Telegraf, InfluxDB, Chronograf, and Kapacitor is the clear choice if any of the following requirements are important to you:
- High availability or the ability to meet service level agreements around uptime
- Data consistency between underlying data stores
- Ability to store historical data for root-cause analysis, capacity planning, or performance regressions
- Ability to handle time-stamped events along with a rich set of metadata
- Peace of mind in having a commercial entity supporting your deployment
- Ability to extend the platform with custom alerting, integration with machine learning for predictive analytics and anomaly detection, and support for legacy systems and other ad hoc monitoring sources