Botmetric Journey: Choosing a Real Time Metrics DataStore That Works 

This blog post was originally posted on Medium - posted here with permission from Vijay Rayapati of Botmetric.


Botmetric Journey: Choosing a Real Time Metrics DataStore That Works (From OpenTSDB, Cassandra To InfluxDB)

We started Botmetric in 2014 with a mission to offer “Intelligent cloud management for modern DevOps world.” We deliver an intelligent operations platform for cloud cost management, cloud security compliance and DevOps automation. The platform uses metadata from various operational sources such as cloud providers, monitoring tools and logs, and applies algorithms to help businesses make decisions that let them operate efficiently in the cloud. With Botmetric, cloud customers reduce their overall cost, improve their security posture and automate day-to-day operations so their engineers can focus on business problems instead of solving every cloud operations and DevOps problem.

To deliver the promised value to our customers through smart insights and adaptive automation, we collect a lot of time series data from different sources, derive actionable insights from it and use it to improve operational productivity for DevOps teams. In this post, we want to share how we tried and deployed technology stacks like OpenTSDB and Cassandra before settling on InfluxDB as one of our core data storage systems.

Our First Attempt (Circa 2014) - OpenTSDB

We were looking for a time series database solution that could scale and work well with our Java stack. We zeroed in on OpenTSDB after learning about the StumbleUpon use case and listening to several tech talks at the Monitorama conference. Some of our engineering team members had experience using HBase in production, so we were ready to try it out and deploy it for our use case.

While we liked the scalability aspect of OpenTSDB, operating this system in production as our main data source became a headache because we had to manage the Hadoop, HBase, ZooKeeper and OpenTSDB layers. With so many moving parts in the OpenTSDB stack and many hours spent debugging production issues, it took more of our engineering team’s time than anticipated. Within 6 months, we realized that it was not the right fit for a small and nimble team like ours. Data aggregation was another major issue that slowed down our development speed as we were building dashboards with cloud insights, cost analytics, etc. Furthermore, the lack of reliable failover in HBase in 2014 caused data availability issues for us, with hours of downtime in production, so we had to call it quits despite all our efforts to stabilize the system. On top of the operational issues, we didn’t have a reliable client library in Python as we started moving development of new Botmetric modules away from the Java stack, which forced us to look for an alternative.

Our Second Attempt (Circa Late 2014 and Early 2015) - Cassandra & KairosDB

In late 2014, we decided to move away from OpenTSDB and shortlisted Cassandra & KairosDB as the alternative for storing time series data. We liked the quick setup and the lower operational burden compared to OpenTSDB. We are also thankful to Datastax for providing us with an enterprise edition under their startup license program. Cassandra offered us mature client libraries for easier integration. We had two use cases where we stored time series data using KairosDB, and another one where data was loaded directly into Cassandra in a separate namespace. The initial PoC results were positive: the stack had good support for various data collectors and for basic data aggregators, so we could quickly build new reporting features.

While Cassandra worked for us until early 2016, we had our share of challenges, including issues with batch loading of data into Cassandra. In addition, as we acquired larger customers with lots of data, the Cassandra clusters had to be scaled vertically with high-end machines and horizontally with more nodes. We were ingesting hundreds of millions of records every day into this data store and still doing application-level data aggregation on top of Cassandra using CQL, which was a time-consuming exercise for most of our engineers. We even wrote a small utility to archive stale data out of Cassandra to reduce query latency. Our product roadmap included custom reporting within Botmetric, where end users would be able to build any kind of report with a DIY interface, and the use cases around reporting, slicing and dicing of data and custom reporting requests kept growing. So we were again facing a new set of problems with building things faster, and our sprint productivity was a major issue.
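To illustrate what application-level aggregation of this kind can look like, here is a minimal Python sketch. It is not our production code: the `hourly_average` helper and the sample rows are hypothetical, and the rows simply stand in for whatever a CQL range query would return. The point is that the bucketing and averaging live in application code rather than in the data store.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_average(rows):
    """Group (timestamp, value) rows into hourly buckets and average each bucket."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket -> [running total, row count]
    for ts, value in rows:
        # Truncate the timestamp to the start of its hour to get the bucket key.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        acc = sums[bucket]
        acc[0] += value
        acc[1] += 1
    return {bucket: total / count for bucket, (total, count) in sorted(sums.items())}


# Hypothetical sample rows, standing in for the result of a CQL range query.
rows = [
    (datetime(2015, 6, 1, 10, 5, tzinfo=timezone.utc), 40.0),
    (datetime(2015, 6, 1, 10, 35, tzinfo=timezone.utc), 60.0),
    (datetime(2015, 6, 1, 11, 10, tzinfo=timezone.utc), 30.0),
]
averages = hourly_average(rows)
for bucket, avg in averages.items():
    print(bucket.isoformat(), avg)
```

Every new report dimension (per day, per account, per resource type) tends to need another variant of a helper like this, which is where the engineering time goes.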

We wanted to migrate the major part of the data that is primarily used by end users for insights and ad hoc reporting on multiple dimensions into a search store, so we could reduce our dependency on Cassandra for it. Our primary choice for this was Elasticsearch. From late 2015, we started moving a lot of metadata around cloud infrastructure, billing and usage records, etc. into it for easier and faster querying from our SaaS platform.

While data stores are the core engines of what we do, the Botmetric SaaS platform is an amalgamation of 25+ microservices that focus on specific use cases within our product platform, including our common platform services. As our SaaS platform grew with more customers and was decoupled into a microservices-based architecture, we needed to stream data from our microservices, such as component usage and monitoring metrics, into a reliable store that could be used for Botmetric operational insights, apart from supporting the time series data needs of our product use cases.

Finally A Real Time Data Store That Works (Circa Late 2015 and Early 2016) - InfluxDB

The Botmetric platform collects lots of data of different natures (real time, batch pulls like crawlers, pull-based and push-based event data, etc.) and at different granularities (minutes, hours, daily, weekly, monthly). Despite using Cassandra and KairosDB for over a year in production, our search for a reliable time series and real-time data store was not over.

One of the unique differentiators of the Botmetric platform compared to other SaaS tools is our powerful automation framework, which lets DevOps teams invoke automated actions based on either real-time events or scheduled workflows. Currently we execute thousands of jobs every day for our customers to handle their tasks, and this is expected to reach millions of tasks as we scale our large enterprise customer acquisition. The metadata around all Botmetric automations must be tracked so we can notify end users and provide visibility into what’s done and what’s not. Our Technical Architect, Yuvaraj, had been an internal champion for InfluxDB since early 2015, and we had deployed the InfluxData TICK stack with Grafana for monitoring our microservices events (why we moved away from Datadog to the TICK stack after adopting a microservices architecture and Docker in our production environment is another story).
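As a rough illustration of how automation job metadata can be expressed as time series points, the sketch below formats one job event in InfluxDB line protocol (`measurement,tags fields timestamp`). The measurement name `automation_job`, the tag and field names and the helper functions are all hypothetical, chosen for illustration rather than taken from our actual schema, and the escaping covers only the common cases.

```python
def escape_key(text):
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def escape_measurement(text):
    """Escape commas and spaces in a measurement name."""
    return text.replace(",", "\\,").replace(" ", "\\ ")


def format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol record with tags and fields sorted by key."""
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical job event: names and values are assumptions for illustration only.
line = to_line(
    "automation_job",
    {"job_type": "stop_instances", "status": "success"},
    {"duration_ms": 1250, "retries": 0},
    1456790400000000000,
)
print(line)
# automation_job,job_type=stop_instances,status=success duration_ms=1250i,retries=0i 1456790400000000000
```

Low-cardinality attributes such as job type and status fit naturally as tags, which are indexed, while measured values such as duration go into fields.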

The amazing thing about InfluxDB is its simplicity, ease of use, support for various client libraries and great aggregation capability for querying, all without the operational overhead of OpenTSDB or KairosDB+Cassandra. We loved what the TICK stack deployment did for our SaaS platform metrics collection and events monitoring. In addition, InfluxDB brings simplicity from development to production operations and reduces overhead for our engineering teams, as it’s a simple package without the complexity of multiple layers of management. Querying and aggregating data in InfluxDB was much easier than with Cassandra CQL, and the auto-expiry support for certain datasets further reduces the DevOps effort previously required to clean up old data using separate utilities.
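To make the aggregation and auto-expiry points concrete, here is a small Python sketch that builds two InfluxQL statements as plain strings: a `GROUP BY time(...)` query that averages a field per time bucket, and a `CREATE RETENTION POLICY` statement that lets the database expire old data on its own. The syntax follows InfluxQL as documented for InfluxDB 1.x; the helper functions and the measurement, database and policy names are hypothetical. The helpers do no identifier escaping, so they are not suitable for untrusted input.

```python
def hourly_mean_query(measurement, field, lookback="1d", interval="1h"):
    """Build an InfluxQL query that averages a field per time bucket."""
    return (
        f'SELECT MEAN("{field}") FROM "{measurement}" '
        f"WHERE time > now() - {lookback} GROUP BY time({interval})"
    )


def retention_policy_statement(name, database, duration):
    """Build an InfluxQL statement that expires data older than `duration`."""
    return (
        f'CREATE RETENTION POLICY "{name}" ON "{database}" '
        f"DURATION {duration} REPLICATION 1"
    )


# Hypothetical names, for illustration only.
query = hourly_mean_query("cpu_usage", "value")
policy = retention_policy_statement("thirty_days", "metrics", "30d")
print(query)
# SELECT MEAN("value") FROM "cpu_usage" WHERE time > now() - 1d GROUP BY time(1h)
print(policy)
# CREATE RETENTION POLICY "thirty_days" ON "metrics" DURATION 30d REPLICATION 1
```

The bucketing that otherwise has to be written in application code becomes a single clause in the query, and the cleanup job becomes a one-time statement.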

We have retired our entire KairosDB+Cassandra cluster and replaced it with InfluxDB and Elasticsearch. We even monitored our Cassandra cluster using InfluxDB (TICK) + Grafana until it was phased out!

Today, InfluxDB and the TICK stack are central components in the Botmetric technology landscape. We have contributed a couple of patches to Telegraf, and we will continue to adopt InfluxDB as our core data store as we build new real-time use cases that are event-driven in nature.

If you are an engineer or architect evaluating or looking for a real-time data store today, InfluxDB is a savior; its simplicity is amazing and will certainly speed up your application development. The simple operational management of InfluxDB matters even more if it’s a critical data store for you, since you won’t need to rack your brain during production debugging, and the active community support will be very helpful.

We often refer to InfluxDB as a good choice for a “High Velocity Real Time Metrics Data Store”. For most use cases, your search should end here :)