Hulu is a U.S.-based subscription service, offering live and on-demand TV and movies, as well as live news and sports, with and without commercials. Hulu launched in 2008 and is the only service that gives viewers instant access to current shows from every major U.S. broadcast network.
Time series data is important at Hulu because it allows them to evaluate trends, identify issues, and react to them. After working with a number of other time series databases that didn’t meet their needs, Hulu created a stable, fast, and durable pipeline with InfluxDB and Kafka. The resulting system lets them handle problematic clusters, disable huge portions of the infrastructure, and re-route the affected traffic to other datacenters with no end-user impact. It also supports query limits and the blocking or filtering of “bad” metrics.
Originally, each development team at Hulu created and used its own time series data solution, which ultimately proved wasteful and led to inconsistencies. Because all of the teams had similar needs, Hulu built their original time series data pipeline on Graphite, which provided one database for all of their engineering teams. At the time, the throughput of this cluster was 1.4 million metrics per second. Unfortunately, running the pipeline as a service shared by every development team at Hulu caused a number of issues. They also faced scalability problems due to the tremendous throughput required, and they needed to support an unheard-of service-level agreement (SLA) of 100% uptime, versus the typical 99.999%, given the nature of their live-streaming, consumer-facing service.
Hulu decided to re-architect their pipeline around InfluxDB in order to avoid the issues they were experiencing with their databases and to reduce the need for manual intervention. They created two identical clusters, one in each of their primary datacenters. A metric relay cluster was built on top of these with the sole purpose of pushing every metric it received to both clusters, which made the metrics retrievable from any datacenter. This completely eliminated the metric availability issues that they had experienced previously.
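The dual-write relay described above can be illustrated with a minimal sketch. This is not Hulu's implementation: the `Cluster` and `MetricRelay` classes, the in-memory storage, and the metric fields are all hypothetical stand-ins chosen only to show the idea of forwarding each metric to every cluster.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for one time series cluster."""
    name: str
    points: list = field(default_factory=list)

    def write(self, point: dict) -> None:
        # A real cluster would persist the point; here we just keep it in a list.
        self.points.append(point)


class MetricRelay:
    """Forwards every metric it receives to all configured clusters."""

    def __init__(self, clusters):
        self.clusters = list(clusters)

    def push(self, point: dict) -> None:
        # Fan out: the same point is written to each cluster in turn.
        for cluster in self.clusters:
            cluster.write(point)


# Two identical clusters, one per datacenter (names are illustrative).
dc1 = Cluster("dc1")
dc2 = Cluster("dc2")
relay = MetricRelay([dc1, dc2])

relay.push({"measurement": "cpu", "value": 0.42, "ts": 1})
relay.push({"measurement": "mem", "value": 0.73, "ts": 2})
```

Because both clusters receive the same points, a reader can query either one and see the same data, which is the property the relay is meant to provide.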
Hulu then incorporated Telegraf, followed by Kafka, so that if one of the InfluxDB clusters fails, the “writers” are redirected to another datacenter that continues ingesting their metrics until the failed cluster is back up and both clusters have caught up. This design allowed Hulu to completely disable huge portions of the infrastructure and route the affected traffic to another datacenter with no end-user impact.
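One way to picture the catch-up behavior is a durable log that each cluster's writer reads at its own pace. The sketch below is an assumption about how such a design could work, not a description of Hulu's code: `MetricLog` is a plain list standing in for a Kafka topic, and `ClusterWriter`, its `offset`, and its `healthy` flag are hypothetical names.

```python
class MetricLog:
    """Append-only log; a hypothetical stand-in for a Kafka topic."""

    def __init__(self):
        self.records = []

    def append(self, record) -> None:
        self.records.append(record)


class ClusterWriter:
    """Reads the log from its own offset and writes into one cluster."""

    def __init__(self, name: str, log: MetricLog):
        self.name = name
        self.log = log
        self.offset = 0      # position of the next unread record
        self.healthy = True  # set to False to simulate a cluster outage
        self.stored = []     # records written to this cluster so far

    def drain(self) -> int:
        """Write all unread records if healthy; return how many were written."""
        if not self.healthy:
            return 0
        pending = self.log.records[self.offset:]
        self.stored.extend(pending)
        self.offset += len(pending)
        return len(pending)


log = MetricLog()
dc1 = ClusterWriter("dc1", log)
dc2 = ClusterWriter("dc2", log)

log.append(("cpu", 1, 0.40))
dc1.drain()
dc2.drain()

# Simulate an outage in dc1: the log keeps growing and dc2 keeps ingesting.
dc1.healthy = False
log.append(("cpu", 2, 0.55))
log.append(("cpu", 3, 0.61))
dc1.drain()
dc2.drain()
lag_during_outage = len(dc2.stored) - len(dc1.stored)

# dc1 recovers and replays everything it missed from its saved offset.
dc1.healthy = True
replayed = dc1.drain()
```

The key design choice in this sketch is that the log, not the cluster, holds the metrics during an outage, so a recovering cluster only needs its last offset to get back in step with the other.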
1.4 million+ metrics collected per second
Need performant DevOps solution
Need to scale to meet SLAs
“Time series data is and will continue to be a crucial part of Hulu’s ability to evaluate trends and react to them. We were able to address all of the issues in our previous pipeline and are now transitioning all users off our legacy platform.”
- Samir Jafferali, Senior Systems Engineer, Hulu