Hulu is a U.S.-based subscription service, offering live and on-demand TV and movies, as well as live news and sports, with and without commercials. Hulu launched in 2008 and is the only service that gives viewers instant access to current shows from every major U.S. broadcast network.

Time series data is important at Hulu because it lets them evaluate trends, identify issues, and react to them. After working with a number of other time series databases that didn’t meet their needs, Hulu built a stable, fast and durable pipeline with InfluxDB and Kafka. With this system, they can isolate problematic clusters, disable huge portions of the infrastructure and re-route that traffic to other datacenters with no end-user impact, and they can enforce query limits and block or filter “bad” metrics.
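
The case study doesn’t show how Hulu implemented metric blocking, but the idea can be sketched in a few lines of Python; the blocklist patterns and the metric format below are purely illustrative, not Hulu’s actual rules:

```python
import re

# Hypothetical blocklist: patterns for known-"bad" metric names,
# e.g. series with unbounded cardinality from per-request IDs.
BLOCKED_PATTERNS = [
    re.compile(r"^debug\."),
    re.compile(r"\.request_id\."),
]

def is_blocked(name: str) -> bool:
    """Return True if the metric name matches any blocklist pattern."""
    return any(p.search(name) for p in BLOCKED_PATTERNS)

def filter_metrics(points: list[dict]) -> list[dict]:
    """Drop blocked metrics before they are written to the pipeline."""
    return [p for p in points if not is_blocked(p["measurement"])]
```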

Originally, each development team at Hulu created and maintained its own time series data solution, which ultimately proved wasteful due to inconsistencies. Because all of their teams had similar needs, they decided to build a single shared time series pipeline, based on Graphite, that provided a database for all of their engineering teams. At the time, throughput for this cluster was 1.4 million metrics per second. Unfortunately, they experienced a number of issues that stemmed from offering the pipeline as a shared service across all development teams at Hulu. They also faced scalability problems due to the tremendous throughput required, and the nature of their live-streaming, consumer-facing service meant supporting an unheard-of service-level agreement (SLA) of 100% uptime, versus the typical 99.999%.

Hulu initially decided to re-architect their pipeline around InfluxDB to circumvent the issues they were experiencing with their databases and reduce the need for manual intervention. They created two identical clusters, one in each of their primary datacenters. On top of these, they built a metric relay cluster whose sole purpose was to push every metric received to both clusters, so that metrics could be retrieved from either datacenter. This completely eliminated the metric availability issues they had experienced previously.
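
Hulu’s relay code isn’t published, but a minimal dual-write relay along these lines can be sketched with the InfluxDB 1.x Python client; the hostnames and database name are placeholders:

```python
from influxdb import InfluxDBClient

# Placeholder endpoints for the two primary datacenters.
CLUSTERS = [
    InfluxDBClient(host="influxdb.dc1.example.com", port=8086, database="metrics"),
    InfluxDBClient(host="influxdb.dc2.example.com", port=8086, database="metrics"),
]

def relay(points):
    """Push every metric received to both clusters, so the full
    data set is retrievable from either datacenter."""
    for client in CLUSTERS:
        client.write_points(points)

relay([{
    "measurement": "cpu_load",
    "tags": {"host": "web-01", "dc": "dc1"},
    "fields": {"value": 0.64},
}])
```

In practice, a relay like this also needs buffering and retries when one cluster is unreachable, which is where Telegraf and Kafka come in below.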

Hulu then incorporated Telegraf, and later Kafka, so that in the event of a failure of one InfluxDB cluster, the “writers” are redirected to the other datacenter, which continues ingesting metrics until the failed cluster is back up and the two are in sync. This design allows Hulu to completely disable huge portions of the infrastructure and route that traffic to another datacenter with no end-user impact.
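
The exact failover logic isn’t described in detail, but the pattern of queueing writes in Kafka and redirecting them when a datacenter is unhealthy might look roughly like this with kafka-python; the broker addresses, topic name, and failover policy are assumptions:

```python
import json
from kafka import KafkaProducer

# Hypothetical Kafka bootstrap servers, one cluster per datacenter.
DATACENTERS = {
    "dc1": ["kafka.dc1.example.com:9092"],
    "dc2": ["kafka.dc2.example.com:9092"],
}

PRODUCERS = {
    dc: KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=lambda point: json.dumps(point).encode("utf-8"),
    )
    for dc, servers in DATACENTERS.items()
}

def write_metric(point, primary="dc1", fallback="dc2"):
    """Queue a metric in the primary datacenter's Kafka cluster;
    on failure, redirect the write to the fallback datacenter.
    Consumers in each datacenter drain their local queue into
    InfluxDB, so nothing is lost while a cluster is down."""
    try:
        PRODUCERS[primary].send("metrics", point).get(timeout=5)
    except Exception:
        PRODUCERS[fallback].send("metrics", point).get(timeout=5)
```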

Collecting over 60 million metrics per minute has given Hulu better insight into their infrastructure, which has helped them meet their 100% uptime SLA and better evaluate trends. They use Kafka and InfluxDB, the purpose-built time series platform, to scale their performant architecture and data pipeline without hampering growth. Start monitoring your own architecture with InfluxDB for free.

Additional resources