How Hulu Scaled Real-Time Monitoring for Millions of Viewers

Region: North America

Industry: Entertainment

Overview

Reliability at streaming scale

As one of the largest streaming platforms in the U.S., Hulu depends on time series data to monitor infrastructure health, identify issues quickly, and maintain a reliable viewing experience. After evaluating multiple time series databases that fell short of its performance and scalability requirements, Hulu built a stable, high-performance telemetry pipeline with InfluxDB and Kafka. The resulting architecture lets engineering teams isolate problematic clusters, reroute traffic between data centers without affecting viewers, enforce query limits, and filter out noisy or malformed metrics before they degrade system performance.

Time series data is and will continue to be a crucial part of Hulu’s ability to evaluate trends and react to them. We were able to address all of the issues in our previous pipeline and are now transitioning all users off our legacy platform.

Samir Jafferali

Senior Systems Engineer at Hulu

Challenge

Meeting the demands of streaming at scale

Originally, each development team at Hulu built and maintained its own time series data solution. While this approach gave teams flexibility, it duplicated effort and led to inconsistencies across the organization. Because the needs of Hulu’s engineering teams were largely the same, the company decided to build a centralized time series data pipeline based on Graphite. The platform provided a shared database for all engineering teams and, at the time, handled a throughput of 1.4 million metrics per second.

As the platform grew, Hulu began to encounter challenges associated with operating the pipeline as a shared service. The sheer volume of data created scalability concerns, and maintaining the service required significant manual intervention. At the same time, Hulu needed to support an unprecedented service-level agreement (SLA) of 100% uptime, well beyond the typical 99.999% target, because of the demands of a live, consumer-facing streaming service.
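To put the gap between those two targets in concrete terms, the short calculation below converts an availability percentage into a yearly downtime budget. It is only an arithmetic illustration; the function name and the choice of a 365-day year are assumptions made for this example, not figures from Hulu.

```python
# Illustrative arithmetic only: how much downtime per year an availability
# target allows. Assumes a 365-day (non-leap) year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at a given target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


five_nines = downtime_budget_minutes(99.999)
full_uptime = downtime_budget_minutes(100.0)

print(f"99.999% -> {five_nines:.2f} min/year")
print(f"100%    -> {full_uptime:.2f} min/year")
```

A 99.999% target still leaves a little over five minutes of permitted downtime each year, while a 100% target leaves none, so any single point of failure in the metrics pipeline becomes unacceptable.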

To overcome these limitations, Hulu rearchitected its pipeline around InfluxDB. The team deployed two identical clusters, one in each of its primary data centers, and built a metric relay cluster designed to send every incoming metric to both environments. As a result, metrics could be retrieved from either data center at any time, completely eliminating the metric availability issues the team had experienced with its previous architecture.
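The dual-write idea behind the metric relay can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Hulu's implementation: `Cluster` is a hypothetical in-memory stand-in for an InfluxDB cluster, `MetricRelay` is a hypothetical name, and the sample points merely follow the general shape of InfluxDB line protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical in-memory stand-in for one InfluxDB cluster."""
    name: str
    points: list = field(default_factory=list)

    def write(self, point: str) -> None:
        self.points.append(point)

    def query(self) -> list:
        return list(self.points)


class MetricRelay:
    """Fans every incoming metric out to all configured clusters."""

    def __init__(self, clusters):
        self.clusters = list(clusters)

    def send(self, point: str) -> None:
        # Each point is written to every cluster, so both hold the same data.
        for cluster in self.clusters:
            cluster.write(point)


# One cluster per data center (names are made up for the example).
dc_a = Cluster("dc-a")
dc_b = Cluster("dc-b")
relay = MetricRelay([dc_a, dc_b])

relay.send("cpu,host=web01 usage=0.42 1700000000000000000")
relay.send("cpu,host=web02 usage=0.37 1700000000000000000")

# Either data center can now answer the same query.
print(dc_a.query() == dc_b.query())
```

Because every point lands in both clusters, a reader can be pointed at whichever data center is available. The sketch ignores what happens when one write fails, which is the gap the later additions to the architecture address.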

Solution

Engineered for continuous availability

Hulu later incorporated Telegraf and Kafka into the architecture to further improve resiliency. If one InfluxDB cluster failed, write traffic could be automatically redirected to the cluster in the other data center, so metrics continued to be ingested until the affected cluster was restored and brought back into sync. This design enabled Hulu to take entire sections of infrastructure offline and seamlessly route traffic to another data center with no impact on end users.
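One way a durable log in front of the databases can keep a recovering cluster from losing data is sketched below. It is a simplified model under loud assumptions: the Python list stands in for a Kafka topic, the per-cluster `offset` stands in for a consumer offset, and the class and method names are hypothetical. The source does not describe Hulu's actual failover logic at this level of detail.

```python
class Cluster:
    """Hypothetical stand-in for an InfluxDB cluster fed from a log."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True
        self.points = []
        self.offset = 0  # index of the next log entry this cluster must apply


class Pipeline:
    """Buffers metrics in an append-only log and applies them per cluster."""

    def __init__(self, clusters):
        self.log = []  # stand-in for a Kafka topic (assumption)
        self.clusters = list(clusters)

    def ingest(self, point: str) -> None:
        self.log.append(point)
        self.drain()

    def drain(self) -> None:
        # Healthy clusters catch up from their own offset; an unhealthy
        # cluster is skipped and its offset stays where it was.
        for cluster in self.clusters:
            if not cluster.healthy:
                continue
            while cluster.offset < len(self.log):
                cluster.points.append(self.log[cluster.offset])
                cluster.offset += 1


dc_a, dc_b = Cluster("dc-a"), Cluster("dc-b")
pipeline = Pipeline([dc_a, dc_b])

pipeline.ingest("m1")
dc_a.healthy = False            # simulate an outage in one data center
pipeline.ingest("m2")
pipeline.ingest("m3")
during_outage = list(dc_a.points)

dc_a.healthy = True             # cluster restored
pipeline.drain()                # replay the entries it missed

print(during_outage, dc_a.points, dc_b.points)
```

During the simulated outage the healthy cluster keeps ingesting, and the failed cluster replays only the entries it missed once it returns. A real deployment would also need retention limits on the log and health checks to drive the `healthy` flag, both of which are omitted here.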

Result

60 million metrics per minute

Hulu collects more than 60 million metrics per minute, giving engineering teams deeper visibility into their infrastructure and the ability to better identify and analyze trends. That visibility has helped the company meet its 100% uptime SLA while continuing to scale its operations.

By combining Kafka with InfluxDB, Hulu built a high-performance data pipeline capable of supporting massive throughput without limiting growth. The architecture provides the scale, reliability, and flexibility needed to support one of the world’s largest live streaming platforms.