Customer Success Story: Optum
To make its systems and software as reliable and resilient as possible, Optum needed observability across its systems and infrastructure, and needed to reduce snowflakes (servers that require special configuration beyond that covered by automated deployment scripts) through automated deployment. In Optum’s case, the observability journey evolved hand in hand with its growing scalability and configuration automation needs.
Optum also had to tackle the issue of having a vast number of different tools and solutions while being very much a self-service company, geared toward enabling its data center teams to focus on application improvement without having to worry about underlying configurations and infrastructure.
Optum therefore needed a dynamic configuration automation solution that would scale with its massive data center infrastructure, flex with the diverse needs of its various teams, and minimize manual work to provide a self-support model, thereby freeing its teams to focus on application optimization rather than infrastructure. For that purpose, Optum developed an attribute repository in-house — called “Lighthouse” — to control which configurations, versions and plugins go out to each individual server deployed in any of the UnitedHealth Group data centers.
The purpose of Lighthouse is to store different pieces of information for different groups of servers. When scaling from tens of servers to hundreds of thousands, Lighthouse gives Optum the ability to store attributes for groups of servers and to add new attributes dynamically. Developing Lighthouse turned out to be an evolving process of incoming requests and trial and error. Once built, Lighthouse allowed Optum to change configurations quickly and dynamically, which matters because the organization spins thousands of VMs up and down every day.
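Lighthouse’s internals are not public, but the pattern it describes — attributes stored per group of servers, merged at lookup time, with new attributes addable at any point — can be sketched roughly. The class and method names below are hypothetical, not Lighthouse’s actual API:

```python
class AttributeRepository:
    """Minimal sketch of a Lighthouse-style attribute store (hypothetical API)."""

    def __init__(self):
        self._groups = {}       # group name -> {attribute: value}
        self._membership = {}   # server name -> set of group names

    def set_attribute(self, group, key, value):
        # Attributes can be added to a group dynamically at any time.
        self._groups.setdefault(group, {})[key] = value

    def add_server(self, server, group):
        self._membership.setdefault(server, set()).add(group)

    def attributes_for(self, server):
        # Merge attributes from every group the server belongs to,
        # so one change to a group fans out to all of its servers.
        merged = {}
        for group in sorted(self._membership.get(server, ())):
            merged.update(self._groups.get(group, {}))
        return merged


repo = AttributeRepository()
repo.set_attribute("dc1-web", "telegraf_version", "1.28.0")
repo.add_server("web01", "dc1-web")
print(repo.attributes_for("web01"))
```

The key design point is that configuration is attached to groups rather than individual machines, so a single attribute change reconfigures every server in the group on its next deployment.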
Optum uses the InfluxDB time series database to ingest monitoring data (which is time series data) given InfluxDB’s purpose-built design, high write throughput and scalability.
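InfluxDB ingests writes in its text-based line protocol, where each point carries a measurement name, optional tags, fields, and a nanosecond timestamp. A small helper illustrating the format (the function and sample values are illustrative, not from Optum’s setup):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one data point in InfluxDB line protocol:
    measurement[,tag=value...] field=value[,field=value...] timestamp"""
    tag_str = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement}{tag_str} {field_str} {timestamp_ns}"


line = to_line_protocol(
    "cpu",
    {"host": "web01", "dc": "dc1"},
    {"usage_idle": 98.2},
    1609459200000000000,
)
print(line)  # cpu,dc=dc1,host=web01 usage_idle=98.2 1609459200000000000
```

Because tags are indexed, attributes such as data center or team (the kind of metadata Lighthouse manages) map naturally onto tags, which is what makes the data queryable per group of servers.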
With Telegraf, Optum wanted to collect data in a way that’s so easy that they would have no reason to use a different metrics collection tool. They wanted to set Telegraf to collect all the data their teams needed, enabling them to act on it as they needed (such as for alerting or sending it to a different platform). This required the ability to assign attributes dynamically, using Lighthouse as their attribute repository.
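In Telegraf, dynamically assigned attributes of this kind typically land in the `[global_tags]` section of `telegraf.conf`, which stamps every collected metric with those tags. A hedged sketch of what such a configuration might look like, with tag names and values as if templated from a Lighthouse-style attribute repository at deploy time (all values here are illustrative):

```toml
# Hypothetical telegraf.conf fragment; datacenter/team values would be
# filled in per server from the attribute repository during deployment.
[global_tags]
  datacenter = "dc1"
  team = "payments"

# Ship collected metrics to InfluxDB.
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]
  database = "telegraf"

# A standard input plugin; which plugins are enabled could itself
# be controlled by attributes.
[[inputs.cpu]]
  percpu = true
  totalcpu = true
```

Because global tags ride along on every metric, downstream alerting or routing rules can key off them without each team touching the collection layer.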
With automatic deployments via Lighthouse, InfluxDB and Telegraf — and by deploying the Chef configuration management tool to write system configuration “recipes” — Optum achieved its goal of reducing snowflakes and gaining observability.
“How do you make this so easy? That was our goal – we want this to install Telegraf and use InfluxDB. We want this to be so simple that you don’t have to do any work other than requesting that the Chef cookbook get applied to your server.”
Matthew Iverson, SRE team lead, Optum