Balancing Speed and Safety with Continuous Delivery

The benefits of continuous delivery are well known these days: rapid feedback, speed of innovation, reduced fault recovery time, and increased confidence in release processes. Conversely, teams that release less frequently tend to find each release more stressful. Continuous delivery is a spectrum; it doesn’t have to mean blasting every commit to all production environments at once. So, how do we strike a balance between speed and safety? This is what we asked ourselves when we built out the InfluxDB Cloud Dedicated platform.

InfluxDB Cloud Dedicated is a fully managed, single-tenant version of InfluxDB 3 that provides the performance and security of a dedicated database while maintaining the ease of use and scalability of a managed service. It is designed for customers who require high performance as well as the isolation and flexibility of a dedicated Kubernetes cluster.

From the outset, we wanted to build a platform that could scale reliably with clusters and users, minimize manual intervention, and automate the platform for all operational tasks, releases, and deployments. That meant that deployments of new versions of the database needed to be fully automated. However, any powerful automation has the potential to be destructive if it makes the wrong decisions. We needed to establish guardrails around the release process to ensure we were practicing continuous delivery safely.

How Cloud Dedicated works

InfluxDB Cloud Dedicated is driven by our internal control plane, which is a set of Kubernetes controllers deployed into our staging and production environments. Using declarative configuration and Kubernetes reconciliation loops, these controllers are able to create clusters with supporting cloud infrastructure for our customers. We seed these clusters with an installation of FluxCD, which continuously synchronizes cluster-specific manifests to run InfluxDB internally.

[Diagram: how InfluxDB Cloud Dedicated works]

How our deployments work

Each commit to the database code repository triggers the deployment of a new manifest bundle into our staging environment. The artifact of promotion is an OCI bundle containing Jsonnet. Later, we render this with a cluster-specific set of parameters, producing cluster-specific YAML manifests to be synced by FluxCD. To keep things at a high level, I will refer to this Jsonnet bundle as “the bundle”. In staging, we have a canary cluster whose health functions as a signal to gate promotion to production. Therefore, any change to our controllers or bundle has to be healthy in staging before either is considered for production. So, we are practicing true CD in staging. So far, so good.
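The key idea here is one shared template, rendered once per cluster. A toy sketch of that shape (the real pipeline renders Jsonnet from an OCI bundle; the template, parameter names, and rendering function below are purely illustrative):

```python
# Illustrative only: in reality the bundle is Jsonnet pulled from an OCI
# registry. Here we fake the rendering step with a plain format string.
BUNDLE_TEMPLATE = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: influxdb-{cluster_id}
spec:
  replicas: {replicas}
"""

def render_bundle(template: str, cluster_params: dict) -> str:
    """Render cluster-specific manifests from the shared bundle template."""
    return template.format(**cluster_params)

# Every cluster shares the same bundle revision but gets its own parameters.
manifests = {
    cid: render_bundle(BUNDLE_TEMPLATE, {"cluster_id": cid, "replicas": n})
    for cid, n in [("acme-prod", 3), ("internal-canary", 1)]
}
```

The point of the shape is that promotion moves a single artifact between environments, while per-cluster differences live entirely in the parameters.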

If staging is healthy then we promote the controllers and the bundle to production. But how can we safely do that, assuming that the bundle will be rolled to every production cluster at once?

Release channels to the rescue

Most continually updated software has some kind of release channel system; Linux distributions are a familiar example. You can subscribe to the bleeding edge, or use something like a long-term support release. We took inspiration from this model to create a release channel system for our production environment that progresses from “bleeding edge” to the most stable—a bundle must remain healthy in each phase for 12 consecutive hours before it is promoted to the next phase. This allows us to get an early signal on the quality of every bundle and promote only the most stable ones. We use our internal clusters as canaries to get a robust signal on those early channels, and our customers’ production clusters live at the very end of the pipeline. The 12-hour bake-in per phase not only creates a high bar for promotion, it also ensures our customers’ clusters are not being constantly deployed to. Each deployment potentially causes pods to roll for key components that are accepting writes and processing queries for our customers.
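The bake-in rule amounts to a simple gate: a revision may leave a channel only after it has been continuously healthy there for 12 hours, and any unhealthiness resets the clock. A minimal sketch of that gate (the class and method names are made up for illustration, not our controller code):

```python
from datetime import datetime, timedelta

BAKE_IN = timedelta(hours=12)  # illustrative constant matching the article

class Channel:
    """Tracks how long the current bundle revision has been healthy."""

    def __init__(self, revision: str):
        self.revision = revision
        self.healthy_since = None  # None means currently unhealthy

    def observe(self, healthy: bool, now: datetime) -> None:
        if not healthy:
            self.healthy_since = None        # any failure resets the clock
        elif self.healthy_since is None:
            self.healthy_since = now         # start (or restart) the timer

    def promotable(self, now: datetime) -> bool:
        return (self.healthy_since is not None
                and now - self.healthy_since >= BAKE_IN)

t0 = datetime(2024, 1, 1)
ch = Channel("rev-abc")
ch.observe(True, t0)
assert not ch.promotable(t0 + timedelta(hours=6))   # still baking
assert ch.promotable(t0 + timedelta(hours=12))      # bake-in complete
```

Resetting the timer on any failure, rather than merely pausing it, is what makes the 12 hours a genuine bar of *continuous* health.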

These release channels are a custom resource in our control plane—think of them as a named bundle URL.

However, as each channel can have many clusters, how do we ensure they don’t all receive the updates at once?

Release channel rollouts

Our release channel rollouts take inspiration from a mature Kubernetes pattern: rolling updates. In Kubernetes you can tweak configuration knobs for rolling updates, such as “max unavailable” and “max surge.” Our release channel controller is similar in that it will update a few clusters (analogous to “surge”), wait for confirmed successful rollouts (analogous to a pod becoming available), and then proceed to update more clusters. If a cluster on the channel becomes unready at any time, the rolling update stops. We can adjust the number of clusters that are simultaneously updated, but we must balance the risk of an enlarged blast radius with the speed of the rollout. Given that all clusters’ manifests are generated from the same bundle, albeit at different revisions, we handle bigger variations (e.g., enabling/disabling experimental features or performance tuning, or trying out new features in earlier release channels before they reach production) using feature flags.
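The cluster-level rolling update described above can be sketched in a few lines: update up to N clusters at a time, confirm each batch is ready, and halt the whole rollout if anything on the channel goes unready. This is a simplified model, not our controller (the `max_surge` name is borrowed from Kubernetes, and the `is_ready` health probe is hypothetical):

```python
def rolling_update(clusters, target_rev, is_ready, max_surge=2):
    """Roll `target_rev` across `clusters` in batches of `max_surge`.

    `clusters` maps cluster id -> current revision; `is_ready(cid)` is a
    health probe. Returns True only if every cluster reaches `target_rev`.
    """
    pending = [cid for cid, rev in clusters.items() if rev != target_rev]
    while pending:
        # Halt immediately if anything on the channel is unready.
        if not all(is_ready(cid) for cid in clusters):
            return False
        batch, pending = pending[:max_surge], pending[max_surge:]
        for cid in batch:
            clusters[cid] = target_rev   # deploy the new bundle revision
        # Confirm the batch rolled out before touching more clusters.
        if not all(is_ready(cid) for cid in batch):
            return False
    return True
```

Raising `max_surge` trades a larger blast radius for a faster rollout, which is exactly the knob discussed above.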

Feature flags

Feature flags can be toggled on and off per cluster without requiring a redeployment, making them a powerful tool for experimental or otherwise optional features. However, without proper caution, they can be abused to circumvent the safety provided by auto-promotion through the established production release channel pipeline.

Feature flags for us are essentially configuration patches that customize a cluster’s manifests, e.g., setting an environment variable, mounting a volume, or changing some content in a config map. They can also be used to opt in to features such as private networking, which not all clusters require. In such cases, the flag’s overlay can include many additional manifests, as well as tweaks to existing ones.
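In this model, enabling a flag just means merging a patch over the cluster’s rendered configuration. A toy illustration using a recursive dict merge (the flag name and manifest shape below are invented for the example):

```python
def apply_overlay(manifest: dict, patch: dict) -> dict:
    """Deep-merge `patch` over `manifest`, returning a new dict.

    Nested dicts are merged recursively; scalar values in the patch
    replace the originals. Neither input is mutated.
    """
    out = dict(manifest)
    for key, val in patch.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = apply_overlay(out[key], val)
        else:
            out[key] = val
    return out

base = {"spec": {"env": {"LOG_LEVEL": "info"}, "replicas": 3}}
# Hypothetical flag: turn on debug logging for a single cluster.
flag_debug_logging = {"spec": {"env": {"LOG_LEVEL": "debug"}}}

patched = apply_overlay(base, flag_debug_logging)
```

Because the patch is data rather than code, toggling the flag changes the rendered manifests without cutting a new bundle revision.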

We often use feature flags to roll out new features, first in staging, then in internal production clusters, before “promoting” the flag, which is our term for enabling it by default in all clusters.

Defects and break-glass procedures

With all this automation and safety in place, it is very rare for a defect to reach anything beyond the earliest phases, such as staging or perhaps the first production release channel.

If a defect in a bundle causes a cluster to become unready, the release channel controller will not progress the rollout, and the channel will remain unready, blocking auto-promotion of the bundle. We can then pin the channel to the last known good revision, restoring clusters that received the update to health while we wait for a fix to be promoted through the pipeline.

Hypothetically, if a defect progressed through the pipeline unnoticed—i.e., it didn’t render the workloads unhealthy but was undesirable in some way—we would intervene in the same way by pinning affected channels and stopping auto-promotion.

Suppose pinning isn’t an option, and we need to accelerate a fix through the pipeline. In that case, we have a mechanism called “force promote,” which informs the auto-promotion system that if a specific revision appears, it should be rushed through, skipping the 12-hour bake-in (up to a certain channel in the pipeline). This revision still needs to be healthy in all clusters within a channel, so it is relatively safe. However, a risk assessment may deem it acceptable to skip the bake-in time, e.g., during an outage.
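Force promotion can be thought of as an override on the bake-in gate: for one named revision, up to a given channel, the time requirement is waived while the health requirement still holds. A sketch of that decision (the function and its parameters are hypothetical, not our control plane’s API):

```python
def may_promote(revision, channel_index, healthy, baked_in, force=None):
    """Decide whether `revision` may leave the channel at `channel_index`.

    `force` is an optional (revision, max_channel_index) pair that waives
    the bake-in time requirement, but never the health requirement.
    """
    if not healthy:
        return False            # health is always required
    if baked_in:
        return True             # normal path: 12-hour bake-in satisfied
    if force is not None:
        forced_rev, up_to = force
        return revision == forced_rev and channel_index <= up_to
    return False
```

Keeping the health check unconditional is what makes force promotion an accelerant rather than an escape hatch from safety entirely.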

Phased promotion of feature flags

Feature flags can be applied at the release channel level rather than the cluster level, meaning that all clusters in the release channel will receive the flag. Since a release channel’s health reflects the aggregate health of all clusters on the channel, we can promote flags through the pipeline of production release channels safely. For example, we can use our control plane’s CLI to “promote flag-xyz to the internal channel.” Once we observe that it has completed and the channel is healthy, we can choose to promote it further. This process makes feature flag rollout much safer.

We are considering auto-promotion of feature flags, just as we have for bundles, as well as adding alerts that detect when flags have been applied to channels late in the pipeline but not earlier channels, which risks deploying untested code. This additional layer would help to enforce phased promotion.

Summary

Continuous delivery is desirable in today’s fast-paced software development world, but there is no reason it should mean a reckless approach to releases. If you are deploying every release somewhere and promotions are carefully gated to ensure that only the most stable releases progress to the most reliability-sensitive environments, you can have all the benefits of continuous delivery while controlling risk. Given that we can tune aspects of the promotion pipeline, such as bake-in time and the number of clusters on a channel that update simultaneously, the levers of control are firmly within our grasp.