Stop Trusting Container Registries, Verify Image Signatures
By Wojciech Kocjan and Tyson Kamp / Oct 18, 2022 / InfluxDB, InfluxDB Cloud, Security
One of InfluxData’s main products is InfluxDB Cloud. It’s a cloud-native, SaaS platform for accessing InfluxDB in a serverless, scalable fashion. InfluxDB Cloud is available in all major public clouds.
InfluxDB Cloud was built from the ground up to support auto-scaling and handling different types of workloads. Under the hood, InfluxDB Cloud is a Kubernetes-based application consisting of a fleet of micro-services that runs in a multi-cloud, multi-region setup.
The application consists of a storage tier that uses Persistent Volumes and cloud-native object storage (such as S3 on AWS cloud) for persistence. It uses Kafka and Zookeeper for queueing incoming data and managed SQL databases for storing other data. The application also consists of around 50 stateless microservices that perform various operations like writing and querying time series data, as well as periodically running tasks.
In the cloud-native offering of InfluxDB, we identified a specific security concern. The application stores thousands of container images at third-party registries, and those images then deploy into our clusters. How do we know that an image at pull/run time is the same image our CI/CD pipeline produced? What if someone compromised a third-party registry?
Requiring signatures for container images
Per standard risk ownership models of cloud-based systems, an entity (company, etc.) is responsible for the security of the supply chain and components of its offering, regardless of the component provider(s) or service/system vendor(s) that make up the offering. “It wasn’t us” isn’t acceptable.
When considering how to mitigate a complete compromise of our container registry, one somewhat brute force idea comes to mind. After pushing images, the remote registry returns a digest, which is useful for identifying the image and verifying its integrity. One option for a mitigation solution involves creating a database of the digests the application obtains just after pushing the containers. If you trust hashing and standard crypto tools, and keeping a database of this information is acceptable, then this more-or-less handles image authenticity and integrity. But this approach creates additional challenges. You need to make sure all consumers of the container image(s) have up-to-date access to the digest database.
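The digest-database idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `DigestDatabase` class, its method names, and the in-memory dictionary are hypothetical, and the manifest bytes are hashed locally rather than taking the digest returned by a real registry.

```python
import hashlib
import hmac


def manifest_digest(manifest_bytes: bytes) -> str:
    """Return an OCI-style digest string (sha256:<hex>) for manifest bytes."""
    return "sha256:" + hashlib.sha256(manifest_bytes).hexdigest()


class DigestDatabase:
    """Hypothetical store of digests recorded right after each push."""

    def __init__(self) -> None:
        self._digests = {}  # image reference -> digest recorded at push time

    def record_push(self, image_ref: str, manifest_bytes: bytes) -> str:
        """Remember the digest observed when the image was pushed."""
        digest = manifest_digest(manifest_bytes)
        self._digests[image_ref] = digest
        return digest

    def verify_pull(self, image_ref: str, manifest_bytes: bytes) -> bool:
        """Accept a pulled manifest only if it matches the recorded digest."""
        expected = self._digests.get(image_ref)
        if expected is None:
            return False  # never pushed by us, so never trusted
        return hmac.compare_digest(expected, manifest_digest(manifest_bytes))
```

The sketch also makes the drawback visible: every consumer needs an up-to-date copy of `_digests`, which is exactly the distribution problem described above.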
A second, and more appealing, option is to create a signature for each container image at push time and make the list of public keys that can validate those signatures easily available. After all, the public keys aren’t sensitive. Which keys appear on that list still matters (think replay attacks), but more on that later.
For InfluxData’s purposes, we consider the risk mitigated when we can verify that all relevant OCI container images intended to run on InfluxData-managed clusters originated from InfluxData, either at some point in time or from inception, and have not been tampered with (authenticity and integrity). Signatures enable us to detect any tampering, so they’re really appealing for this mitigation strategy. Later we’ll look more in-depth at automatic integrity and authenticity management for the supply chain, but for now we’re taking a “baby steps” approach.
Architecture and images
We build InfluxDB Cloud, primarily, with code and integrations written in-house. We use a CI system to build the code and CD systems to deploy it. This ensures that we can build and deploy any changes to the application’s code as soon as possible.
We also use multiple open source components from InfluxData, such as Telegraf or Telegraf-operator, as well as third-party components, such as Kafka and HashiCorp Vault. The InfluxDB Cloud teams don’t (and in some cases don’t want to) build or control these third-party images in-house. Nevertheless, the team can review and choose to accept specific images, preferably by their SHA digests, and sign those images. We keep the signatures in a separate location, which we describe in more detail in the next section.
What we’re looking to do is to create signing keys and signatures often, and to make the public verification keys easily available. This approach is simpler and more scalable than tracking digests and worrying about consistency. InfluxData currently manages a large number of production clusters across three cloud providers, and we think this container signing idea should scale well.
Adding digital signatures
In the early stages of this project, the team looked at two GitHub repos: Connaisseur and the SigStore project policy-controller. Connaisseur proved to be very quick to set up and easy to configure for proof of concept purposes. Policy-controller was more time consuming and complicated to configure, but we accepted this trade off because it’s often the case that configurability breeds complexity. The team eventually got policy-controller working by automating the creation of the ClusterImagePolicy and re-applying it. Next, they automated the standing up of a test environment and created a Bash script to conduct a positive and negative test of signature validation.
Connaisseur appeared to be more mature in its development but was not part of a larger system targeting supply chain risk, like SigStore is. Further, Connaisseur is written in Python and seems to have less active development and participation. Given the more complete nature of policy-controller/SigStore to address the needs of supply chain risk, its active development (albeit much less mature), and the fact that it’s written in Golang (like InfluxDB), InfluxData opted for policy-controller.
For creating signing key pairs and performing signature creation and validation, we opted for cosign. This was an easy choice to make. It’s just the right tool for the job.
We also wanted rotation to be easy: automated jobs rotate the key pairs used to create and verify signatures. We are still tuning the rotation frequency, but we’re targeting at least weekly rotation and at most a few rotations a day. We store the signing key pairs in HashiCorp Vault and they never leave it; Vault itself performs the signing.
A secure and trusted endpoint, available within our internal network, serves the non-sensitive public keys. All clusters that consume and validate images periodically pull the latest set of public keys and update their local configuration accordingly. If a cluster cannot validate an image’s signature against the public keys returned from the trusted endpoint, it won’t run the image.
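As a rough sketch of the "update local configuration" step, a cluster-side job could turn the current list of public keys into a policy-controller ClusterImagePolicy and re-apply it. The function name, the policy name, and the image glob below are hypothetical, and the field layout follows our reading of the `policy.sigstore.dev/v1beta1` schema, so check it against the policy-controller version you run.

```python
import json


def build_cluster_image_policy(name, image_glob, public_keys_pem, signature_repo=None):
    """Build a ClusterImagePolicy dict with one key authority per public key.

    signature_repo is optional: when set, each authority points at a separate
    OCI repository that holds the signatures.
    """
    authorities = []
    for index, pem in enumerate(public_keys_pem):
        authority = {"name": f"key-{index}", "key": {"data": pem}}
        if signature_repo:
            authority["source"] = [{"oci": signature_repo}]
        authorities.append(authority)
    return {
        "apiVersion": "policy.sigstore.dev/v1beta1",
        "kind": "ClusterImagePolicy",
        "metadata": {"name": name},
        "spec": {
            "images": [{"glob": image_glob}],
            "authorities": authorities,
        },
    }


if __name__ == "__main__":
    # Placeholder PEM strings; a real job would fetch them from the trusted endpoint.
    keys = ["-----BEGIN PUBLIC KEY-----\n...\n-----END PUBLIC KEY-----\n"]
    policy = build_cluster_image_policy("trusted-images", "registry.example.com/**", keys)
    # JSON is accepted by `kubectl apply -f`, so no YAML library is needed.
    print(json.dumps(policy, indent=2))
```

Listing every currently valid key as its own authority means an image is admitted if any one of them verifies its signature, so dropping a key from the list is what retires the signatures made with it.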
This enables InfluxData to create short-lived key pairs and signatures while also enabling clusters that consume images to validate signatures for container images.
For all our in-house code, the CI systems automatically sign every image built from code that has been reviewed and approved and is intended to run in production environments. We store digital signatures for those container images in the same location(s) as the images themselves.
We reference open-source and third-party container images in their upstream registries, and we keep InfluxDB Cloud’s signatures for them in a dedicated image registry that we control. This way InfluxDB Cloud can use upstream images while creating and maintaining the signatures in our own registry. Sigstore cosign and policy-controller fully support this approach.
As part of InfluxDB Cloud metadata, teams managing infrastructure keep lists of all open-source and third-party images that are allowed to run. The list consists of specific images, along with their SHA digests. All those images are periodically signed, with the signature written to an OCI registry controlled by InfluxData. This enables our Kubernetes clusters that validate signatures to run the images, even if the images themselves reside in upstream registries.
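A periodic signing job over such an allowlist might look roughly like the sketch below. It only builds the commands and does not run them. The allowlist shape, the `signing_plan` helper, and the key name are hypothetical; using cosign’s `hashivault://` key reference and the `COSIGN_REPOSITORY` environment variable to place signatures in a separate repository is an assumption about how this could be wired up, not a description of InfluxData’s pipeline, and the flags cosign requires vary by version.

```python
import re

DIGEST_PATTERN = re.compile(r"^sha256:[0-9a-f]{64}$")


def pinned_reference(image: str, digest: str) -> str:
    """Return image@digest, refusing anything that is not a full sha256 digest."""
    if not DIGEST_PATTERN.match(digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    return f"{image}@{digest}"


def signing_plan(allowlist, key_ref, signature_repo):
    """Return (command, env) pairs, one per allowlisted image.

    The env entry tells cosign to store the signature in signature_repo
    instead of next to the upstream image.
    """
    env = {"COSIGN_REPOSITORY": signature_repo}
    plan = []
    for entry in allowlist:
        target = pinned_reference(entry["image"], entry["digest"])
        plan.append((["cosign", "sign", "--key", key_ref, target], env))
    return plan


if __name__ == "__main__":
    allowlist = [{"image": "docker.io/library/example", "digest": "sha256:" + "0" * 64}]
    for command, env in signing_plan(allowlist, "hashivault://image-signing",
                                     "registry.example.com/signatures"):
        print(env, " ".join(command))
```

Signing by digest rather than by tag keeps the signature bound to the exact image that was reviewed, even if the upstream tag later moves.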
This setup does create some additional burden when it’s necessary to update any application that’s not part of InfluxDB Cloud code. Any update requires getting an updated list of upstream images and ensuring they are signed before performing the update. This, however, is an upside because it makes reviewing image changes part of the review process for updating an external component.
After InfluxData defined the approach and processes above, deployment and enablement of signing and verification began. This started by signing a subset of images, followed by deploying policy-controller and validating these images in a single Kubernetes cluster.
After some initial challenges, and once validation worked correctly in one cluster, we enabled policy-controller on additional clusters and updated our checks to include all the images.
InfluxData manages its infrastructure using GitOps, so enabling it for production means enabling policy-controller and the logic for updating the image validation policies, and keeping a list of valid public keys up-to-date.
Once all this setup is live on additional Kubernetes clusters, InfluxDB Cloud workloads can validate their container images.
Here is a diagram of how our infrastructure is set up:
[Diagram: InfluxDB Cloud Container Trust]
Handling security incidents
We gave specific attention to keeping key rotation as simple as possible when security incidents happen. This solution is one of many systems that require attention during an incident, so we took every opportunity we could to simplify the process.
There are as few configuration items as possible and we simplified the architecture as much as we could. The documentation receives input from multiple teams and doesn’t “pass” as usable until people with little-to-no knowledge of the implementation can follow its directions to reset the system. A reset includes:
Rotating security control artifacts, such as keys for authentication from CI systems to image signing endpoint
Generating new key pair for digital signing of container images
Creating new signatures for all container images
Hastening deprecation of potentially compromised key pairs (which causes older signatures to become invalid)
We can achieve recovery of this system in its entirety in a few hours, including a complete redeploy of newly signed container images across all clusters.
One of the main threat vectors we considered when designing this system was the replay attack. A replay attack, in this scenario, is the ability to have software components with known vulnerabilities reinstalled into a system. For example, if an attacker discovered a severe vulnerability in a set of container images, they could obtain these images and their signatures from registries in order to try to reintroduce them (and the vulnerability) into a system later.
The InfluxData solution rotates signing keys so frequently that reintroducing an image with a known vulnerability is practically infeasible. The window of time when a signature is valid is too small to be of practical use to an attacker because the effective lifespan of the signature is a few days or weeks at most.
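The effect of rotation on replay can be illustrated with a toy model. It involves no real cryptography: a signature is represented only by the ID of the key that produced it, and the `SigningKey` type, the function names, and the 14-day maximum age in the usage note are illustrative assumptions rather than InfluxData’s actual values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    created_at: datetime


def trusted_key_ids(keys, now, max_age):
    """Return IDs of keys that are old enough to exist and young enough to trust."""
    return {
        key.key_id
        for key in keys
        if timedelta(0) <= now - key.created_at <= max_age
    }


def signature_accepted(signing_key_id, keys, now, max_age):
    """A signature counts only while its signing key is still in the trusted set."""
    return signing_key_id in trusted_key_ids(keys, now, max_age)
```

With a key created on October 1 and a 14-day maximum age, an image signed with that key is accepted on October 5 but rejected on November 1, so an old vulnerable image cannot be replayed along with its old signature.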
The solution uses only publicly available cryptographic tools and state-of-the-art standards. InfluxData doesn’t create any of its own security components, but rather deploys well-known components and controls in a rapid CI/CD GitOps framework. InfluxData believes in the inherent strength of this model.
The Influx Container Trust solution depends only on Kubernetes and SigStore components. The solution is agnostic to container registries and cloud providers, and operates in any Kubernetes cluster InfluxData manages. Adoption across InfluxData domains is therefore seamless.
While there is clearly no “one size fits all” solution, InfluxData endeavors to mitigate the registry compromise threat in a way that best fits its needs. The approach overlaps with the needs of other groups (corporate, governmental, etc.) and, hopefully, offers some ideas for addressing these types of risks and for adopting a similar mitigation. InfluxData deployments team members Wojciech Kocjan and Tyson Kamp will present at SigStoreCon on Tuesday, October 25, 2022 in Detroit, Michigan (USA) to expand on this blog post. Feel free to attend or contact them for additional info.