<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>InfluxData Blog - Wojciech Kocjan</title>
    <description>Posts by Wojciech Kocjan on the InfluxData Blog</description>
    <link>https://www.influxdata.com/blog/author/wojciech-kocjan/</link>
    <language>en-us</language>
    <lastBuildDate>Tue, 22 Nov 2022 07:00:00 +0000</lastBuildDate>
    <pubDate>Tue, 22 Nov 2022 07:00:00 +0000</pubDate>
    <ttl>1800</ttl>
    <item>
      <title>Monitoring Network Outages at the Edge and in the Cloud</title>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published in &lt;a href="https://thenewstack.io/monitoring-network-outages-at-the-edge-and-in-the-cloud/"&gt;The New Stack&lt;/a&gt; and is reposted here with permission.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gathering data to explore a problem with power outages creating connectivity issues and ultimately draining a laptop battery.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Monitoring locations that have intermittent power and/or connectivity outages can be challenging. In this article, I’ll show how to use InfluxDB, an open source time series database, InfluxDB Cloud and  &lt;a href="https://www.influxdata.com/products/influxdb-edge-data-replication/"&gt;Edge Data Replication&lt;/a&gt;  to store data locally and send it to a central location whenever possible. I’ll also use Telegraf, an open source data collection agent, to retrieve some of the data needed for this purpose.&lt;/p&gt;

&lt;h2 id="connectivity-issues"&gt;Connectivity issues&lt;/h2&gt;

&lt;p&gt;In this case, I’m sharing a small closet in a remote location with a laptop running Windows. This machine allows my friend to print and scan documents, and I use it to gather air quality data. The setup works fine most of the time; however, occasional power and Internet outages limit my friend’s ability to work with documents, as well as my ability to collect air quality data.&lt;/p&gt;

&lt;p&gt;There could be multiple reasons for the failures. The first obvious problem could be with the computer shutting down. The most likely root cause here is power outages, but other factors could be at play too, such as other hardware or software issues. Additionally, there might be problems with the network equipment. The original setup included a local switch and router provided by the internet service provider (ISP).&lt;/p&gt;

&lt;p&gt;I suspected that power outages caused the initial problem, which created the connectivity issues and ultimately resulted in the laptop running out of battery. However, I needed more data to validate my hypothesis.&lt;/p&gt;

&lt;p&gt;I went about updating my setup for two reasons: 1) to avoid or minimize data loss in the event of a power outage, and 2) to gather more data to validate my hypothesis on the root cause. To that end, I decided to gather data on the laptop’s battery and its connectivity to the switch and the router.&lt;/p&gt;

&lt;h2 id="previous-setup"&gt;Previous setup&lt;/h2&gt;

&lt;p&gt;Originally, the laptop ran  &lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt;  and sent data to  &lt;a href="https://cloud2.influxdata.com/signup/"&gt;InfluxDB Cloud&lt;/a&gt;. But I changed the setup to improve it and troubleshoot these connectivity issues. By default, Telegraf sends data whenever there is access to the internet, buffering the data in memory during connection failures. However, if the network goes down during a power outage, the laptop eventually drains its battery and shuts down, and all the buffered data is lost.&lt;/p&gt;

&lt;p&gt;Here’s the previous setup:&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2zFVBXZYdE3iSUeXKPOWYd/05d8b1dcc0b221b4671d81e2895efe60/Monitoring-Network-Outages-at-the-Edge-and-in-the-Cloud-OG.jpg" alt="Monitoring-Network-Outages-at-the-Edge-and-in-the-Cloud-OG" /&gt;&lt;/p&gt;

&lt;h2 id="the-new-setup"&gt;The new setup&lt;/h2&gt;

&lt;p&gt;To remedy this situation, the laptop now runs a local instance of  &lt;a href="https://www.influxdata.com/get-influxdb/"&gt;InfluxDB 2.0&lt;/a&gt;  (OSS). Telegraf, which used to output metrics directly to InfluxDB Cloud, now sends all of its data to the local instance of InfluxDB. I configured the local InfluxDB instance for  &lt;a href="https://www.influxdata.com/blog/edge-data-replication/"&gt;Edge Data Replication&lt;/a&gt;, which lets me configure a bucket so that its data is automatically sent to an instance of InfluxDB Cloud whenever there is connectivity.&lt;/p&gt;

&lt;p&gt;This setup is more resilient to power outages. Telegraf now sends data to the local InfluxDB OSS instance, so it immediately persists on the laptop. OSS then sends that data to my InfluxDB Cloud account, where I can easily query the data over the internet and build alerts on top of it.&lt;/p&gt;

&lt;p&gt;Here’s the new setup, including &lt;a href="https://www.influxdata.com/glossary/edge-computing/"&gt;edge data&lt;/a&gt; replication:&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/4VvLy0SIxP9hEllVg3QaqY/e2d5554534e7fef889a7ddec6b6eab10/The-New-Setup.png" alt="The-New-Setup" /&gt;&lt;/p&gt;

&lt;p&gt;The diagram also includes additional data to help troubleshoot the original issue.&lt;/p&gt;

&lt;h2 id="configuring-the-new-setup"&gt;Configuring the new setup&lt;/h2&gt;

&lt;p&gt;The laptop automatically logs in when powered on and starts multiple applications, including Telegraf.&lt;/p&gt;

&lt;p&gt;I set up my InfluxDB OSS instance and the InfluxDB CLI following instructions on the InfluxDB  &lt;a href="https://portal.influxdata.com/downloads/"&gt;download page&lt;/a&gt;  and installed them to C:\Program Files\InfluxData\influxdb.&lt;/p&gt;

&lt;p&gt;Then I created a new shortcut in the Startup folder (such as  &lt;strong&gt;C:\ProgramData\Microsoft\Windows\Start Menu\Programs\StartUp&lt;/strong&gt;  or  &lt;strong&gt;C:\Users\%USERNAME%\AppData\Roaming\Microsoft\Windows\Start Menu\Programs\Startup&lt;/strong&gt;) to run  &lt;strong&gt;cmd.exe /c "C:\Program Files\InfluxData\influxdb\influxdb2_windows_amd64\influxd.exe"&lt;/strong&gt;  and set the  &lt;strong&gt;Run&lt;/strong&gt; option to  &lt;strong&gt;Minimized&lt;/strong&gt;, which will start InfluxDB the next time the computer boots.&lt;/p&gt;

&lt;p&gt;To get started setting up InfluxDB, I manually ran the shortcut. Using the InfluxDB CLI I installed earlier, I ran the  &lt;code&gt;influx setup&lt;/code&gt;  command to set up the local instance. I provided my local username, password, organization name and bucket name.&lt;/p&gt;
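&lt;p&gt;For reference, the same setup can also be done non-interactively. The following is a sketch with placeholder values rather than my actual credentials:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Placeholder username, password, org and bucket names
influx setup --username admin --password "...." --org my-org --bucket airquality --retention 0 --force&lt;/code&gt;&lt;/pre&gt;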

&lt;p&gt;I already had an InfluxDB Cloud account so I created a new bucket there for all of my replicated data. I also  &lt;a href="https://docs.influxdata.com/influxdb/cloud/security/tokens/"&gt;created an API token&lt;/a&gt;  in my cloud account that has write access to this new bucket. I’ll need the token later when setting up replication.&lt;/p&gt;

&lt;p&gt;Next, I need to set up Edge Data Replication (EDR) so all the data in my local bucket replicates to the bucket I created in my cloud account.&lt;/p&gt;

&lt;p&gt;At this point I need to create a “remote” that tells my local InfluxDB instance about the instance of InfluxDB Cloud where I want it to write data. For this remote, I need to provide a URL, the organization ID of my cloud account (a 16-digit hex number) and the API token I created earlier. The command to do this is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;influx remote create --name cloud --remote-url https://eu-central-1-1.aws.cloud2.influxdata.com --remote-api-token "...." --remote-org-id ....&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this case, my cloud account is on AWS in the EU central region. Obviously, this will differ for InfluxDB Cloud accounts in other regions or on different cloud providers.&lt;/p&gt;

&lt;p&gt;IMPORTANT NOTE: The  &lt;strong&gt;remote-org-id&lt;/strong&gt;  must be the organization ID of the InfluxDB Cloud account, NOT the local organization ID from the  &lt;strong&gt;influx setup&lt;/strong&gt;  step.&lt;/p&gt;

&lt;p&gt;Executing the remote command outputs information about the newly created remote. It’s important to copy the  &lt;strong&gt;Remote ID&lt;/strong&gt; (16-digit hex number) because we’ll need it in the next step.&lt;/p&gt;

&lt;p&gt;At this point, we have everything we need to create a replication. Here, I specify a single bucket on my local InfluxDB OSS instance to replicate to my InfluxDB Cloud account. The command is:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;influx replication create --name cloud-mybucket --remote-id .... --local-bucket-id ....--remote-bucket-id ....&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The  &lt;code&gt;remote-id&lt;/code&gt;  value must be the ID from the output of the  &lt;code&gt;influx remote create&lt;/code&gt;  command above. The values for  &lt;code&gt;local-bucket-id&lt;/code&gt;  and  &lt;code&gt;remote-bucket-id&lt;/code&gt;  need to be the IDs (16-digit hex numbers) of the buckets in InfluxDB OSS and InfluxDB Cloud, respectively. You can find the bucket ID in the InfluxDB UI. Just go to the Load Data section and click on the Buckets tab.&lt;/p&gt;
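&lt;p&gt;Alternatively, the influx CLI can list bucket IDs directly. A sketch, with placeholders for the cloud organization ID and token (the host shown matches my region; yours may differ):&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Local OSS bucket IDs
influx bucket list

# Cloud bucket IDs (placeholder org ID and token)
influx bucket list --host https://eu-central-1-1.aws.cloud2.influxdata.com --org-id .... --token "...."&lt;/code&gt;&lt;/pre&gt;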

&lt;h2 id="data-replication-in-action"&gt;Data replication in action&lt;/h2&gt;

&lt;p&gt;Having a more resilient setup allows me to gather data about outages and, hopefully, to get more data about the underlying problems.&lt;/p&gt;

&lt;p&gt;So, in addition to the air quality data, I started gathering networking data too. I added the  &lt;a href="https://www.influxdata.com/integration/ping/?utm_source=vendor&amp;amp;utm_medium=referral&amp;amp;utm_campaign=2022-10_spnsr-ctn_monitoring-network-outages_tns"&gt;Ping Input plugin&lt;/a&gt;  to my Telegraf to periodically check connectivity of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The switch that connects the closet with other parts of the building (the switch has its own IP address)&lt;/li&gt;
  &lt;li&gt;The local IP address for the router that provides internet connection&lt;/li&gt;
  &lt;li&gt;The remote IP address of the Internet Service Provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I updated the Ping Input Plugin configuration to the following:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;[[inputs.ping]]
  urls = ["(switch IP)", "(ISP local ip)", "(ISP remote IP)"]
  count = 3
  timeout = 2.0
  deadline = 10
  interval = "300s"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This checks connectivity to all three IP addresses every five minutes and stores the results in the local instance of InfluxDB. EDR replicates that data to the cloud.&lt;/p&gt;
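&lt;p&gt;As a quick sanity check, a Flux query along these lines (the bucket name is a placeholder) shows the packet loss that the Ping plugin reports for each target:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;from(bucket: "mybucket")
  |&amp;gt; range(start: -24h)
  |&amp;gt; filter(fn: (r) =&amp;gt; r._measurement == "ping" and r._field == "percent_packet_loss")&lt;/code&gt;&lt;/pre&gt;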

&lt;p&gt;I also started gathering data about the power and battery state of the laptop. I can tell if the laptop is plugged into AC power, if the battery is draining and how much battery is left. Telegraf doesn’t provide a built-in way to retrieve this data on Windows; however, all I had to do was to write a PowerShell script that queries Windows Management Instrumentation (WMI) and reports this data to Telegraf.&lt;/p&gt;

&lt;p&gt;This approach is quite generic, and you can use it to monitor any custom metric or data that you can query from PowerShell. Use the  &lt;a href="https://github.com/influxdata/telegraf/tree/master/plugins/inputs/exec"&gt;Exec Telegraf Input&lt;/a&gt;  plugin to send those metrics to InfluxDB.&lt;/p&gt;

&lt;p&gt;We can retrieve data that states whether the laptop is currently on AC power from the  &lt;strong&gt;BatteryStatus&lt;/strong&gt;  class, using the  &lt;strong&gt;PowerOnline&lt;/strong&gt;  property.&lt;/p&gt;

&lt;p&gt;We can get data about the battery from the  &lt;a href="https://docs.microsoft.com/windows/win32/cimwin32prov/win32-battery"&gt;&lt;strong&gt;Win32_Battery&lt;/strong&gt;&lt;/a&gt; WMI class and the  &lt;strong&gt;EstimatedChargeRemaining&lt;/strong&gt;  property, which returns the percent of remaining battery life.&lt;/p&gt;

&lt;p&gt;To be sure the script supports laptops with multiple batteries, I wrote it to iterate over all rows.&lt;/p&gt;

&lt;p&gt;The PowerShell script for this is as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Determine if at least one battery reports that it is on AC power
$power = 0
Foreach ($row in @(Get-CimInstance -class "BatteryStatus" -namespace root\wmi)) {
  if ($row.PowerOnline -eq $true) {
    $power = 1
  }
}

# Calculate sum of percentage of all batteries and divide by number of batteries
$battery = 0
$count = 0

Foreach ($row in @(Get-CimInstance -classname Win32_Battery -property EstimatedChargeRemaining)) {
  $battery = $battery + $row.EstimatedChargeRemaining
  $count = $count + 1
}

# Avoid division by zero on machines without a battery
if ($count -gt 0) {
  $battery = $battery / $count
}

# Write results as very minimal implementation of line protocol
Write-Output ("batterystatus online={0}i,battery={1}" -f @($power,$battery))&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can call the script manually via PowerShell:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;PS&amp;gt; PowerShell .\batterystatus.ps1
batterystatus online=1i,battery=100&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;NOTE: PowerShell  &lt;a href="https://docs.microsoft.com/powershell/module/microsoft.powershell.core/about/about_execution_policies"&gt;execution policies&lt;/a&gt;  may not allow the file to run. If this is the case, use the  &lt;a href="https://docs.microsoft.com/powershell/module/microsoft.powershell.utility/unblock-file"&gt;Unblock-File&lt;/a&gt;  PowerShell command to allow a single file to run.&lt;/p&gt;
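&lt;p&gt;For example, assuming the script is in the current directory:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;PS&amp;gt; Unblock-File -Path .\batterystatus.ps1&lt;/code&gt;&lt;/pre&gt;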

&lt;p&gt;To add this script to Telegraf, we need to add an inputs.exec statement to its configuration, such as:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;[[inputs.exec]]
  interval = "60s"
  commands = ["powershell C:/trex/batterystatus-telegraf.ps1"]
  data_format = "influx"&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This runs the script every minute, stores the results locally and replicates them to InfluxDB Cloud.&lt;/p&gt;
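&lt;p&gt;Once the data is flowing, a Flux query along these lines (again with a placeholder bucket name) shows when the laptop switched to battery power and how quickly the battery drained:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;from(bucket: "mybucket")
  |&amp;gt; range(start: -7d)
  |&amp;gt; filter(fn: (r) =&amp;gt; r._measurement == "batterystatus")&lt;/code&gt;&lt;/pre&gt;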

&lt;h2 id="results"&gt;Results&lt;/h2&gt;

&lt;p&gt;After several weeks of gathering data, and a few outages occurring, I was able to get meaningful data:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A power outage in the closet caused four out of five incidents detected with this setup.&lt;/li&gt;
  &lt;li&gt;In all of the power outages, the switch also stopped working, which was the root cause of the internet connectivity issues.&lt;/li&gt;
  &lt;li&gt;Network issues at the ISP level caused one outage, during which both the local and remote IP addresses of the ISP did not respond to ICMP ping.&lt;/li&gt;
  &lt;li&gt;Two of the four power-related outages lasted long enough to completely drain the laptop battery, causing it to shut down.&lt;/li&gt;
  &lt;li&gt;There were no issues where the laptop stopped gathering data without running out of power first — ruling out any issues with the computer or operating system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having this data made it much easier to determine next steps. The local switch and printer now run behind an uninterruptible power supply (UPS), which resolved most of the issues and made my friend’s life easier.&lt;/p&gt;

&lt;p&gt;As for the laptop shutting down, unfortunately, there’s no good solution. This laptop’s BIOS/UEFI settings do not provide a setting to automatically restart once AC power is restored.&lt;/p&gt;

&lt;p&gt;As for me, I learned how to set up EDR and how to monitor Windows devices using WMI. My air quality data is also more complete as data replication prevents data loss in cases where the laptop shuts down.&lt;/p&gt;
</description>
      <pubDate>Tue, 22 Nov 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/monitoring-network-outages-edge-cloud/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/monitoring-network-outages-edge-cloud/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Wojciech Kocjan (InfluxData)</author>
    </item>
    <item>
      <title>Stop Trusting Container Registries, Verify Image Signatures</title>
      <description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;

&lt;p&gt;One of InfluxData’s main products is &lt;a href="https://www.influxdata.com/products/influxdb-cloud/"&gt;InfluxDB Cloud&lt;/a&gt;. It’s a cloud-native, SaaS platform for accessing InfluxDB in a serverless, scalable fashion. InfluxDB Cloud is available in all major public clouds.&lt;/p&gt;

&lt;p&gt;InfluxDB Cloud was built from the ground up to &lt;a href="https://youtu.be/rbmq2cgm9WA"&gt;support auto-scaling&lt;/a&gt; and handling different types of workloads. Under the hood, InfluxDB Cloud is a Kubernetes-based application consisting of a fleet of micro-services that runs in a multi-cloud, multi-region setup.&lt;/p&gt;

&lt;p&gt;The application consists of a storage tier that uses &lt;a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/"&gt;Persistent Volumes&lt;/a&gt; and cloud-native object storage (such as S3 on AWS cloud) for persistence. It uses Kafka and Zookeeper for queueing incoming data and managed SQL databases for storing other data. The application also consists of around 50 stateless microservices that perform various operations like writing and querying time series data, as well as periodically running &lt;a href="https://docs.influxdata.com/influxdb/cloud/process-data/get-started/"&gt;tasks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the cloud-native offering of InfluxDB, we identified a specific security concern. Third-party registries store thousands of our container images, which we then deploy into our clusters. How do we know that the container we pull and run is the same container our CI/CD pipeline produced? What if something compromised a third-party registry?&lt;/p&gt;

&lt;h2 id="requiring-signatures-for-container-images"&gt;Requiring signatures for container images&lt;/h2&gt;

&lt;p&gt;Per standard risk ownership models of cloud-based systems, an entity (company, etc.) is responsible for the security of the supply chain and components of its offering, regardless of the component provider(s) or service/system vendor(s) that make up the offering. “It wasn’t us” isn’t acceptable.&lt;/p&gt;

&lt;p&gt;When considering how to mitigate a complete compromise of our container registry, one somewhat brute force idea comes to mind. After pushing images, the remote registry returns a digest, which is useful for identifying the image and verifying its integrity. One option for a mitigation solution involves creating a database of the digests the application obtains just after pushing the containers. If you trust hashing and standard crypto tools, and keeping a database of this information is acceptable, then this more-or-less handles image authenticity and integrity. But this approach creates additional challenges. You need to make sure all consumers of the container image(s) have up-to-date access to the digest database.&lt;/p&gt;
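&lt;p&gt;To illustrate the digest-database option, the digest is available immediately after a push. A sketch with a placeholder registry and image name, not our actual tooling:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Push the image; the registry returns a sha256 digest
docker push registry.example.com/app:1.0

# Read back the digest that pins this exact image content
docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/app:1.0&lt;/code&gt;&lt;/pre&gt;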

&lt;p&gt;A second, and more appealing, option is to create a signature for each container image at container push time and make the list of public keys that can validate signatures easily available. After all, the public keys aren’t sensitive. The list of public keys is important though (think replay attacks) but more on that later.&lt;/p&gt;

&lt;p&gt;For Influx’s interest, we consider a risk mitigated when we can verify that all relevant OCI container images intended to run on InfluxData managed clusters originated from InfluxData (either at some point in time or from inception) and have not been tampered with (authenticity and integrity). Signatures enable us to detect any tampering, so they’re really appealing for this mitigation strategy. Later we’ll look at automatic integrity and authenticity management for the supply chain more in-depth, but for now we’re taking a “baby steps” approach.&lt;/p&gt;

&lt;h2 id="architecture-and-images"&gt;Architecture and images&lt;/h2&gt;

&lt;p&gt;We build InfluxDB Cloud, primarily, with code and integrations written in-house. We use a CI system to build the code and CD systems to deploy it. This ensures that we can build and deploy any changes to the application’s code as soon as possible.&lt;/p&gt;

&lt;p&gt;We also use multiple open source components from InfluxData, such as &lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt; or &lt;a href="https://github.com/influxdata/telegraf-operator"&gt;Telegraf-operator&lt;/a&gt;, as well as third-party components, such as Kafka and HashiCorp Vault. The InfluxDB Cloud teams don’t build or control these third-party images in-house, and in some cases don’t want to. Nevertheless, the team has the ability to review and choose to accept specific images, preferably by their SHA digests, and to sign those images. We keep the signature in a separate location, which we describe in more detail in the next section.&lt;/p&gt;

&lt;p&gt;&lt;img style="margin: 30px auto;" src="//images.ctfassets.net/o7xu9whrs0u9/4iPbNqsfSm4wrqn7FwIUhg/903c825354da2ae87f837aec5baec188/SaaS_Offering.png" alt="SaaS Offering" width="500" height="606" /&gt;&lt;/p&gt;

&lt;p&gt;What we’re looking to do is to create signing keys and signatures often, and to make the public verification keys easily available. This approach is simpler and more scalable than tracking digests and worrying about consistency. InfluxData currently manages a large number of production clusters across three cloud providers, and we think this container signing idea should scale well.&lt;/p&gt;

&lt;h2 id="adding-digital-signatures"&gt;Adding digital signatures&lt;/h2&gt;

&lt;p&gt;In the early stages of this project, the team looked at two GitHub repos: Connaisseur and the SigStore project &lt;a href="https://github.com/sigstore/policy-controller"&gt;policy-controller&lt;/a&gt;. Connaisseur proved to be very quick to set up and easy to configure for proof of concept purposes. Policy-controller was more time consuming and complicated to configure, but we accepted this trade off because it’s often the case that configurability breeds complexity. The team eventually got policy-controller working by automating the creation of the ClusterImagePolicy and re-applying it. Next, they automated the standing up of a test environment and created a Bash script to conduct a positive and negative test of signature validation.&lt;/p&gt;
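&lt;p&gt;For illustration, a minimal ClusterImagePolicy looks roughly like the sketch below; the image glob and public key are placeholders, not our production policy:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-yaml"&gt;apiVersion: policy.sigstore.dev/v1beta1
kind: ClusterImagePolicy
metadata:
  name: require-signatures
spec:
  images:
    # Placeholder registry glob
    - glob: "registry.example.com/**"
  authorities:
    - key:
        data: |
          -----BEGIN PUBLIC KEY-----
          ....
          -----END PUBLIC KEY-----&lt;/code&gt;&lt;/pre&gt;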

&lt;p&gt;Connaisseur appeared to be more mature in its development but was not part of a larger system targeting supply chain risk, like SigStore is. Further, Connaisseur is written in Python and seems to have less active development and participation. Given the more complete nature of policy-controller/SigStore to address the needs of supply chain risk, its active development (albeit much less mature), and the fact that it’s written in Golang (like InfluxDB), InfluxData opted for policy-controller.&lt;/p&gt;

&lt;p&gt;For creating signing key pairs and performing signature creation and validation, we opted for &lt;a href="https://github.com/sigstore/cosign"&gt;cosign&lt;/a&gt;. This was an easy choice to make. It’s just the right tool for the job.&lt;/p&gt;
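&lt;p&gt;The basic cosign workflow is short. The following is a sketch with placeholder image names; in our setup the private key never leaves Vault, and cosign can reference it via a &lt;code&gt;hashivault://&lt;/code&gt; KMS URI instead of a local key file:&lt;/p&gt;

&lt;pre&gt;&lt;code class="language-bash"&gt;# Generate a key pair (writes cosign.key / cosign.pub)
cosign generate-key-pair

# Sign an image by digest and push the signature to the registry
cosign sign --key cosign.key registry.example.com/app@sha256:....

# Verify the signature with the public key
cosign verify --key cosign.pub registry.example.com/app@sha256:....&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For images in registries we don’t control, cosign can also store signatures in a separate repository via the &lt;code&gt;COSIGN_REPOSITORY&lt;/code&gt; environment variable.&lt;/p&gt;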

&lt;p&gt;We also wanted to make rotating key pairs easy where automated jobs rotate the signing key pairs to create and verify signatures. We are still tuning the rotation frequency, but we’re targeting rotating on a weekly basis, at least, and no more than a few times a day. We store the signing key pairs in HashiCorp Vault and they never leave it, leveraging Vault to perform the signing process.&lt;/p&gt;

&lt;p&gt;A secure and trusted endpoint, available within our internal network, makes the non-sensitive public keys available. All clusters that consume images and perform their validation periodically pull the latest set of public keys and update their local configuration accordingly. If a cluster cannot validate a signature with the list of public keys returned from the trusted endpoint, then the cluster won’t load the image.&lt;/p&gt;

&lt;p&gt;This enables InfluxData to create short-lived key pairs and signatures while also enabling clusters that consume images to validate signatures for container images.&lt;/p&gt;

&lt;p&gt;For all our in-house code, the CI systems automatically sign all the code that was reviewed, approved, and is intended to run in production environments. We store digital signatures for those container images in the same location(s) as the images themselves.&lt;/p&gt;

&lt;p&gt;We reference container images for open-source and third-party images externally, and we keep InfluxDB Cloud’s signatures in a dedicated image registry that we control. This way InfluxDB Cloud can reference upstream images but create and maintain signatures in our image registry. Sigstore cosign and policy-controller fully support this approach.&lt;/p&gt;

&lt;p&gt;As part of InfluxDB Cloud metadata, teams managing infrastructure keep lists of all open-source and third-party images that are allowed to run. The list consists of specific images, along with their SHA digests. All those images are periodically signed, with the signature written to an OCI registry controlled by InfluxData. This enables our Kubernetes clusters that validate signatures to run the images, even if the images themselves reside in upstream registries.&lt;/p&gt;

&lt;p&gt;This setup does create some additional burden when it’s necessary to update any application that’s not part of InfluxDB Cloud code. Any update requires getting an updated list of upstream images and ensuring they are signed before performing any updates. This, however, is an upside because it ensures that reviewing image changes becomes part of the review process for updating an external component.&lt;/p&gt;

&lt;p&gt;After InfluxData defined the approach and processes above, deployment and enablement of signing and verification began. This started by signing a subset of images, followed by deploying policy-controller and validating these images in a single Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;After some initial challenges, and once validation worked correctly in one cluster, we enabled policy-controller on additional clusters and updated our checks to include all the images.&lt;/p&gt;

&lt;p&gt;InfluxData manages its infrastructure using GitOps, so enabling it for production means enabling policy-controller and the logic for updating the image validation policies, and keeping a list of valid public keys up-to-date.&lt;/p&gt;

&lt;p&gt;Once all this setup is live on additional Kubernetes clusters, InfluxDB Cloud workloads can validate their container images.&lt;/p&gt;

&lt;p&gt;Here is a diagram of how our infrastructure is set up:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;InfluxDB Cloud Container Trust&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/2toXD3nIUDwM0KZsq0gSZV/db56514730371cb0ef05a7e840f7a6b8/InfluxDB_Cloud_Container_Trust.png" alt="InfluxDB Cloud Container Trust" /&gt;&lt;/p&gt;

&lt;h2 id="handling-security-incidents"&gt;Handling security incidents&lt;/h2&gt;

&lt;p&gt;We gave specific attention to keeping key rotation as simple as possible when security incidents happen. This solution is one of many that requires attention in an incident scenario, so we took every opportunity we could to simplify the process.&lt;/p&gt;

&lt;p&gt;There are as few configuration items as possible and we simplified the architecture as much as we could. The documentation receives input from multiple teams and doesn’t “pass” as usable until people with little-to-no knowledge of the implementation can follow directions created to reset the system. This includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Rotating security control artifacts, such as keys for authentication from CI systems to image signing endpoint&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Generating a new key pair for digitally signing container images&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Creating new signatures for all container images&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Hastening deprecation of potentially compromised key pairs (which causes older signatures to become invalid)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can recover this system in its entirety in a few hours, including a complete redeploy of newly signed container images across all clusters.&lt;/p&gt;

&lt;h2 id="closing-thoughts"&gt;Closing thoughts&lt;/h2&gt;

&lt;p&gt;One of the main threat vectors we considered when designing this system was the replay attack. A replay attack, in this scenario, is the ability to have software components with known vulnerabilities reinstalled into a system. For example, if an attacker discovered a severe vulnerability in a set of container images, they could obtain these images and their signatures from registries in order to try to reintroduce them (and the vulnerability) into a system later.&lt;/p&gt;

&lt;p&gt;The InfluxData solution rotates signing keys so frequently that reintroducing an image with a known vulnerability is practically infeasible. The window of time when a signature is valid is too small to be of practical use to an attacker because the effective lifespan of the signature is a few days or weeks at most.&lt;/p&gt;

&lt;p&gt;The solution uses only publicly available crypto solutions and state-of-the-art encryption standards. InfluxData doesn’t create any of its own security components, but rather deploys well-known components and controls in a rapid CI/CD GitOps framework. InfluxData believes in the inherent strength of this model.&lt;/p&gt;

&lt;p&gt;The Influx Container Trust solution as implemented depends only on Kubernetes and SigStore components. The solution is agnostic to container registries and cloud providers, and operates in any K8s cluster Influx manages. Adoption across Influx domains is therefore seamless.&lt;/p&gt;

&lt;p&gt;While there is clearly no “one size fits all” solution, InfluxData endeavors to mitigate the registry compromise threat in a way that best fits its needs. The approach overlaps with the needs of other groups (corporate, governmental, etc) and, hopefully, offers some ideas about addressing these types of risks, and adopting this threat mitigation solution. InfluxData deployments team members Wojciech Kocjan and Tyson Kamp will &lt;a href="https://sigstoreconna22.sched.com/event/1Aykj/sigstore-or-how-we-learned-to-stop-trusting-registries-and-love-signatures-wojciech-kocjan-tyson-kamp-influxdata"&gt;present&lt;/a&gt; at SigStoreCon on Tuesday October 25, 2022 in Detroit, Michigan (USA) to expand on this blog post. Feel free to attend or contact them for additional info.&lt;/p&gt;
</description>
      <pubDate>Tue, 18 Oct 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/stop-trusting-container-registries-verify-image-signatures/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/stop-trusting-container-registries-verify-image-signatures/</guid>
      <category>Product</category>
      <category>Use Cases</category>
      <author>Wojciech Kocjan, Tyson Kamp (InfluxData)</author>
    </item>
    <item>
      <title>Deleting Production in a Few Easy Steps (and How to Fix It)</title>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published in &lt;a href="https://thenewstack.io/deleting-production-in-a-few-easy-steps-and-how-to-fix-it/"&gt;The New Stack&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It’s the type of nightmare that leaves developers in a cold sweat. Imagine waking up to a message from your team that simply says, “We lost a cluster,” but it’s not a dream at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/products/influxdb-cloud/"&gt;InfluxDB Cloud &lt;/a&gt;runs on &lt;a href="https://thenewstack.io/category/kubernetes/"&gt;Kubernetes&lt;/a&gt;, a cloud application orchestration platform. We use an automated &lt;a href="https://thenewstack.io/category/ci-cd/"&gt;Continuous Delivery&lt;/a&gt; (CD) system to deploy code and configuration changes to production. On a typical workday, the engineering team delivers between 5-15 different changes to production.&lt;/p&gt;

&lt;p&gt;To deploy these code and configuration changes to Kubernetes clusters, the team uses a tool called &lt;a href="https://argoproj.github.io/cd/"&gt;ArgoCD&lt;/a&gt;. ArgoCD reads a YAML configuration file and uses the Kubernetes API to make the cluster consistent with the code specified in the YAML config.&lt;/p&gt;

&lt;p&gt;ArgoCD uses custom resources in Kubernetes (called Applications and AppProjects) to manage the source infrastructure as code repositories. ArgoCD also manages the file paths for these repositories as well as the deployment destinations for specific Kubernetes clusters and namespaces.&lt;/p&gt;

&lt;p&gt;Because we maintain multiple clusters, we also use ArgoCD to police itself and manage the definitions of all the different ArgoCD Applications and AppProjects. This is a common development approach, often referred to as the “app of apps” pattern.&lt;/p&gt;
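&lt;p&gt;As a rough sketch of the pattern (the names, repo URL, and paths below are illustrative placeholders, not our actual configuration), a parent ArgoCD Application points at a directory that itself contains further Application and AppProject definitions:&lt;/p&gt;

```yaml
# Hypothetical "app of apps" parent Application: ArgoCD syncs the
# manifests found at spec.source.path, and those manifests are
# themselves Application/AppProject objects defining the child apps.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-of-apps
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/infra-repo.git   # placeholder repository
    targetRevision: main
    path: env/cluster01/argocd                    # directory of child Applications
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
```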

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/1EomfMv0FWReSQFpUGDfjf/86aa960c316c531402cc0ed33af5a43e/app-of-apps-pattern-OG.jpg" alt="app-of-apps-pattern-OG" /&gt;&lt;/p&gt;

&lt;p&gt;We use a language called jsonnet to create a template of the YAML configuration. The CD system detects changes in the jsonnet, converts the jsonnet into YAML, and then Argo applies the changes. At the time of our incident, all resources for a single application were kept in a single YAML file.&lt;/p&gt;

&lt;p&gt;The object names and directory structures follow naming conventions: &lt;em&gt;(app name)&lt;/em&gt;–&lt;em&gt;(cluster name)&lt;/em&gt; for object names, and env/(cluster name)/(app name)/yml for the path in the repository where each definition is kept. For example, app01 in cluster01 is defined as app01-cluster01 and its definition is kept under the path env/cluster01/app01/yml.&lt;/p&gt;

&lt;p&gt;We perform a code review of our Infrastructure as Code, which includes inspecting the resulting YAML and ensuring that it will function as expected before applying the update.&lt;/p&gt;

&lt;h2&gt;What happened&lt;/h2&gt;

&lt;p&gt;The ordeal began with a single line of code in a configuration file. Someone on the team created a PR that added several new objects to the config file and to the rendered YAML file.&lt;/p&gt;

&lt;p&gt;In this case, one of the added objects was a new ArgoCD Application and AppProject. Due to an error in automation, the names of the objects were wrong. They should have been named &lt;strong&gt;app02&lt;/strong&gt;-cluster01, but instead were named &lt;strong&gt;app01&lt;/strong&gt;-cluster01. The code review missed the difference between app01 and app02 so, when rendered, both resources ended up in a single YAML configuration file.&lt;/p&gt;
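&lt;p&gt;In simplified form (the object names and fields below are illustrative, and most of the Application spec is omitted), the rendered file contained two documents carrying the same name:&lt;/p&gt;

```yaml
# Two YAML documents sharing one metadata.name: when the file is
# applied, the later document overwrites the earlier one.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app01-cluster01   # the existing core workload
spec:
  source:
    path: env/cluster01/app01/yml
---
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app01-cluster01   # misnamed: should have been app02-cluster01
spec:
  source:
    path: env/cluster01/app02/yml
```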

&lt;p&gt;&lt;img src="//images.ctfassets.net/o7xu9whrs0u9/3iOzUpitC3bx8G33gBfXPo/580a7fdb5fae0804e46776084eec4a59/ArgoCD-AppProject.png" alt="ArgoCD-AppProject" /&gt;&lt;/p&gt;

&lt;p&gt;When we merged the PR with the misnamed objects, ArgoCD read the entire generated YAML file and applied all objects in the order they were listed in the file. When multiple objects share a name, the last object listed “wins” and gets applied, which is what happened: ArgoCD replaced the previous instance of app01 with the new one. The problem was that the instance of app01 that ArgoCD deleted was InfluxDB Cloud’s core workload.&lt;/p&gt;

&lt;p&gt;Furthermore, the new object created an additional workload that we didn’t want to enable on that cluster. In short, when ArgoCD replaced the instance of app01, that process triggered an immediate deletion of an entire production environment.&lt;/p&gt;

&lt;p&gt;Obviously, this was not good for our users. When production went down, all API endpoints, including all writes and reads, returned 404 errors. During the outage, no one was able to collect data, tasks failed to run, and external queries didn’t work.&lt;/p&gt;

&lt;h2&gt;Disaster recovery — planning and initial attempts&lt;/h2&gt;

&lt;p&gt;We immediately set to work to fix the issue, beginning by reviewing the code in the merged PR. The issue was difficult to spot because it involved an ArgoCD collision between a project and an application name.&lt;/p&gt;

&lt;p&gt;Our first intuition was to revert the change to get things back to normal. Unfortunately, that’s not exactly how stateful applications work. We started the reversion process, but stopped almost immediately because reverting the change would cause ArgoCD to create a brand new instance of our application. This new instance wouldn’t have the metadata about our users, dashboards, and tasks that the original instance had. Critically, the new instance wouldn’t have the most important thing — our customers’ data.&lt;/p&gt;

&lt;p&gt;At this point, it’s worth mentioning that we store all the data in an InfluxDB Cloud cluster in volumes that use a reclaimPolicy: Retain. This means that even if the Kubernetes resources we manage such as StatefulSet and/or PersistentVolumeClaim (PVC) are deleted, the underlying PersistentVolumes and the volumes in the cloud are &lt;strong&gt;not&lt;/strong&gt; deleted.&lt;/p&gt;
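&lt;p&gt;For example (the class name and provisioner below are placeholders), a StorageClass with this reclaim policy keeps provisioned PersistentVolumes around, in a Released state, even after their PersistentVolumeClaims are deleted:&lt;/p&gt;

```yaml
# StorageClass whose reclaimPolicy is Retain: deleting a PVC bound to a
# volume from this class marks the PersistentVolume "Released" rather
# than deleting the underlying cloud volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: retained-ssd                  # placeholder name
provisioner: kubernetes.io/aws-ebs    # provisioner depends on the cloud
reclaimPolicy: Retain
```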

&lt;p&gt;We created our recovery plan with this critical detail in mind. We had to manually recreate all of the underlying Kubernetes objects, such as PVCs. Once the new objects were up and running, we needed to restore any missing data from backup systems and then have ArgoCD recreate the stateless parts of our application.&lt;/p&gt;

&lt;h2&gt;Disaster recovery — restoring state and data&lt;/h2&gt;

&lt;p&gt;InfluxDB Cloud keeps state in a few components of the system that other microservices interact with, including:&lt;/p&gt;
&lt;ul&gt;
 	&lt;li&gt;&lt;strong&gt;Etcd&lt;/strong&gt;: Used for metadata, this exists on a dedicated cluster separate from the Kubernetes control plane.&lt;/li&gt;
 	&lt;li&gt;&lt;strong&gt;Kafka and Zookeeper&lt;/strong&gt;: Used for Write-Ahead Logs (WALs).&lt;/li&gt;
 	&lt;li&gt;&lt;strong&gt;Storage engine&lt;/strong&gt;: This includes PVCs and object store for persistence.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The team started by restoring etcd and our metadata. This was probably the most straightforward task in the recovery process because etcd stores a relatively small data set so we were able to get the etcd cluster up and running quickly. This was an easy win for us and allowed us to focus all our attention on the more involved recovery tasks, like Kafka and storage.&lt;/p&gt;

&lt;p&gt;We identified and recreated any missing Kubernetes objects, which brought the volumes (specifically, the PersistentVolume objects) back online and put them in an available state. Once the issue with volumes was fixed, we recreated the StatefulSet, which ensured that all the pods were running and the cluster was in sync.&lt;/p&gt;

&lt;p&gt;The next step was to restore Kafka and to do that we also had to get Zookeeper, which keeps metadata for the Kafka cluster, in a healthy state. The Zookeeper volumes also got deleted in the incident. Fortunately, we use Velero to back up Zookeeper hourly, and Zookeeper’s data does not change often. We successfully restored the Zookeeper volumes from a recent backup, which was sufficient to get it up and running.&lt;/p&gt;

&lt;p&gt;To restore Kafka we had to create any missing objects related to the volumes and state of Kafka, then recreate the cluster’s StatefulSet one pod at a time. We decided to disable all the health and readiness checks to get the Kafka cluster in a healthy state. This is because we had to create the pods in StatefulSet one at a time and Kafka does not become ready until the cluster leader is up. Temporarily disabling checks allowed us to create all necessary pods, including the cluster leader so that the Kafka cluster reported as healthy.&lt;/p&gt;

&lt;p&gt;Because Kafka and etcd are independent of each other, we could have worked on restoring both in parallel. However, we wanted to be sure to have correct procedures in place, so we opted to restore them one at a time.&lt;/p&gt;

&lt;p&gt;Once Kafka and etcd came back online, we could re-enable parts of InfluxDB Cloud to start accepting writes. Because we use Kafka as our Write-Ahead Log (WAL), even without storage functioning properly, we could accept writes to the system and add them to the WAL. InfluxDB Cloud would process these writes as soon as the other parts came back online.&lt;/p&gt;

&lt;p&gt;As writes became available, we became worried that our instance would get overwhelmed with requests from &lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt; and other clients writing data that buffered while the clusters were down. To guard against this, we resized the components that handle write requests, increasing the number of replicas and increasing memory requests and limits. This helped us handle a temporary spike in writes and ingest all the data into Kafka.&lt;/p&gt;

&lt;p&gt;To fix the storage components, we recreated all the storage pods. InfluxDB also backs up all time series data to an object store (e.g., AWS S3, Azure Blob Store, or Google Cloud Storage). As pods came up, they downloaded a copy of the data from object storage and then indexed all the data to allow efficient reading. After that process completed, each storage pod contacted Kafka and read any unprocessed data in the WAL.&lt;/p&gt;

&lt;h2&gt;Disaster recovery — final phase&lt;/h2&gt;

&lt;p&gt;Once the process of creating the storage pods and indexing existing data was underway, the disaster recovery team was able to focus on fixing other parts of the system.&lt;/p&gt;

&lt;p&gt;We changed some of the settings for the storage cluster, reducing the number of replicas for some services to allow the pieces coming back online to start faster. At this point, we re-enabled ArgoCD so it could create any Kubernetes objects still missing.&lt;/p&gt;

&lt;p&gt;Once the initial deployment and the storage engine became fully functional, we could re-enable functionality for key processes, like querying data and viewing dashboards. While this process continued, we started to recreate the proper number of replicas for all resources and re-enabled any remaining functionality.&lt;/p&gt;

&lt;p&gt;Finally, with all the components deployed with the expected number of replicas and everything in a healthy and ready state, the team enabled scheduled tasks and did final QA checks to make sure that everything was running properly.&lt;/p&gt;

&lt;p&gt;In total, just under six hours elapsed from the time the PR was merged to the time we restored full functionality.&lt;/p&gt;

&lt;h2&gt;What we learned&lt;/h2&gt;

&lt;p&gt;After the incident, we performed a proper post-mortem to analyze what went well and what we could improve for future incidents.&lt;/p&gt;

&lt;p&gt;On the positive side of things, we were able to recover the system without losing any data. Any tools that retry writing data to InfluxDB continued to do so throughout the outage and, eventually, that data was written to InfluxDB Cloud. For example, &lt;a href="https://www.influxdata.com/time-series-platform/telegraf/"&gt;Telegraf&lt;/a&gt;, our open source collection agent, performs retries by default.&lt;/p&gt;

&lt;p&gt;The most significant problem was that our monitoring and alerting systems did not detect this issue right away. That is why our initial response was to try to roll back the change rather than to plan and perform a thought-out recovery process. We also lacked a runbook for losing part of, or an entire, InfluxDB Cloud instance.&lt;/p&gt;

&lt;p&gt;As an outcome of this incident, InfluxData engineering created runbooks focused on restoring state. We now have detailed instructions on how to proceed if a similar situation occurs, i.e., if Kubernetes objects (such as Persistent Volume Claims) get deleted, but the data on the underlying disks and volumes are preserved. We also made sure that all volumes in all our environments are set to retain data, even if the PVC object gets deleted.&lt;/p&gt;

&lt;p&gt;We have also improved our process for handling public-facing incidents. We aim to have as few incidents as possible, and this improved process should help us with any future public-facing problems with our platform.&lt;/p&gt;

&lt;p&gt;On the technical side, we realized our systems should have prevented the PR from being merged, and we took multiple steps to address this. We changed how InfluxDB stores generated YAML files, moving to a one-object-per-file approach, for example, v1.Service-(namespace).etcd.yaml for an etcd Service. In the future, a similar PR would clearly show up as an overwrite of an existing object and could not be mistaken for the addition of a new object.&lt;/p&gt;

&lt;p&gt;We also improved our tooling to detect duplicates when generating YAML files. The system now warns everyone of duplicates before submitting a change for review. Due to how Kubernetes works, the detection logic looks at more than just filenames. For example, apiVersion includes both the group name and version, so objects with apiVersion networking.k8s.io/v1beta1 and networking.k8s.io/v1 and the same namespace and name should be considered the same object despite the apiVersion strings being different.&lt;/p&gt;

&lt;p&gt;This incident was a valuable lesson in configuring our CD. ArgoCD allows adding specific annotations that prevent the deletion of certain resources. Adding a Prune=false annotation to all our stateful resources ensures ArgoCD leaves those resources intact in the event of a misconfiguration. We also add the annotation to Namespace objects managed by ArgoCD; otherwise, ArgoCD would leave the StatefulSet intact but might still delete the Namespace it lives in, causing a cascading deletion of all objects inside.&lt;/p&gt;
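&lt;p&gt;As a minimal sketch (the namespace name below is a placeholder), the annotation looks like this on a Namespace object:&lt;/p&gt;

```yaml
# With Prune=false, ArgoCD will not delete this object during a sync,
# even if it disappears from the rendered manifests.
apiVersion: v1
kind: Namespace
metadata:
  name: storage                       # placeholder name
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
```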

&lt;p&gt;We also added the FailOnSharedResource=true option for ArgoCD Application objects. This makes ArgoCD fail before attempting to apply any changes to an object that is, or was previously, managed by another ArgoCD application. This ensures that similar errors, or pointing ArgoCD at the wrong cluster or namespace, cause it to stop before making any changes to existing objects.&lt;/p&gt;
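&lt;p&gt;A sketch of an Application with this option set (all names, URLs, and paths below are placeholders, not our real configuration):&lt;/p&gt;

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app01-cluster01               # placeholder name
  namespace: argocd
spec:
  project: cluster01
  source:
    repoURL: https://example.com/infra-repo.git
    targetRevision: main
    path: env/cluster01/app01/yml
  destination:
    server: https://kubernetes.default.svc
    namespace: app01
  syncPolicy:
    syncOptions:
      # Fail the sync before touching an object that is, or was,
      # managed by a different ArgoCD Application.
      - FailOnSharedResource=true
```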

&lt;h2&gt;One final note&lt;/h2&gt;
&lt;p&gt;These are all changes we already wanted to make, and the incident spurred us to implement them to improve all our automation and processes. Hopefully, this deep dive into our experience will help you put an effective disaster recovery plan in place.&lt;/p&gt;
</description>
      <pubDate>Fri, 24 Jun 2022 07:00:00 +0000</pubDate>
      <link>https://www.influxdata.com/blog/deleting-production-steps-how-fix-it/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/deleting-production-steps-how-fix-it/</guid>
      <category>Use Cases</category>
      <category>Product</category>
      <category>Developer</category>
      <author>Wojciech Kocjan (InfluxData)</author>
    </item>
    <item>
      <title>Expand Kubernetes Monitoring with Telegraf Operator</title>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published in &lt;a href="https://thenewstack.io/expand-kubernetes-monitoring-with-telegraf-operator/"&gt;&lt;strong&gt;The New Stack&lt;/strong&gt;&lt;/a&gt;. &lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Monitoring is a critical aspect of cloud computing. At any time, you need to know what’s working, what isn’t, and have the ability to respond to changes occurring in a given environment. Effective monitoring begins with the ability to collect performance data from across an ecosystem and present it in a useful way. So the easier it is to manage monitoring data across an ecosystem, the more effective those monitoring solutions are and the more efficient that ecosystem is.&lt;/p&gt;

&lt;p&gt;Kubernetes is a cloud computing workhorse, and the automation it provides is a game changer. Still, unchecked automation has the potential to create issues, so it’s necessary to monitor those automated processes. A popular monitoring solution for Kubernetes environments is Prometheus.&lt;/p&gt;

&lt;p&gt;However, not all applications run exclusively in Kubernetes. If you want to use Prometheus to pull together metrics data from across multiple environments, including custom application servers, legacy systems and technology, you’re going to end up writing a lot of custom code to be able to access and ingest those metrics.&lt;/p&gt;

&lt;p&gt;Enter Telegraf Operator, an environment-agnostic Prometheus alternative.&lt;/p&gt;

&lt;h2&gt;What Is Telegraf Operator?&lt;/h2&gt;

&lt;p&gt;First, we should distinguish between Telegraf and the Telegraf Operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.influxdata.com/time-series-platform/telegraf/?utm_source=vendor&amp;amp;utm_medium=referral&amp;amp;utm_campaign=2021-09-23_blog_telegraf-operator_tns&amp;amp;utm_content=tns"&gt;Telegraf&lt;/a&gt; is an open source server agent designed to collect metrics from stacks, sensors and systems.&lt;/p&gt;

&lt;p&gt;The Telegraf Operator, on the other hand, is an application designed to create and manage individual Telegraf instances in Kubernetes clusters. Essentially, it functions as a control plane for managing the individual Telegraf instances deployed throughout your Kubernetes cluster. Telegraf Operator is a standalone application, and it’s deployed separately from Telegraf.&lt;/p&gt;

&lt;h2&gt;Telegraf Operator Considerations&lt;/h2&gt;

&lt;p&gt;On a basic level, the Telegraf Operator scrapes metrics from applications with exposed endpoints in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;There are two main mechanisms for deploying monitoring agents: DaemonSet and sidecar. Which mechanism you should use depends on exactly what you want to monitor. With InfluxDB, we recommend:&lt;/p&gt;
&lt;ul&gt;
 	&lt;li&gt;DaemonSet for node, pod and container metrics.&lt;/li&gt;
 	&lt;li&gt;Sidecar monitoring for microservices that expose large amounts of metrics.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In a DaemonSet scenario, Telegraf runs on each individual node and collects infrastructure metrics on the node itself.&lt;/p&gt;

&lt;p&gt;By contrast, in the sidecar deployment, the containerized instance of Telegraf shares the pod with the application container and scrapes data from exposed endpoints of that application.&lt;/p&gt;

&lt;h3&gt;Sidecar deployment&lt;/h3&gt;

&lt;p&gt;But where do sidecar Telegraf containers come from? That’s where Telegraf Operator comes into play.&lt;/p&gt;

&lt;p&gt;Telegraf Operator works at the pod level, so you can use it with anything that creates a pod object in your Kubernetes environment. When something — a deployment, StatefulSet, DaemonSet, Job or CronJob — sends out a request to create a new pod, Telegraf Operator intercepts that request, using the mutating webhooks functionality in Kubernetes, and gets a chance to apply changes to it.&lt;/p&gt;

&lt;p&gt;Telegraf Operator reads the &lt;a href="https://github.com/influxdata/telegraf-operator#pod-level-annotations"&gt;pod annotations&lt;/a&gt; in the request and if an annotation says to add a Telegraf sidecar, then Telegraf Operator adds that instance as an additional container within that pod. In other words, Telegraf Operator looks at the list of containers for the new pod and adds another container to the list if instructed to do so by the annotations.&lt;/p&gt;

&lt;p&gt;Once the Telegraf sidecar container is in place, it can begin scraping data and pushing metrics to a database such as InfluxDB.&lt;/p&gt;

&lt;p&gt;&lt;img src="/images/legacy-uploads/telegraf-sidecar-deployment-kubernetes.png" alt="" width="637" height="443" /&gt;&lt;/p&gt;

&lt;p&gt;Using a sidecar deployment for Kubernetes monitoring has several advantages. A sidecar monitoring agent lets you define custom metrics and monitoring of a specific application without affecting the overall monitoring framework shared by other workloads. This approach keeps endpoint exposure manageable. As more endpoints get exposed for an application, the sidecar approach facilitates better scalability because that Telegraf instance only scrapes data for the application in its pod.&lt;/p&gt;

&lt;h2&gt;Is the Telegraf Operator Right for You?&lt;/h2&gt;

&lt;p&gt;The answer to this question really depends on your ecosystem and what you’re trying to monitor. There are many different options and possibilities. We’ve outlined a few here.&lt;/p&gt;

&lt;h3&gt;Replacing Prometheus&lt;/h3&gt;

&lt;p&gt;Telegraf can function as a Prometheus server, so any metrics you want to collect with Prometheus you can also collect with Telegraf Operator. So, it’s possible to simply replace Prometheus with the Telegraf Operator. In this case, you’d replace the Prometheus exporters with Telegraf sidecar containers, add the annotations expected by Telegraf Operator to your pod specifications, and switch your data storage from Prometheus server to InfluxDB.&lt;/p&gt;

&lt;p&gt;However, if the idea of ripping out all your Prometheus monitoring seems too disruptive, then there are many ways to use Telegraf and the Telegraf Operator to enhance or supplement your current Prometheus monitoring with legacy and custom application metrics.&lt;/p&gt;

&lt;h3&gt;Swap out part of Prometheus&lt;/h3&gt;

&lt;p&gt;If you want greater flexibility and accessibility for a diverse range of metrics, one option is to configure your Prometheus server to write directly to Telegraf. You can do this via the &lt;a href="https://www.influxdata.com/blog/prometheus-remote-write-support-with-influxdb-2-0/?utm_source=vendor&amp;amp;utm_medium=referral&amp;amp;utm_campaign=2021-09-23_blog_telegraf-operator_tns&amp;amp;utm_content=tns"&gt;Prometheus Remote Write Telegraf plugin&lt;/a&gt;. You can configure the plugin to send the collected metrics to any database you want, such as InfluxDB. This setup allows you to send metrics from Prometheus server directly into InfluxDB, or, depending on your configuration, you can even send metrics to multiple locations. This is helpful if you want to do dual writing or create a backup system scenario for your metric data.&lt;/p&gt;

&lt;p&gt;Telegraf can also function as a Prometheus server, so another option is to replace your Prometheus server with Telegraf. You can keep your Prometheus exporters in place because Telegraf is able to ingest that data. Once Telegraf collects that data, you have the same flexibility to send it wherever you want.&lt;/p&gt;
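&lt;p&gt;For instance (the port and path below are illustrative, assuming an exporter that serves /metrics on port 9121), keeping an existing exporter and pointing a Telegraf sidecar at it only requires pod-template annotations like these:&lt;/p&gt;

```yaml
# Fragment of a pod template: the injected Telegraf sidecar scrapes the
# Prometheus exporter already running in the pod, taking over the job a
# Prometheus server scrape config used to do.
metadata:
  annotations:
    telegraf.influxdata.com/class: app
    telegraf.influxdata.com/port: "9121"     # exporter's metrics port
    telegraf.influxdata.com/path: /metrics
    telegraf.influxdata.com/scheme: http
```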

&lt;h3&gt;Run Telegraf and Prometheus in parallel&lt;/h3&gt;

&lt;p&gt;Yet another option is to run Prometheus and Telegraf in parallel. Both applications function the same way, scraping the data presented by the Prometheus exporters. You can configure both services to scrape data and then run side-by-side comparisons on that data, if necessary.&lt;/p&gt;

&lt;p&gt;Another possibility with this setup is to use the Telegraf instance to write data somewhere externally from your Kubernetes environment and use the Prometheus server for any needs you have within your Kubernetes environment.&lt;/p&gt;

&lt;h3&gt;Big-picture data collection&lt;/h3&gt;

&lt;p&gt;In addition to these sidecar use cases, you can also use Telegraf Operator to run DaemonSet monitoring simultaneously, so you can get metrics on the actual pods and nodes. Doing this saves metric data for the entire ecosystem in one place, providing centralized monitoring for every aspect of your ecosystem.&lt;/p&gt;

&lt;p&gt;If Kubernetes is just one of several environments your system runs on, the Telegraf Operator may be a better fit. As mentioned above, if you’re using Kubernetes and want to collect metrics from non-Kubernetes environments, things get more complicated. You’ll have to write a custom exporter for each type of technology you want metrics from and then configure it so that the Prometheus server can scrape those metrics.&lt;/p&gt;

&lt;p&gt;By contrast, Telegraf has hundreds of available plugins that collect metrics not only from Kubernetes, but from external environments as well.&lt;/p&gt;

&lt;h2&gt;Installing the Telegraf Operator in Kubernetes&lt;/h2&gt;

&lt;p&gt;The &lt;code class="language-yaml"&gt;telegraf-operator&lt;/code&gt; starts a pod in the cluster in its own namespace. Installing the &lt;code class="language-yaml"&gt;telegraf-operator&lt;/code&gt; is very simple and you can do it via kubectl, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code class="language-yaml"&gt;kubectl apply -f telegraf-operator.yml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;(You can find an example of the yml file in the &lt;a href="https://github.com/influxdata/telegraf-operator/blob/master/deploy/dev.yml"&gt;deploy directory&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;You can also use other tools, such as &lt;a href="https://github.com/influxdata/helm-charts/tree/master/charts/telegraf-operator"&gt;Helm&lt;/a&gt; or Jsonnet, to install telegraf-operator.&lt;/p&gt;

&lt;p&gt;&lt;code class="language-yaml"&gt;helm upgrade --install my-release influxdata/telegraf-operator&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once installed, Telegraf Operator watches for pods deployed with a specific set of &lt;a href="https://github.com/influxdata/telegraf-operator#pod-level-annotations"&gt;pod annotations&lt;/a&gt;, as mentioned above. The advantage to using Telegraf Operator is that you only have to define the input plugin configuration for Telegraf when creating the pod annotations. Telegraf Operator then sets the configuration for the entire cluster so your users don’t need to worry about configuring a metrics destination when deploying applications.&lt;/p&gt;

&lt;h2&gt;Start scraping metrics&lt;/h2&gt;

&lt;p&gt;Once you’ve installed Telegraf Operator, you just need to annotate the pod of the application container to start scraping your application or metrics endpoint.&lt;/p&gt;

&lt;p&gt;Here’s an example of a DaemonSet deployment YAML file with Telegraf configuration data:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-yaml"&gt;apiVersion: apps/v1
kind: DaemonSet
metadata:
 name: my-application
 namespace: default
spec:
 selector:
   matchLabels:
     app: my-application
 template:
   metadata:
     labels:
       app: my-application
     annotations:
       telegraf.influxdata.com/class: app
       telegraf.influxdata.com/port: "8080"
       telegraf.influxdata.com/path: /v1/metrics
       telegraf.influxdata.com/interval: 5s
       telegraf.influxdata.com/scheme: http
       telegraf.influxdata.com/internal: "true"
   spec:
     containers:
     - name: my-application
       image: my-application:latest&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And here’s a &lt;a href="https://github.com/influxdata/telegraf-operator/blob/master/examples/redis.yml"&gt;sample of a StatefulSet deployment of Redis&lt;/a&gt; YAML file with Telegraf configuration data:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-yaml"&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
 name: redis
 namespace: test
spec:
 selector:
   matchLabels:
     app: redis
 serviceName: redis
 template:
   metadata:
     labels:
       app: redis
     annotations:
       telegraf.influxdata.com/inputs: |+
         [[inputs.redis]]
           servers = ["tcp://localhost:6379"]
       telegraf.influxdata.com/class: app
   spec:
     containers:
     - name: redis
       image: redis:alpine&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Configuring the Telegraf Operator&lt;/h2&gt;

&lt;p&gt;As mentioned above, &lt;code class="language-yaml"&gt;telegraf-operator&lt;/code&gt; reads pod annotations to determine whether to inject the Telegraf sidecar and what configuration to apply.&lt;/p&gt;

&lt;p&gt;Use the &lt;code class="language-yaml"&gt;telegraf.influxdata.com/inputs&lt;/code&gt; annotation to pass telegraf configuration statements. You can pass configurations for any of the more than 200 Telegraf plugins this way. For Prometheus-based metrics, add &lt;code class="language-yaml"&gt;telegraf.influxdata.com/port&lt;/code&gt; along with any other annotations, such as &lt;code class="language-yaml"&gt;telegraf.influxdata.com/path&lt;/code&gt; or &lt;code class="language-yaml"&gt;telegraf.influxdata.com/interval&lt;/code&gt;, and &lt;code class="language-yaml"&gt;telegraf-operator&lt;/code&gt; generates part of the configuration for &lt;code class="language-yaml"&gt;inputs.prometheus&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code class="language-yaml"&gt;telegraf.influxdata.com/class&lt;/code&gt; annotation specifies the class of monitoring for the pod. The classes are defined in a Kubernetes secret, which &lt;code class="language-yaml"&gt;telegraf-operator&lt;/code&gt; reads and later combines into the final configuration for Telegraf.&lt;/p&gt;

&lt;p&gt;Classes usually specify outputs where the data should be sent, such as:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-yaml"&gt;apiVersion: v1
kind: Secret
... 
spec:
  stringData:
    app: |+
      [[outputs.influxdb]]
        urls = ["http://influxdb.influxdb:8086"]
      [[outputs.file]]
        files = ["stdout"]
      [global_tags]
        hostname = "$HOSTNAME"
        nodename = "$NODENAME"
        type = "app"&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;a href="https://github.com/influxdata/telegraf-operator#pod-level-annotations"&gt;Pod-level annotations documentation&lt;/a&gt; describes all supported annotations. &lt;a href="https://github.com/influxdata/telegraf-operator#global-configuration---classes"&gt;Global configuration – classes documentation&lt;/a&gt; defines the classes.&lt;/p&gt;

&lt;p&gt;As of version 1.3.0, telegraf-operator supports hot reloading of configurations; this also requires Telegraf version 1.19 or later. With this feature, changing the global configuration causes all Telegraf sidecars’ configurations to be updated and reloaded by the telegraf process, without the need to manually restart any of the pods. More details can be found in the &lt;a href="https://github.com/influxdata/telegraf-operator#hot-reload"&gt;Hot reload documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Contribute to Telegraf Operator&lt;/h2&gt;

&lt;p&gt;At InfluxData we love open source, so if you’re interested in contributing to the Telegraf Operator project, we’d love to hear from you. You can reach us on Slack or check out our &lt;a href="https://github.com/influxdata/telegraf-operator"&gt;GitHub repos&lt;/a&gt; for more information.&lt;/p&gt;
</description>
      <pubDate>Fri, 03 Dec 2021 04:00:42 -0700</pubDate>
      <link>https://www.influxdata.com/blog/expand-kubernetes-monitoring-telegraf-operator/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/expand-kubernetes-monitoring-telegraf-operator/</guid>
      <category>Use Cases</category>
      <category>Product</category>
      <category>Developer</category>
      <author>Wojciech Kocjan (InfluxData)</author>
    </item>
    <item>
      <title>Using Telegraf on Windows</title>
      <description>&lt;p&gt;Telegraf is an agent that runs on your operating system of choice, schedules gathering metrics and events from various sources, and then sends them to one or more sinks, such as InfluxDB or Kafka. It supports InfluxDB 1.x and 2.0, as well as &lt;a href="https://cloud2.influxdata.com/"&gt;InfluxDB Cloud&lt;/a&gt;. Telegraf can collect information from multiple inputs and currently includes over 200 plugins for retrieving information from many types of applications. It can also retrieve information about hardware and software from the OS.&lt;/p&gt;

&lt;p&gt;One of the questions that gets asked often is: What is the best way to run Telegraf on Windows machines? Our &lt;a href="https://github.com/influxdata/telegraf"&gt;GitHub repository&lt;/a&gt; provides documentation on &lt;a href="https://github.com/influxdata/telegraf/blob/master/docs/WINDOWS_SERVICE.md"&gt;Running Telegraf as a Windows Service&lt;/a&gt;. However, in this post, we’re going to go through a step-by-step setup of Telegraf on Windows, including how to securely configure it with credentials for pushing data to various InfluxDB solutions.&lt;/p&gt;

&lt;p&gt;We will perform the installation from an elevated PowerShell session.&lt;/p&gt;

&lt;p&gt;To start an elevated session of PowerShell, open the Start Menu, find PowerShell, right-click it and choose the &lt;strong&gt;Run as administrator&lt;/strong&gt; option.&lt;/p&gt;
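&lt;p&gt;Alternatively, if you already have a regular PowerShell window open, you can launch an elevated session from it (this triggers a UAC confirmation prompt):&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; Start-Process powershell -Verb RunAs&lt;/code&gt;&lt;/pre&gt;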

&lt;p&gt;Now, let’s download the Windows binaries of Telegraf, which are available from &lt;a href="https://portal.influxdata.com/downloads/"&gt;https://portal.influxdata.com/downloads/&lt;/a&gt;. The example below uses the Invoke-WebRequest PowerShell cmdlet to download the archive:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; cd ~\Downloads
PS&amp;gt; Invoke-WebRequest https://dl.influxdata.com/telegraf/releases/telegraf-1.80.0_windows_amd64.zip -OutFile telegraf.zip&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, let’s extract the archive into the Program Files folder, which creates the C:\Program Files\telegraf folder:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; Expand-Archive .\telegraf.zip 'C:\Program Files\'&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then create a &lt;strong&gt;conf&lt;/strong&gt; subdirectory and copy the &lt;strong&gt;telegraf.conf&lt;/strong&gt; as &lt;strong&gt;conf\inputs.conf&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; mkdir 'C:\Program Files\telegraf\conf'
PS&amp;gt; cd 'C:\Program Files\telegraf\conf'
PS&amp;gt; copy ..\telegraf.conf inputs.conf&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We’re going to separate the outputs section of the file and configure sending data to InfluxDB Cloud specifically. First, remove the outputs section from &lt;strong&gt;inputs.conf&lt;/strong&gt;: edit the file and delete everything before the inputs section, so that the file starts with the following lines:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-markup"&gt;###############################################################################
#                                  INPUTS                                     #
###############################################################################&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For editing files, it’s recommended to start your editor from the elevated PowerShell session, since an editor started from an elevated process will have permission to write files under &lt;strong&gt;C:\Program Files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now, create &lt;strong&gt;conf\outputs.conf&lt;/strong&gt; file that specifies where the data should be sent.&lt;/p&gt;

&lt;p&gt;In my case, I want the output to go to my InfluxDB Cloud account, so the file will contain:&lt;/p&gt;
&lt;pre class="line-numbers"&gt;&lt;code class="language-ini"&gt;[[outputs.influxdb_v2]]
  # URL to InfluxDB cloud or your own instance of InfluxDB 2.0
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  ## Token for authentication.
  token = "$INFLUX_TOKEN"
  ## Organization is the name of the organization you wish to write to; must exist.
  organization = "$INFLUX_ORG"
  bucket = "$INFLUX_BUCKET"&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For sending data to other instances and/or versions of InfluxDB, the outputs section may differ. Also note that Telegraf can send data to more than one destination, such as InfluxDB 1.x and InfluxDB 2.0.&lt;/p&gt;
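&lt;p&gt;For example, to write the same metrics to both an InfluxDB 1.x server and InfluxDB Cloud, the outputs file could contain two sections; the 1.x URL and database name below are placeholder values:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-ini"&gt;[[outputs.influxdb]]
  ## InfluxDB 1.x instance; example URL and database name
  urls = ["http://influxdb1.example.com:8086"]
  database = "telegraf"

[[outputs.influxdb_v2]]
  ## InfluxDB Cloud
  urls = ["https://us-west-2-1.aws.cloud2.influxdata.com"]
  token = "$INFLUX_TOKEN"
  organization = "$INFLUX_ORG"
  bucket = "$INFLUX_BUCKET"&lt;/code&gt;&lt;/pre&gt;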

&lt;p&gt;Replace &lt;strong&gt;$INFLUX_TOKEN&lt;/strong&gt;, &lt;strong&gt;$INFLUX_ORG&lt;/strong&gt; and &lt;strong&gt;$INFLUX_BUCKET&lt;/strong&gt; with your access token, your organization name and the name of the InfluxDB bucket to write data to, along with any other connectivity details your setup requires.&lt;/p&gt;

&lt;p&gt;At this point it is a good idea to test that Telegraf works correctly:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; .\telegraf --config-directory 'C:\Program Files\telegraf\conf' --test&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This should output logs indicating Telegraf has started, followed by multiple lines of data retrieved from all of the input plugins.&lt;/p&gt;

&lt;p&gt;Next, let’s ensure that only the Local System user account can read the &lt;strong&gt;outputs.conf&lt;/strong&gt; file to prevent unauthorized users from retrieving our access token for InfluxDB.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; icacls outputs.conf /reset
PS&amp;gt; icacls outputs.conf /inheritance:r /grant system:r&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The icacls command is a built-in tool for managing access control lists (ACLs) for objects in Microsoft Windows and is described in more detail &lt;a href="https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/icacls"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first command (&lt;strong&gt;/reset&lt;/strong&gt;) removes all explicit ACLs, so the file only inherits permissions from its parent object, in our case the &lt;strong&gt;C:\Program Files\telegraf\conf&lt;/strong&gt; directory. The second command does two things: the &lt;strong&gt;/inheritance:r&lt;/strong&gt; flag disables inheritance, effectively removing any ACLs for the file, at which point no user can access it; the &lt;strong&gt;/grant system:r&lt;/strong&gt; flag then allows the &lt;strong&gt;Local System&lt;/strong&gt; built-in account to read the file.&lt;/p&gt;

&lt;p&gt;This way only the Telegraf service will be able to read the configuration on where the data is sent, including the token.&lt;/p&gt;
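&lt;p&gt;You can verify the result by running icacls with just the file name, which lists the ACL entries for the file:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; icacls outputs.conf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only an entry for &lt;strong&gt;NT AUTHORITY\SYSTEM&lt;/strong&gt; with read access should be listed.&lt;/p&gt;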

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; All users with administrator access to the Windows machine will be able to change the permissions of the file and read it. However, this prevents non-admin users from retrieving the information.&lt;/p&gt;

&lt;p&gt;We can now install Telegraf as a Windows service so that it starts automatically along with our system. To do this, simply run:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; cd 'C:\Program Files'
PS&amp;gt; .\telegraf --service install --config-directory 'C:\Program Files\telegraf\conf'
PS&amp;gt; net start&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will create a Telegraf service and start it. The output should include the following message:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-bash"&gt;The Telegraf Data Collector Service service is starting.
The Telegraf Data Collector Service service was started successfully.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point Telegraf is installed and running as a service, and we have applied best practices for storing and accessing the credentials used to send data to InfluxDB.&lt;/p&gt;
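&lt;p&gt;You can confirm the service status at any later point from PowerShell:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;PS&amp;gt; Get-Service telegraf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;strong&gt;Status&lt;/strong&gt; column should show &lt;strong&gt;Running&lt;/strong&gt;.&lt;/p&gt;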

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; As part of security best practices, the token created for Telegraf should also have a limited scope, allowing it only to write data to the specified bucket.&lt;/p&gt;

&lt;p&gt;As an alternative, it’s also possible to keep &lt;strong&gt;$INFLUX_TOKEN&lt;/strong&gt;, &lt;strong&gt;$INFLUX_ORG&lt;/strong&gt; and &lt;strong&gt;$INFLUX_BUCKET&lt;/strong&gt; as placeholders in your configuration file. The Telegraf service reads these values from environment variables and substitutes them at startup.&lt;/p&gt;

&lt;p&gt;By default, Windows services see all of the environment variables set by Microsoft Windows as well as system-wide environment variables. It’s also possible to pass environment variables specific to a service by setting them in the registry key for that service.&lt;/p&gt;

&lt;p&gt;To pass additional environment variables to the Telegraf service, run the &lt;a href="https://support.microsoft.com/en-us/help/4027573/windows-10-open-registry-editor"&gt;Registry Editor&lt;/a&gt; and, once Telegraf has been installed as a service, go to the &lt;strong&gt;HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\telegraf&lt;/strong&gt; key. This is where Windows maintains all of the information for this specific service.&lt;/p&gt;

&lt;p&gt;Create a multi-string value (REG_MULTI_SZ) named &lt;strong&gt;Environment&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img class="alignnone size-full wp-image-239348 aligncenter" src="/images/legacy-uploads/multi-string-value-registry-telegraf.png" alt="Nulti-String Value Registry - Telegraf" width="558" height="284" /&gt;&lt;/p&gt;

&lt;p&gt;Next, edit the value, setting each line to the &lt;strong&gt;Key=Value&lt;/strong&gt; format, where &lt;strong&gt;Key&lt;/strong&gt; is the environment variable name and &lt;strong&gt;Value&lt;/strong&gt; is its value, for example:&lt;/p&gt;

&lt;p&gt;&lt;img class="size-full wp-image-239349 aligncenter" src="/images/legacy-uploads/edit-registry-values-telegraf.png" alt="Edit registry values - Telegraf" width="512" height="454" /&gt;&lt;/p&gt;

&lt;p&gt;After that the Telegraf service will have the required environment variables set.&lt;/p&gt;
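&lt;p&gt;The same registry value can also be created directly from the elevated PowerShell session. This is a sketch, assuming the service was registered under the name &lt;strong&gt;telegraf&lt;/strong&gt; and using example variable values:&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-powershell"&gt;# Create the Environment multi-string value for the telegraf service;
# replace the example values with your own token, organization and bucket
PS&amp;gt; New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\telegraf' `
      -Name Environment -PropertyType MultiString `
      -Value @('INFLUX_TOKEN=my-token', 'INFLUX_ORG=my-org', 'INFLUX_BUCKET=telegraf')
PS&amp;gt; Restart-Service telegraf&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Restarting the service makes it pick up the new environment variables.&lt;/p&gt;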

&lt;p&gt;The downside of using the &lt;strong&gt;Environment&lt;/strong&gt; registry value is that it is harder to manage ACLs on it and prevent unauthorized users from reading the value. Therefore, if possible, we recommend storing credentials in the file system and using ACLs on the configuration file, as file ACLs can also be inspected using tools such as Windows Explorer.&lt;/p&gt;

&lt;p&gt;At this point, our Windows server, desktop or laptop is sending its performance metrics and other monitoring data to our InfluxDB database(s), where it can be viewed from the &lt;a href="https://v2.docs.influxdata.com/v2.0/visualize-data/explore-metrics/"&gt;Data Explorer&lt;/a&gt;. InfluxDB can also visualize this information using &lt;a href="https://v2.docs.influxdata.com/v2.0/visualize-data/dashboards/"&gt;Dashboards&lt;/a&gt;.&lt;/p&gt;
</description>
      <pubDate>Fri, 22 Nov 2019 12:02:51 -0700</pubDate>
      <link>https://www.influxdata.com/blog/using-telegraf-on-windows/</link>
      <guid isPermaLink="true">https://www.influxdata.com/blog/using-telegraf-on-windows/</guid>
      <category>Use Cases</category>
      <category>Product</category>
      <category>Developer</category>
      <author>Wojciech Kocjan (InfluxData)</author>
    </item>
  </channel>
</rss>
