InfluxData Blog - Gianluca Arbezzano

Why You Can't Afford to Ignore Distributed Tracing for Observability

Gianluca Arbezzano (InfluxData) - Wed, 30 Jan 2019 09:00:48 -0700

Observability is a hot topic, but not a lot of people know what it truly means. Everyone reads about monitoring vs. observability these days, and I have had the chance to experience what I think is the main concept behind this movement.

First of all, monitoring is complicated. Dashboards don’t scale because they usually reveal the information you need only after an outage has happened, and at some point, looking for spikes in your graphs becomes a straining eye exercise. And that’s not monitoring; it is just a not-very-smart way to understand how something isn’t working. In other words, monitoring is just the tip of the iceberg, where the solid foundation is the knowledge you have of your system.

However, nowadays distributed systems are too complicated to understand. A single request can travel back and forth between many services and applications, some not even owned by your company. Every one of the actors that handle a request can fail, and when that happens, you need a way to answer the basic question: “What just happened?” Observability does that with a set of tools, methodologies, and a mindset that allow you and your team to learn about a system.

In particular, I think distributed tracing is an observability tool that you need to have in your toolkit. Here, I share my experiences that led me to this conclusion.

I work at InfluxData as an SRE, in particular on our SaaS product, InfluxDB Cloud. A few months ago, after a crazy growth spurt, we learned that our home-grown orchestrator (as it was at that time) was not sustainable and did not give us enough confidence that things were in working order. Luckily for us, we know the code really well because we wrote it from scratch: we were not using a standard configuration management tool or an orchestrator like Kubernetes, so we wrote our own little orchestrator to solve our specific use case. It runs as a daemon and sets up an isolated environment on AWS for every customer, with InfluxDB, Chronograf, Kapacitor, and the other add-ons that our service offers.

We decided to refactor the application, starting from the most important and most complicated flows:

Cluster Creation: the flow that creates subnets, security groups, instances for a new customer;
Cluster termination: used when a customer stops paying for the service;
Add-On creation: to spin up add-ons like Grafana, Kapacitor, or Chronograf.

To do that we used a pattern called reactive planning. Basically, you create a plan (cluster creation for example) and split the plan into different steps:

Create a security group;
Configure ingress/egress permissions;
Create instances;
Wait for instances to run;
Create load balancer.

There is a scheduler that takes this plan and executes every step one by one. The most important part is that it then executes the entire plan a second time. The second time, the plan should return no steps because all of them should already have been executed, which means that the plan is solved. This is nice and sweet in this context because it forces you to double-check that everything is in working order. For example, the plan checks whether the security group has been created on AWS, and if it has, it doesn’t return that step again. This makes the provisioning rock-solid.

Another good side effect is that all the execution logic is in the same place, within the scheduler. It means that there is only a single place to look in order to understand what is going on. The scheduler looks like this:

func (s *Scheduler) Execute(ctx context.Context, p Plan) error {
	for {
		// Create the plan.
		steps := p.Create(ctx)
		if len(steps) == 0 {
			break
		}
		err := s.react(ctx, steps)
		if err != nil {
			return err
		}
	}
	return nil
}

The react function is recursive because steps can return new steps.

func (s *Scheduler) react(ctx context.Context, steps []Procedure) error {
	for _, step := range steps {
		span, _ := opentracing.StartSpanFromContext(ctx, step.Identifier())
		step.WithSpan(span)

		logger := s.logger
		f := []zapcore.Field{zap.String("step", step.Identifier())}
		zipkinSpan, ok := span.Context().(zipkin.SpanContext)
		if ok && !zipkinSpan.TraceID.Empty() {
			f = append(f, zap.String("trace_id", zipkinSpan.TraceID.ToHex()))
		}
		logger = s.logger.With(f...)
		step.WithLogger(logger)

		innerSteps, err := step.Do(ctx)
		if err != nil {
                             ….
		}
		span.Finish()
		if len(innerSteps) > 0 {
			if err := s.react(ctx, innerSteps); err != nil {
				return err
			}
		}
	}
	return nil
}

Besides the Do() function, which holds the logic to execute, every step has two other functions: WithLogger and WithSpan. I added them only for observability purposes. As you can see, I am instrumenting my plan using OpenTracing.

With these few lines of code, I am able to configure the logger to always expose the trace_id, and in this way, I can easily query my logs per request. Inside the step, I have easy access to the span, so I can attach more context about what happens during the step. For example, we use etcd. The step that saves information inside etcd contains the key and the value. When I look at the trace, I can understand how the record changed during the execution of the plan.

Furthermore, we have a frontend that logs the trace_id for every interaction with the backend:

Nov 15 19:04:45  PATCH https://what.net/v1/clusters/idg trace_id:d572232a8fed45fa 422

I am able to grab this from the log because the backend returns the trace_id as an HTTP header. This is another easy way to look up traces from logs, or directly from a particular request if you are able to teach the user to attach it, for example, in a support ticket.

Yet a few days ago, we had a problem. Some of the clusters (a low percentage of them) were failing during creation, but AWS was not reporting any errors (we also trace all the AWS requests by the way). All the resources were created, but the EC2 was not registered in the target group.

The creation of clusters is a complex flow, as there are more than 40 interactions with AWS and it can take more than 10 minutes to complete all the steps. But looking at the trace, it was very easy to understand where the error was. This is a snapshot for the part of the trace that takes the EC2 and registers them to the load balancer for a cluster running:

Every step has a name, and the one that I am looking at is the register_cluster_node_to_lb. As you can see, it calls the AWS.EC2 one time to get all the instance IDs to attach and it calls two times the AWS.ELBv2 service: once to get the right target group and once to register the instances.

Comparing this trace with one that failed makes it easy to see that in the second screen there is only one request to AWS.ELBv2. So that’s where the issue was! It ended up being an error that was not handled well.

This specific application doesn’t handle a lot of load, but it has critical flows where the ability to troubleshoot is important. Visibility into how a plan executes is a feature, not an option. That’s why we trace all the AWS requests, including, for example, the request and the response:

Observability is important not only in production but also in development, as it teaches you how your program works. That’s why I also use this setup in development. Observability is the ability to improve how fast and how well you develop a feature. This is great because it makes observability very easy to improve and keep up to date. In contrast, monitoring is only important in production, and you normally don’t care about triggering alerts during your development cycle.

We currently have this system running in production, which gives me the confidence I need to properly help our support team when they ask me to troubleshoot a customer issue. It is particularly helpful because within a couple of seconds I can see where failures happen, and the reactive plans highlight the relevant code to refactor, test or optimize.

I think distributed tracing is key in development cycles because the architecture of modern applications is very different from that of older ones — not because modern applications are smaller or micro but because there are a lot of integration points with external databases, third-party services and so on. Circuit breakers and retry policies make the debugging a lot more complicated, and some failures are unpredictable or very expensive to handle if they don’t happen often. With this in mind, for some failures, it is better to have a fast way to understand the new issue than try to predict and avoid every possible failure because that’s simply not possible anymore.

The Developer's Guide to Not Losing the Metrics You Need

Gianluca Arbezzano (InfluxData) - Tue, 04 Dec 2018 09:00:41 -0700

Gathering and storing metrics is one of the many parallel tasks a developer must do throughout the production cycle. Since you never know when an adverse event might occur, you want to have the metric you need to debug a problem when and if you need it.

However, you cannot store all metrics forever. This even applies to purpose-built time series databases, such as the one InfluxDB offers, which is intended for high-cardinality data. Time series databases may seem “magical” when it comes to scalability, but they cannot scale infinitely, and at some point, even InfluxDB’s tool will reach a limit.

The limitations of time series databases are why you should manage your storage as a set of tubes with different sizes. When you don’t know whether a metric or a trace will be useful, or if storage becomes too expensive because you collected too many metrics, you can always store them for a short period of time. You can also move them later if they become useful, or after aggregating them to decrease their pressure on your system.

If you are using InfluxDB, you can set a minimal retention policy (of just two to three days, for example) for all your metrics. You can then move the metrics to another InfluxDB instance (with a longer retention policy) using Kapacitor, the TICK Stack’s native data processing engine.
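As a small illustration, the following sketch builds the InfluxQL statement that creates such a short retention policy. The helper names (quoteIdent, createRetentionPolicy) and the policy and database names are hypothetical, and the replication factor is fixed to 1 on the assumption of a single-node instance; how you send the statement to InfluxDB (CLI, HTTP API or a client library) is up to you.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent wraps an InfluxQL identifier in double quotes and escapes any
// double quote it contains.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `\"`) + `"`
}

// createRetentionPolicy builds a CREATE RETENTION POLICY statement.
// duration is an InfluxQL duration literal such as "3d" or "52w".
// The replication factor is fixed to 1 (single-node assumption).
func createRetentionPolicy(name, db, duration string, isDefault bool) string {
	q := fmt.Sprintf("CREATE RETENTION POLICY %s ON %s DURATION %s REPLICATION 1",
		quoteIdent(name), quoteIdent(db), duration)
	if isDefault {
		q += " DEFAULT"
	}
	return q
}

func main() {
	// Short-lived policy for everything that lands in the first instance.
	fmt.Println(createRetentionPolicy("three_days", "metrics", "3d", true))
	// Longer policy for the instance that receives the metrics worth keeping.
	fmt.Println(createRetentionPolicy("one_year", "metrics_archive", "52w", true))
}
```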

This is a hard issue, but I’m comfortable with having logs, traces and metrics in the same place. The end result is easy aggregation across all these different points of view on your system. At the end of the day, metrics and logs are just different representations of the reality of your system, and having them together works like a powerful crystal ball capable of answering questions about the state of your system with a much higher level of granularity.

The only way to store this giant amount of data, as described above, is through retention policies and data aggregation. One solution is Kapacitor, which can process both stream and batch data from InfluxDB. It lets you plug in your own custom logic or user-defined functions to process alerts with dynamic thresholds, match metrics for patterns and compute statistical anomalies. It can also perform specific actions based on these alerts, like dynamic load rebalancing.

Using Kapacitor with InfluxDB is simple and can allow you to store data as is or send it as an aggregate. In either case, this is a straightforward way for you to start looking at your metric data to determine what you need to keep—all without the guesswork.

In addition, when it comes to collecting system and application data, developers today also talk about logs, events, and tracing.

Access to metrics is like being able to get a good night’s sleep—while events are like a slap in the face. You first design metrics in the form of pretty graphs to understand normal system operation. When something deviates from that normal functioning, you get a slap that wakes you up when an event occurs.

With every unexpected slap in the middle of the night, confusion follows. You look around you to get information about what is happening, who interrupted your dreams and why, but metrics and events don’t tell the full story. That’s where monitoring alone falls short.

At this point, it is important to remind yourself that monitoring helps you know when something goes wrong but doesn’t answer the following questions:

What is going on?
How can I fix it so I can go back to sleep again?

At this juncture, if the problem resides in your application, you start studying logs to see what is going on. If you are in a small low-traffic environment, you will probably find what you are looking for, and then you are done.

However, if yours is a complex system, distributed or heavily reliant on third-party services, the logs are massive and you can’t identify what is broken by simply watching them. In this case, you need to reduce the scope of the outage. To do that, you can look at your traces. Although identifying traces can be a challenge, there are two actions you can take:

First, expose the trace_id (the identifier for every trace/request) inside your logs to connect them. It will help you filter logs for a specific request.
Second, teach your support team and customers why the trace_id is essential. They should know that it is the key to finding out what is happening. If you have tech people as customers, it's easier when you provide them with an HTTP HEADER for example. If your customers are nontechnical, then it is a good strategy to have your UI send back the appropriate identifier.
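Once the trace_id is in every log line, narrowing the logs down to one request is a simple filter. The sketch below assumes a plain-text format where each line carries a whitespace-separated token of the form trace_id:<id>; that format, the filterByTraceID helper and the sample lines are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterByTraceID returns the log lines that contain the exact token
// "trace_id:<id>". Matching whole tokens avoids accidental prefix matches.
func filterByTraceID(logs, id string) []string {
	needle := fmt.Sprintf("trace_id:%s", id)
	var out []string
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		line := sc.Text()
		for _, field := range strings.Fields(line) {
			if field == needle {
				out = append(out, line)
				break
			}
		}
	}
	return out
}

func main() {
	// Hypothetical log lines from a frontend and a backend.
	logs := `Nov 15 19:04:45 PATCH /v1/clusters/a trace_id:d572232a8fed45fa 422
Nov 15 19:04:46 GET /v1/clusters/b trace_id:0000000000000001 200
Nov 15 19:04:47 create_instances failed trace_id:d572232a8fed45fa`

	for _, line := range filterByTraceID(logs, "d572232a8fed45fa") {
		fmt.Println(line)
	}
}
```

In practice you would run the same kind of query in your log storage rather than in application code, but the principle is the same: the trace_id is the join key between a request, its logs and its trace.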

Real issues I encountered led me to write this blog. Everything I wrote is a lesson learned from troubleshooting distributed applications: that is, metrics, events, logs and traces are not mutually exclusive. They are tools to make debugging, monitoring and observability possible. I can’t wait to have a single solution to group them all in order to make my life as a developer more awesome than it already is.

How to Use the Open Source TICK Stack to Spin Up a Modern Monitoring System for Your Application and Infrastructure

Gianluca Arbezzano (InfluxData) - Thu, 23 Aug 2018 09:00:40 -0700

Our applications speak, and time series is one of their languages. DevOps, cloud computing and containers have changed how we write and run our applications. This post shows what InfluxData and the community are building to provide a modern and flexible monitoring toolkit based on an open-source set of projects.

During the last decade, everything has changed: containers, virtual machines, cloud computing. In addition, everything moves faster, and we need an environment capable of supporting this speed. Our applications need to grow in a way that helps us maintain them over time. To do that, we need to understand how our applications behave and be ready to fail and to improve. We have the tools and technology to do that and just need to put it all together to understand how applications are behaving and how infrastructure is growing, and ultimately, to understand system failures or errors in order to improve performance.

Monitoring Logs

We have long been reading logs and have some tools that help us understand how our applications behave. And we do that because:

We need to trust somebody. We don't know how our application behaves just by looking at the screen. We need to understand how our users are using our application and how many errors there are. There are a lot of metrics to track. And all of them come together to build trust in our systems.
We want to predict the future. We want to base those predictions on the metrics and behaviors we identify, to understand whether we are growing, how much we are growing, and how much faster we can grow in the future. With all this information, we can design a plan and maybe predict some bad events (and also because we are not John Rambo, who definitely knows what to do in any kind of situation!).

<figcaption> Example of a log</figcaption>

System monitoring teams use a really powerful command called “tail” to read logs. That’s how our applications usually speak. When it comes to logs, there is a base or normal state. If our logs are streaming in this normal state, then all is fine. If they stream too fast or too slow, then something is wrong, and corrective action is needed.

This isn’t the smartest way to understand how your application speaks, but it’s the most common way that everybody is using to monitor an application. We can definitely do something better now, and this relates to the nature of logs.

Logs are descriptive and contain a lot of information. But they are really expensive to store in a database. They are not easy to index because they are usually in plain-text format. This means that your engine needs to work hard to understand connections with other logs or to allow you to search them for what you are seeking. If you have a lot of logs, or are using logs for everything that happens in your application, you need to have a good system behind you. It’s difficult, but it’s definitely not impossible.

There are a lot of tools and services that help put a log together and figure out what’s happening, such as Logstash, Kibana, Elasticsearch, NewRelic, CloudWatch, Graphite and others. Some of them are offered as-a-service, some are open source projects, and some are both. The point is, there are a lot of choices.

Choosing Log Monitoring Tools

Choosing the right tools always depends on what you are doing. There are use cases where you need logs for forensics or archival. And since logs contain detailed information about what’s happening, you will be able to use them for those use cases. When you are faced with an error, you will be able to ascertain what kind of error it is. Logs are, more often than not, used to obtain this kind of information.

However, there are other cases where you just need to know how your application is behaving: whether it’s growing or shrinking in size, or how the errors are distributed in time. You really don’t need to know why or what’s happening at the root; you just need to know that there is a behavioral change. Time series, on the other hand, are also used day to day to help you understand how your systems are performing. They are not as detailed as logs, as they speak another language. For example, CPU and memory usage are time series.

You can’t just use time series and not use logs because there are some problems that you need to solve with proper logs. And I am not here to debate the benefits of logs over time series or vice versa, since it is likely that you will need both and you will extract value from both. Not only will you use both, but in fact, logs are a form of time series data. If you take your log and reduce it to a timestamp and a value, then you can do some math on top of it, and logs become easier to index.

You are in practice translating your logs into a time series. If you think about how many logins there are in your application, how many errors you are having, or how many transactions you are doing if you are a financial company, those are all time series because it’s a value, one login, at some point in time. It’s a distribution in time. That’s what time series means. That’s how logs can be translated. This is not an integer or a value. It’s a proper log just from a different point of view.

To simplify, you can reduce your log to just a value, and you have a point in time that you can aggregate, compare and so on. If you think about your application for about 10 minutes, you can come up with a lot of time series.
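Here is a minimal sketch of that translation: it reduces log lines to one point per minute, where the value is the number of lines containing a keyword. The log format (an RFC 3339 timestamp, a space, then the message), the Point type and the countPerMinute helper are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// Point is one time series sample: a timestamp and a numeric value.
type Point struct {
	Time  time.Time
	Value int
}

func (p Point) String() string {
	return fmt.Sprintf("%s %d", p.Time.Format(time.RFC3339), p.Value)
}

// countPerMinute turns log lines into a series with one point per minute.
// The value is the number of lines in that minute containing keyword.
// Lines without a parsable leading timestamp are skipped.
func countPerMinute(lines []string, keyword string) []Point {
	counts := map[int64]int{}
	for _, line := range lines {
		ts, rest, ok := strings.Cut(line, " ")
		if !ok || !strings.Contains(rest, keyword) {
			continue
		}
		t, err := time.Parse(time.RFC3339, ts)
		if err != nil {
			continue
		}
		counts[t.Truncate(time.Minute).Unix()]++
	}

	points := make([]Point, 0, len(counts))
	for sec, n := range counts {
		points = append(points, Point{Time: time.Unix(sec, 0).UTC(), Value: n})
	}
	// Map iteration order is random, so sort the points by time.
	sort.Slice(points, func(i, j int) bool { return points[i].Time.Before(points[j].Time) })
	return points
}

func main() {
	// Hypothetical application log.
	lines := []string{
		"2019-01-30T09:00:05Z user login ok",
		"2019-01-30T09:00:48Z user login ok",
		"2019-01-30T09:01:10Z payment error",
		"2019-01-30T09:02:30Z user login ok",
	}
	for _, p := range countPerMinute(lines, "login") {
		fmt.Println(p)
	}
}
```

The same lines can produce several series (logins, errors, payments) just by changing what you count, which is why a few minutes of thinking about an application yields so many of them.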

Bonus point: all the resource usage that you can get from a server are time series. And you can visualize them with application stats to understand how a spike in error rate escalates your memory usage.

The below quote holds true: it’s very easy to do something complicated.

As developers, we know that everything that we did five years ago now seems so complicated. Our goal now is to make things simple. When you have something simple, it’s easier to explain to someone else and to maintain. That’s what I’m trying to do with time series: have just a simple value and a time, where the value is a number. With this kind of model, you can do some math, aggregate values, create a graph, and you have a less expensive way to extract information from your application. InfluxDB, compared with traditional and general-purpose tools like Cassandra, MySQL and MongoDB, is better suited to handle this kind of data because it provides features dedicated to this particular use case, like continuous queries and retention policies, in addition to an optimized set of features like series and compression.

Using InfluxDB for Log Storage

InfluxDB is a Time Series Database. You can use it to take all the information that your applications or servers generate and push it to this database. It’s a Go binary that you can download; it runs on Windows and Mac and is designed to be really easy to install and start. InfluxDB uses InfluxQL as its query language. It means that you can query this database with something that looks similar to something you are already familiar with: SQL. You don’t need to learn another language. Here’s a quick summary of reasons to choose InfluxDB.

Easy to get started with
Familiar query syntax
No external dependencies
Open source
Horizontally scalable
Member of a cohesive Time Series Platform

InfluxDB has a large user base and a large community. Used in combination with the other InfluxData platform components discussed below, it creates a full-stack monitoring system. InfluxDB supports irregular time series (events occurring at irregular time intervals) and regular time series (metrics occurring at regular time intervals), as shown below.

At InfluxData, we have a set of benchmarks to show why you need to pick a proper Time Series Database and not simply your favorite kind of database. The difference in write performance between InfluxDB and comparable databases is pretty big. Benchmarks are usually kind of opinionated, but we try to make them as objective as possible through independent testing. See our benchmarks comparing InfluxDB to Elasticsearch, MongoDB, Cassandra, and OpenTSDB.

Spinning Up A Modern Monitoring System

InfluxData has a full stack open-source set of projects that you can use—Telegraf, InfluxDB, Chronograf, and Kapacitor. These are collectively called the TICK Stack.

<figcaption> Complete stack to build your monitoring or event system</figcaption>

Telegraf is a metrics collection agent. It's also a Go binary that you can download and start. It's really easy. You install one Telegraf on every server and configure it to collect information from that server. Telegraf is plugin-based, with both input and output plugins. If you have a monitoring system already in place and you are looking to have a strong collector, you can use Telegraf.
InfluxDB is our storage engine. All the metrics that come from Telegraf are sent into InfluxDB.
Chronograf is a dashboard to manage and see all the data. From Chronograf, you can also manage InfluxDB and Kapacitor. If you choose not to use Chronograf, there are other projects implementing the InfluxDB output, including Grafana.
Kapacitor, the TICK Stack's native data processing engine, can be configured to listen to metrics and take proactive action on what is happening. It can process both stream and batch data from InfluxDB. You can send Kapacitor alerts to compatible incident management integrations. For example, Kapacitor can send a message to PagerDuty, and you can be called during the night if something is wrong, or it can send a message on Slack.

Starting InfluxDB and starting to play with the entire TICK Stack is really simple. You can just run some binaries or run some Docker containers. And you have a monitoring system up and running that you can use. But the real goal for a monitoring system is to tell you when your infrastructure or application is down. If your monitoring system goes down with your servers, this means that it’s not working. So you need to trust your monitoring system. You need to have it separated so far from your application, from your infrastructure, that you are 100% sure that it stays alive when your application and servers are down. It’s not a simple goal. You need to know that, and it’s not just a set of Docker-run commands.

<figcaption> Managing a monitoring system is not for everyone</figcaption>

Monitoring Kubernetes Architecture

Gianluca Arbezzano (InfluxData) - Mon, 19 Mar 2018 07:50:26 -0700

There are two important points you need to think about when monitoring a Kubernetes architecture. One is about the underlying resources, the bare metal Kubernetes is running on. The second is related to every service, ingress, and pod that you deployed. To have good visibility into your clusters you need to get metrics from both so that you can compare and reference these metrics. I am writing this article because at InfluxData, we are getting our hands dirty with Kubernetes and I think it is time to share some of the practices that we applied to our clusters to get your feedback (also because they are working pretty well).

You should have a totally dedicated namespace for monitoring. We called it monitoring:

kubectl create namespace monitoring

Do not deploy it in the default namespace. In general, the default namespace should always be empty. I am assuming that you are able to deploy InfluxDB and Chronograf on Kubernetes here, or this article will become an unreadable crappy YAML file.

Just a note about persistent volumes. InfluxDB, Kapacitor, and Chronograf store data on disk. This means that we need to be careful about how we manage them. Otherwise, our data will go away with the container. Kubernetes has a resource called Persistent Volume that helps you mount volumes based on where you are running your cluster. We are using AWS and we claim EBS volumes to manage /var/lib/influxdb and the other directories.

Now that you have your system running, we can use a DaemonSet to deploy Telegraf on every node. This Telegraf agent will take care of resources like iops, network, cpu, memory, disk and other services from the host. In order to do that, we need to share some directories from the host or Telegraf will end up monitoring the container instead of the host.

DaemonSet is a Kubernetes resource that distributes containers across all the nodes automatically. It’s very powerful if you need to deploy collectors for metrics or logs like we are doing now.

apiVersion: v1
kind: ConfigMap
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    k8s-app: telegraf
data:
  telegraf.conf: |+
    [global_tags]
      env = "$ENV"
    [agent]
      hostname = "$HOSTNAME"
    [[outputs.influxdb]]
      urls = ["$MONITOR_HOST"] # required
      database = "$MONITOR_DATABASE" # required

      timeout = "5s"
      username = "$MONITOR_USERNAME"
      password = "$MONITOR_PASSWORD"
      
    [[inputs.cpu]]
      percpu = true
      totalcpu = true
      collect_cpu_time = false
      report_active = false
    [[inputs.disk]]
      ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
    [[inputs.diskio]]
    [[inputs.kernel]]
    [[inputs.mem]]
    [[inputs.processes]]
    [[inputs.swap]]
    [[inputs.system]]
    [[inputs.docker]]
      endpoint = "unix:///var/run/docker.sock"
    [[inputs.kubernetes]]
      url = "http://1.1.1.1:10255"

---
# Section: Daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telegraf
  namespace: monitoring
  labels:
    k8s-app: telegraf
spec:
  selector:
    matchLabels:
      name: telegraf
  template:
    metadata:
      labels:
        name: telegraf
    spec:
      containers:
      - name: telegraf
        image: docker.io/telegraf:1.5.2
        resources:
          limits:
            memory: 500Mi
          requests:
            cpu: 500m
            memory: 500Mi
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: "HOST_PROC"
          value: "/rootfs/proc"
        - name: "HOST_SYS"
          value: "/rootfs/sys"
        - name: ENV
          valueFrom:
            secretKeyRef:
              name: telegraf
              key: env
        - name: MONITOR_USERNAME
          valueFrom:
            secretKeyRef:
              name: telegraf
              key: monitor_username
        - name: MONITOR_PASSWORD
          valueFrom:
            secretKeyRef:
              name: telegraf
              key: monitor_password
        - name: MONITOR_HOST
          valueFrom:
            secretKeyRef:
              name: telegraf
              key: monitor_host
        - name: MONITOR_DATABASE
          valueFrom:
            secretKeyRef:
              name: telegraf
              key: monitor_database
        volumeMounts:
        - name: sys
          mountPath: /rootfs/sys
          readOnly: true
        - name: proc
          mountPath: /rootfs/proc
          readOnly: true
        - name: docker-socket
          mountPath: /var/run/docker.sock
          readOnly: true
        - name: utmp
          mountPath: /var/run/utmp
          readOnly: true
        - name: config
          mountPath: /etc/telegraf
      terminationGracePeriodSeconds: 30
      volumes:
      - name: sys
        hostPath:
          path: /sys
      - name: docker-socket
        hostPath:
          path: /var/run/docker.sock
      - name: proc
        hostPath:
          path: /proc
      - name: utmp
        hostPath:
          path: /var/run/utmp
      - name: config
        configMap:
          name: telegraf

As you can see, there are some environment variables required by the ConfigMap and Telegraf, so I used a Secret to inject them. To create it, run this command, replacing the values with your own:

kubectl create secret -n monitoring generic telegraf --from-literal=env=prod --from-literal=monitor_username=youruser --from-literal=monitor_password=yourpassword --from-literal=monitor_host=https://your.influxdb.local --from-literal=monitor_database=yourdb

There is a parameter called env that is set to prod for this example. We set this variable on every instance of Telegraf and it identifies the cluster. If you replicate environments as we do, you can create the same dashboard on Chronograf and use template variables to switch between clusters.

Now if you did everything right, you will be able to see hosts and points stored on InfluxDB and Chronograf. This is just the first phase: we now have visibility into the hosts, but we don’t know anything about the services that we are running.

Telegraf Sidecar

There are different ways to address this problem, but the one we are using is called a sidecar. This term became popular recently in networking and routing mesh, and it’s similar to what we are doing.

Let’s assume that you need etcd because one of your applications uses it as storage. On k8s it will be a StatefulSet like this:

apiVersion: v1
data:
  telegraf.conf: |+
    [global_tags]
      env = "$ENV"
    [[inputs.prometheus]]
      urls = ["http://localhost:2379/metrics"]
    [agent]
      hostname = "$HOSTNAME"
    [[outputs.influxdb]]
      urls = ["$MONITOR_HOST"]
      database = "mydb"
      write_consistency = "any"
      timeout = "5s"
      username = "$MONITOR_USERNAME"
      password = "$MONITOR_PASSWORD"
kind: ConfigMap
metadata:
  name: telegraf-etcd-config
  namespace: myapp
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  namespace: "myapp"
  name: "etcd"
  labels:
    component: "etcd"
spec:
  serviceName: "etcd"
  # changing replicas value will require a manual etcdctl member remove/add
  # command (remove before decreasing and add after increasing)
  replicas: 3
  template:
    metadata:
      name: "etcd"
      labels:
        component: "etcd"
    spec:
      volumes:
        - name: telegraf-etcd-config
          configMap:
            name: telegraf-etcd-config
      containers:
      - name: "telegraf"
        image: "docker.io/library/telegraf:1.4"
        volumeMounts:
          - name: telegraf-etcd-config
            mountPath: /etc/telegraf
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MONITOR_HOST
          valueFrom:
            secretKeyRef:
              name: monitor
              key: monitor_host
        - name: MONITOR_USERNAME
          valueFrom:
            secretKeyRef:
              name: monitor
              key: monitor_username
        - name: MONITOR_PASSWORD
          valueFrom:
            secretKeyRef:
              name: monitor
              key: monitor_password
        - name: ENV
          valueFrom:
            secretKeyRef:
              name: monitor
              key: env
      - name: "etcd"
        image: "quay.io/coreos/etcd:v3.2.9"
        ports:
        - containerPort: 2379
          name: client
        - containerPort: 2380
          name: peer
        env:
        - name: CLUSTER_SIZE
          value: "3"
        - name: SET_NAME
          value: "etcd"
        command:
          - "/bin/sh"
          - "-ecx"
          - |
            IP=$(hostname -i)
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
              while true; do
                echo "Waiting for ${SET_NAME}-${i}.${SET_NAME} to come up"
                ping -W 1 -c 1 ${SET_NAME}-${i}.${SET_NAME} > /dev/null && break
                sleep 1s
              done
            done
            PEERS=""
            for i in $(seq 0 $((${CLUSTER_SIZE} - 1))); do
                PEERS="${PEERS}${PEERS:+,}${SET_NAME}-${i}=http://${SET_NAME}-${i}.${SET_NAME}:2380"
            done
            # start etcd. If cluster is already initialized the `--initial-*` options will be ignored.
            exec etcd --name ${HOSTNAME} \
              --listen-peer-urls http://${IP}:2380 \
              --listen-client-urls http://${IP}:2379,http://127.0.0.1:2379 \
              --advertise-client-urls http://${HOSTNAME}.${SET_NAME}:2379 \
              --initial-advertise-peer-urls http://${HOSTNAME}.${SET_NAME}:2380 \
              --initial-cluster-token etcd-cluster-1 \
              --initial-cluster ${PEERS} \
              --initial-cluster-state new \
              --data-dir /var/run/etcd/default.etcd

As you can see, there is a lot more. There is a config map and there are two containers deployed under the same pod: etcd and Telegraf. Containers under the same pod share the network namespace, so resolving etcd from Telegraf is as easy as calling http://localhost:2379/metrics. etcd exposes Prometheus-like metrics and you can use the Telegraf input plugin to grab them.

apiVersion: v1
data:
  telegraf.conf: |+
    [global_tags]
      env = "$ENV"
    [[inputs.prometheus]]
      urls = ["http://localhost:2379/metrics"]
    [agent]
      hostname = "$HOSTNAME"
    [[outputs.influxdb]]
      urls = ["$MONITOR_HOST"]
      database = "mydb"
      write_consistency = "any"
      timeout = "5s"
      username = "$MONITOR_USERNAME"
      password = "$MONITOR_PASSWORD"
kind: ConfigMap
metadata:
  name: telegraf-etcd-config
  namespace: myapp
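
To make the scrape a bit more concrete, here is a rough Python sketch of what reading a Prometheus-style endpoint involves: turning text-format samples into a name, a set of labels, and a value. This is not how Telegraf implements its plugin. The parser is deliberately simplified (no timestamps, no escaped quotes in label values), and the metric names in the sample payload are made up.

```python
import re

# Matches `name{label="v",...} value` or `name value`. This is a simplified
# reading of the Prometheus text format, not a complete parser.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Made-up payload in the shape a /metrics endpoint returns.
EXAMPLE = """\
# HELP example_requests_total Hypothetical request counter.
# TYPE example_requests_total counter
example_requests_total{method="get",code="200"} 42
example_up 1
"""
```

Calling parse_metrics(EXAMPLE) yields one tuple per sample line, with the labels as a dictionary, which is roughly the shape a collector needs before it can turn samples into points.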

Let’s assume that your Go application pushes metrics to InfluxDB using our SDK. You can deploy Telegraf in the same pod, as we did for etcd, this time using the HTTP listener input plugin. This plugin is powerful because it exposes an InfluxDB-compatible HTTP layer: when you point your app to localhost:8086, you end up speaking with Telegraf without touching any code.

Telegraf as a middleman between your app and InfluxDB is a plus because it batches requests, optimizing network traffic and reducing load on InfluxDB. Another optimization, although it requires a bit of code, is to move your application from TCP to UDP. The SDK supports both methods, and you can use the socket_listener input plugin from Telegraf.

It means that your application will speak over UDP to Telegraf. Because they share the network namespace, packet loss will be minimized and your application will be faster. Telegraf will then send the points over TCP to InfluxDB, so you can be confident that the points it receives will land in InfluxDB. Bonus: If Telegraf goes down for some reason, your application won’t crash because UDP doesn’t care! Your application will work as usual, but won’t store any points. If this scenario works for you, that’s great!
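
As a minimal sketch of what "InfluxDB-compatible" means on the wire, the Python below builds the kind of request an InfluxDB 1.x client sends: a POST to /write with the database in the query string and line protocol in the body. The function names are my own, and it uses only the standard library rather than any SDK. The base URL is a parameter, so the same code works whether InfluxDB or a Telegraf listener answers on that address.

```python
import urllib.parse
import urllib.request


def build_write_request(base_url, database, lines):
    """Build an InfluxDB 1.x style /write request carrying line protocol."""
    query = urllib.parse.urlencode({"db": database})
    body = "\n".join(lines).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/write?{query}",
        data=body,
        method="POST",
    )


def write_points(base_url, database, lines, timeout=5):
    """Send the points and return the HTTP status code of the response."""
    request = build_write_request(base_url, database, lines)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the sidecar in place, something like write_points("http://localhost:8086", "mydb", ["requests,env=prod count=1i"]) would reach Telegraf instead of InfluxDB, assuming the listener is bound to that port.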
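
Here is a small Python sketch of that fire-and-forget behavior, using only the standard library. The function name and the default port 8094 are assumptions for illustration; use whatever address your socket listener is actually bound to.

```python
import socket


def send_point_udp(line, host="127.0.0.1", port=8094):
    """Fire-and-forget one line-protocol point over UDP.

    Returns the number of bytes handed to the kernel. Nothing confirms
    delivery: if no listener is running, the point is silently dropped.
    """
    payload = (line + "\n").encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

The call returns as soon as the datagram is handed off, so a missing or restarting Telegraf does not block or crash the caller. The cost is that points sent in the meantime are lost.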

The benefit of using Telegraf as a sidecar to monitor distributed applications on Kubernetes is that the monitoring configuration for your services stays close to the application specification, so deployment is simple. And because Telegraf shares the pod with your application, service discovery is as easy as calling localhost.

Service discovery is usually a problem in this environment: if you have one collector, the containers change and you won’t know where they will be. Configuration can be tricky, but with this architecture, Telegraf will follow your application forever.

Percona Live Dublin recap

Gianluca Arbezzano254 (InfluxData) — Thu, 09 Nov 2017 09:00:27 -0700

On September 26, 2017 I was a speaker at Percona Live in Dublin. It was a huge event with more than 140 speakers, with tracks covering MySQL, MongoDB, Elasticsearch, and MyRocks, plus use cases from Cloudflare, Facebook, Percona, and InfluxData on how to successfully build or manage large databases.

I was in the time series track, or at least that’s what I called it. Besides my talk on InfluxDB internals, Daniel Lee from Grafana Labs spoke about how to build and visualize data with Grafana; Brian Brazil, a core contributor to Prometheus and founder of Robust Perception, spoke about the new Prometheus TSDB and Prometheus 2.0; and Roman Vynar from Quiq spoke about Using Prometheus with InfluxDB for Metrics Storage.

For me, it was a chance to understand how many people are currently using a Time Series Database. A lot of the attendees were managing big MySQL or Oracle clusters, and I expected to hear a lot of strong opinions about how easily a traditional engine can be adapted to store events and time series.

Surprisingly, it didn’t happen. Instead, a lot of people were curious about the advantages that an engine designed to store a particular kind of data can offer, and during my talk I spoke about some of them: compression algorithms, retention policies, and sharding. I got a lot of good feedback and impressions about this topic. My talk wasn’t recorded, but InfluxData cofounder Paul Dix wrote and gave the same presentation at the Carnegie Mellon University (CMU) Database Group. The video is available, and you can have a look.

Speaking with some InfluxDB users, I got some good ideas about what we as a community need, like a better backup solution and more Kapacitor use cases. It was a good conference because I had the chance to speak with database administrators who are working with more traditional databases. They have a lot of good stories and scenarios that we can cover to constantly make InfluxDB easier to run and maintain.

Check out my slides from my talk.

OpenTracing: An Open Standard for Distributed Tracing

Gianluca Arbezzano254 (InfluxData) — Wed, 25 Oct 2017 09:05:44 -0700

Logs help you understand what your application is doing. Almost every application generates its own logs on the server that hosts it. But in a modern distributed system and in a microservices environment, logs are just not enough.

We can use services like Elasticsearch to gather all of an application’s logs together, which is simple enough when you’re dealing with just one application. But nowadays, an application is more like a collection of services, each of which generates its own logs. Since each log records the actions and events of only its own service, how do we span these services to get insight into the application as a whole?

Questions like this one are a testament to why distributed tracing is becoming a requirement for effective monitoring, prompting the need for a tracing standard.

OpenTracing in Theory

Although tracing offers visibility into an application as processes grow in number, instrumenting a system for tracing has thus far been labor-intensive and complex. The OpenTracing standard changes that, enabling the instrumentation of applications for distributed tracing with minimal effort.

In October 2016, OpenTracing became a project under the guidance of the Cloud Native Computing Foundation. Under the CNCF’s stewardship, OpenTracing aims to be an open, vendor-neutral standard for distributed systems instrumentation. It offers a way for developers to follow the thread — to trace requests from beginning to end across touchpoints and understand distributed systems at scale.

In OpenTracing, a trace tells the story of a transaction or workflow as it propagates through a distributed system. The concept of the trace borrows a tool from the scientific community called a directed acyclic graph (DAG), which stages the parts of a process from a clear start to a clear end. Certain groups of steps or spans in-between may be repeatable, but never indefinitely like a “do loop” without an exit condition.

So a trace in this context is a DAG made up of spans — named, timed operations representing contiguous segments of work in that trace. Each component in a distributed trace will contribute its own spans.

Now that you have a bit of background, the following definitions should make more sense: A trace is a set of spans that share a common root. For OpenTracing, a trace is built by collecting all spans that share a TraceId.
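
As a minimal sketch of that definition, the Python below groups spans into traces by their shared trace ID and finds the common root. The Span fields are a simplified model I made up for illustration, not the data model of any particular tracer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    operation: str


def build_traces(spans):
    """Group spans into traces keyed by their shared trace ID."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def root_of(trace):
    """Return the span with no parent, i.e. the common root of the trace."""
    roots = [span for span in trace if span.parent_id is None]
    if len(roots) != 1:
        raise ValueError("expected exactly one root span")
    return roots[0]
```

Spans can arrive from different services in any order. As long as each one carries the trace ID, the collector can reassemble the whole request afterwards.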

In this instance, a span is a set of annotations that correspond to a particular remote procedure call. Each span represents a unit of time and has its own log. The span context is a key/value store attached to a specific span, where you can record details that help you understand the events the span refers to. Basically, tracing is about spans, inter-process propagation, and active span management.
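
Inter-process propagation is the least obvious of those three, so here is a rough Python sketch of the idea: the caller injects its span context into the outgoing request headers, and the callee extracts it to start a child span in the same trace. The header names and helper functions are hypothetical. Real tracers define their own wire formats.

```python
import secrets
from dataclasses import dataclass

# Hypothetical header names, chosen only for this example.
TRACE_HEADER = "x-example-trace-id"
SPAN_HEADER = "x-example-span-id"


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str


def inject(context, headers):
    """Copy the caller's context into the outgoing request headers."""
    headers[TRACE_HEADER] = context.trace_id
    headers[SPAN_HEADER] = context.span_id


def extract(headers):
    """Rebuild the caller's context on the receiving side, if present."""
    trace_id = headers.get(TRACE_HEADER)
    span_id = headers.get(SPAN_HEADER)
    if trace_id is None or span_id is None:
        return None
    return SpanContext(trace_id, span_id)


def start_span(parent=None):
    """Start a child of `parent`, or a brand new trace if there is none."""
    trace_id = parent.trace_id if parent else secrets.token_hex(8)
    return SpanContext(trace_id, secrets.token_hex(8))
```

The service that receives the request calls extract on the incoming headers and passes the result to start_span, so its span lands in the same trace as the caller's.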

Why OpenTracing Adoption Is Growing

In microservices architectures, there are more applications communicating with each other than ever before. While application performance monitoring is great for debugging inside a single app, as a system expands into multiple services, how can you understand how much time each service is taking, where the exception happens, and the overall health of your system? In particular, how do you measure network latency between services, such as how long a request takes to travel from one app to another?
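
One way to approach those questions with spans, sketched in Python under simplifying assumptions: each span records its own start and end on its own host, and the time a call spends outside the callee is estimated as the client-side duration minus the server-side duration. The type and function names here are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimedSpan:
    service: str
    start_us: int  # start time in microseconds, on the recording host
    end_us: int    # end time in microseconds, on the recording host

    @property
    def duration_us(self):
        return self.end_us - self.start_us


def time_per_service(spans):
    """Sum the wall-clock time recorded by each service's spans."""
    totals = {}
    for span in spans:
        totals[span.service] = totals.get(span.service, 0) + span.duration_us
    return totals


def network_overhead_us(client_span, server_span):
    """Estimate time spent outside the callee for one remote call."""
    return client_span.duration_us - server_span.duration_us
```

Comparing durations rather than absolute timestamps sidesteps clock differences between hosts, since each duration is measured on a single machine.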

Enter distributed tracing instrumentation. With the higher-level distribution of services that takes place in a cloud-based environment, tracing will become a key part of the cloud infrastructure supporting those services.

If you’ve ever used the Firefox browser in development, you know that when you open its Browser Console (Ctrl + Shift + J), you can see all the components currently being executed in the cache, and their current operating status. In a sense, that’s a kind of trace.

Hopefully, the need for a tracing standard for server-side services is as obvious as the need for one on the client side. We need OpenTracing specifically because there are different languages and different libraries, each of which may use its own instrumentation, may send different data and may access its own database. So you rarely have a single trace from any single component. This fact is what gives rise to the need for a common language for the instrumentation of application code, library code, and all kinds of systems.

OpenTracing Use Cases

The OpenTracing documentation offers this candidate for a common definition for tracing: “a thin standardization layer that sits between application/library code and various systems that consume tracing and causality data.” As a standard mechanism for describing system behavior, OpenTracing would thereby serve as a way for applications, libraries, services, and frameworks to “describe and propagate distributed traces without knowledge of the underlying OpenTracing implementation.” Here’s where its value resides.

As discussed on GitHub, common use cases of OpenTracing include:

Microservices — for instance, reconstructing the journey that transactions take through a microservices architecture.
Caching — troubleshooting to determine whether a request is hitting the cache.
Arbitration — for example, tracing the full history of a single process and determining its behavior when multiple services contact it in parallel rather than sequentially.
Message bus monitoring — determining the proper spans and distributions of messages in a queue, to ensure they're triggering the proper series of events, and also to make certain messages are brief, discrete, and never the sources of data leaks.

What InfluxData Is Doing with OpenTracing

Recognizing the need to simplify troubleshooting in microservice platforms, InfluxData decided to add tracing functionality to Telegraf through its Zipkin plugin. Zipkin is a distributed tracing system that helps gather the timing data needed to troubleshoot latency problems common with microservices.

Zipkin can use Cassandra as a backing store for its traces. We discovered it would be useful for Telegraf to collect the traces, then store their data in InfluxDB, a native Time Series Database. Since these traces are all timestamped, InfluxDB is a better choice for storing them. It’s optimized for time series data and built from the ground up for metrics and events. If you are already storing metrics in InfluxDB, it makes sense to store your traces there too, especially because you can then manipulate/cross-analyze traces with other metrics using Kapacitor, InfluxData’s native processing engine.
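
To show why a timestamped span maps naturally onto a time series point, here is a small Python sketch that renders a span as InfluxDB line protocol. The measurement name, tag names, and the duration_ns field are illustrative choices and not necessarily the schema the Telegraf plugin writes. The measurement is assumed to be a plain identifier that needs no escaping.

```python
def escape_tag(value):
    """Escape the characters line protocol treats as special in tags."""
    for char in (",", "=", " "):
        value = value.replace(char, "\\" + char)
    return value


def span_to_line(measurement, tags, duration_ns, timestamp_ns):
    """Render one span as a line-protocol point.

    Tags are sorted by key; the duration is written as an integer field
    and the span's start time becomes the point's timestamp.
    """
    tag_part = ",".join(
        f"{escape_tag(key)}={escape_tag(val)}"
        for key, val in sorted(tags.items())
    )
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} duration_ns={int(duration_ns)}i {int(timestamp_ns)}"
```

Once spans are stored as points like this, the trace ID and service name are ordinary tags, so the same queries and dashboards you use for metrics can filter and group traces too.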

At InfluxData, we use what we build. To validate our own theories, we’re implementing OpenTracing in our InfluxDB Cloud service. We’ll soon be sharing some of the details on the implementations, as well as how it has helped us in troubleshooting.

OpenTracing is getting much attention from companies because developers want to know what’s happening in their applications. Through OpenTracing, developers are able to understand where each request started, where it is going, and what’s happening across its journey. Having more knowledge lets them take the appropriate action.

You can learn more by watching this recent InfluxData OpenTracing webinar.