Ceph is a free-software storage platform, implementing object storage on a single distributed computer cluster, and providing interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, to be scalable to the exabyte level, and freely available.

Why use a Telegraf plugin for Ceph Storage?

Ceph uniquely delivers object, block, and file storage in one unified system. Ceph has become popular for being open source and free to use, and is favored by Kubernetes users for being highly reliable and easy to manage. Ceph delivers extraordinary scalability:

  • A Ceph Node leverages commodity hardware and intelligent daemons.
  • A Ceph Storage Cluster accommodates large numbers of nodes that communicate with each other to replicate and redistribute data dynamically.

The Ceph Storage Cluster receives data from Ceph Clients – whether it comes through a Ceph Block Device, Ceph Object Storage, the Ceph Filesystem, or a custom implementation you create using librados – and stores the data as objects.

Monitoring your Ceph Storage infrastructure is as important as monitoring the containers that your applications run in. You can use the Ceph Storage Telegraf Plugin to collect metrics that will help you with monitoring your Ceph Storage infrastructure.

How to monitor your Ceph Storage infrastructure using the Ceph Storage Telegraf Plugin

Configuring the Ceph Storage Telegraf Plugin is simple. Point it at the directory containing the MON and OSD admin socket files and set the socket filename prefixes so the plugin can tell the socket types apart. Once configured, it collects performance metrics from the MON and OSD nodes in a Ceph Storage cluster.
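A minimal configuration sketch, based on the plugin's sample configuration (the paths and prefixes shown are the common defaults and may differ in your environment or Telegraf version):

```toml
[[inputs.ceph]]
  ## Frequency of data collection
  interval = "60s"

  ## Path to the ceph binary used for admin-socket queries
  ceph_binary = "/usr/bin/ceph"

  ## Directory scanned for MON/OSD admin socket files
  socket_dir = "/var/run/ceph"

  ## Filename prefixes used to classify the socket type
  mon_prefix = "ceph-mon"
  osd_prefix = "ceph-osd"

  ## Suffix of admin socket files
  socket_suffix = "asok"

  ## Collect admin socket stats from the local daemons
  gather_admin_socket_stats = true
```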

Admin Socket Stats

This gatherer works by scanning the configured SocketDir for OSD, MON, MDS and RGW socket files. When it finds a MON socket, it runs ceph --admin-daemon $file perfcounters_dump. For OSDs it runs ceph --admin-daemon $file perf dump.

The resulting JSON is parsed and grouped based on a top-level key. Top-level keys are used as collection tags, and all sub-keys are flattened. For example:

"paxos": {
"refresh": 9363435,
"refresh_latency": {
"avgcount": 9363435,
"sum": 5378.794002000

Would be parsed into the following metrics, all of which would be tagged with collection=paxos:

  • refresh = 9363435
  • refresh_latency.avgcount = 9363435
  • refresh_latency.sum = 5378.794002000
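The flattening scheme above can be sketched in Python (the plugin itself is written in Go; this is only an illustration of the grouping and key-flattening logic, not the plugin's actual code):

```python
import json

def flatten(prefix, value, out):
    """Recursively flatten nested dicts into dot-joined metric names."""
    if isinstance(value, dict):
        for key, sub in value.items():
            flatten(f"{prefix}.{key}" if prefix else key, sub, out)
    else:
        out[prefix] = value

# Abbreviated perf-counter JSON, as in the example above
raw = json.loads("""
{"paxos": {"refresh": 9363435,
           "refresh_latency": {"avgcount": 9363435,
                               "sum": 5378.794002000}}}
""")

# Each top-level key becomes the "collection" tag; sub-keys are flattened
for collection, fields in ((k, {}) for k in raw):
    flatten("", raw[collection], fields)
    print(collection, fields)
# paxos {'refresh': 9363435, 'refresh_latency.avgcount': 9363435, 'refresh_latency.sum': 5378.794002}
```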

Cluster Stats

This gatherer works by invoking Ceph commands against the cluster, so it only requires the ceph client, a valid Ceph configuration, and an access key to function (the ceph_config and ceph_user configuration variables work together to specify these prerequisites). It may be run on any server that has access to the cluster. The currently supported commands are:

  • ceph status
  • ceph df
  • ceph osd pool stats
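A configuration sketch for cluster-stats mode, again based on the plugin's sample configuration (the user and paths are illustrative; client.admin is a common but not required choice):

```toml
[[inputs.ceph]]
  ## Collect cluster-wide stats via the ceph CLI instead of admin sockets
  gather_admin_socket_stats = false
  gather_cluster_stats = true

  ## Prerequisites for running ceph status / ceph df / ceph osd pool stats
  ceph_binary = "/usr/bin/ceph"
  ceph_config = "/etc/ceph/ceph.conf"
  ceph_user = "client.admin"
```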

Key Ceph Storage metrics to use for monitoring

Some of the important Ceph Storage metrics that you should proactively monitor include:

  • Ceph cluster health status
  • Quorum of online monitor nodes
  • Status of OSD nodes (for example, an OSD that is down but still marked in)
  • Capacity utilization, whether the whole cluster or individual nodes are nearing full
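As a sketch, the metrics above can be pulled out of ceph status --format json output like this (the sample JSON is abbreviated and illustrative; field layout can vary between Ceph releases):

```python
import json

# Abbreviated, illustrative `ceph status --format json` output
sample = json.loads("""
{
  "health": {"status": "HEALTH_WARN"},
  "monmap": {"num_mons": 3},
  "quorum": [0, 1, 2],
  "osdmap": {"num_osds": 12, "num_up_osds": 11, "num_in_osds": 12}
}
""")

# Cluster health status
health = sample["health"]["status"]

# Quorum of online monitor nodes
mons_in_quorum = len(sample["quorum"])

# OSDs that are down but still "in" (data not yet rebalanced away)
osds_down_but_in = sample["osdmap"]["num_in_osds"] - sample["osdmap"]["num_up_osds"]

print(health, mons_in_quorum, osds_down_but_in)
```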

Recommended ways to install Ceph

There are several different ways to install Ceph. The officially recommended ways are to use either Cephadm, which deploys and manages a Ceph cluster by connecting to hosts from the manager daemon via SSH, or Rook. Rook is a set of storage operators for Kubernetes. It deploys and manages Ceph clusters running in Kubernetes, while also enabling management of storage resources and provisioning via Kubernetes APIs.

When it comes to monitoring the cluster, Ceph can deploy the whole monitoring stack, including Prometheus, Prometheus exporters, Alertmanager, etc. In that case, using the Prometheus input plugin in Telegraf is the more appropriate way to connect to and collect all Prometheus metrics from the Ceph Manager module service endpoint running in Kubernetes.
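A sketch of that alternative setup (the hostname is a placeholder; 9283 is the default port of the Ceph Manager Prometheus module, but verify it in your deployment):

```toml
[[inputs.prometheus]]
  ## Scrape the Ceph Manager Prometheus module endpoint
  urls = ["http://ceph-mgr.example.com:9283/metrics"]
```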

For more information, please check out the documentation.


Related Resources

Kubernetes Monitoring

Read this blueprint to gain real-time visibility into your entire container-based environment to unify all your metrics and events for faster root cause analysis.

Application Performance Monitoring (APM)

APM helps you maintain a flawless user experience with responsive applications in a dynamic application environment of continuous integration and delivery.

Kubernetes Monitoring Template

Try this Kubernetes Template in your InfluxDB Cloud instance to quickly start ensuring that your Kubernetes clusters utilize resources efficiently.
