Ceph Storage Platform Monitoring


Ceph is a free-software storage platform that implements object storage on a single distributed computer cluster and provides interfaces for object-, block- and file-level storage. Ceph aims primarily for completely distributed operation without a single point of failure, scalability to the exabyte level, and free availability.

With regard to object storage, Ceph is built to give users seamless access to objects through native language bindings, or through radosgw, a REST interface that is natively compatible with applications written for both the S3 and Swift APIs. In terms of block storage, Ceph's RADOS Block Device (also known as RBD) offers access to block device images that are striped and replicated across the entire storage cluster.

Finally, Ceph offers users a fully POSIX-compliant network file system (called CephFS) that delivers high performance, large volumes of data storage, and the ability to maximize compatibility with legacy applications, all at the exact same time.

Why use a Telegraf plugin for Ceph Storage?

Ceph uniquely delivers object, block, and file storage in one unified system. Ceph has become popular for being open source and free to use, and is favored by Kubernetes users for being highly reliable and easy to manage. Ceph delivers extraordinary scalability:

  • A Ceph Node leverages commodity hardware and intelligent daemons.
  • A Ceph Storage Cluster accommodates large numbers of nodes that communicate with each other to replicate and redistribute data dynamically.

The Ceph Storage Cluster receives data from Ceph Clients - whether it comes through a Ceph Block Device, Ceph Object Storage, the Ceph Filesystem, or a custom implementation you create using librados - and stores the data as objects.

Monitoring your Ceph Storage infrastructure is as important as monitoring the containers that your applications run in. You can use the Ceph Storage Telegraf Plugin to collect metrics that will help you monitor your Ceph Storage infrastructure.

In addition to letting you check the health status of your environment at a moment's notice, the Ceph Telegraf Plugin alerts you immediately when online monitor nodes fail to reach quorum. This can help avoid a deadlock, the type of disruptive event you definitely want to avoid. Likewise, monitoring Ceph will alert you to situations that need immediate attention, such as OSD nodes that are down but still marked "in" (still considered to be participating in the cluster) for more than five consecutive minutes. In that situation, Ceph is likely having trouble recovering from the node loss, and monitoring can help you get things back up and running as quickly as possible.

If you are running a small Ceph cluster for some type of non-essential application, you can likely get by with the built-in monitoring tools that come with it. If you're running it as part of a production environment, however, you'll want the robust monitoring capabilities that the Telegraf plugin for Ceph Storage offers.

How to monitor your Ceph Storage infrastructure using the Ceph Storage Telegraf Plugin

Configuring the Ceph Storage Telegraf Plugin is simple. Point the plugin at the directory containing the MON and OSD admin socket files, and set the filename prefixes it uses to determine each socket's type. Once configured, the plugin collects performance metrics from the MON and OSD nodes in a Ceph Storage cluster.
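A minimal configuration might look like the following sketch, based on the plugin's documented options; adjust the paths and prefixes to match your deployment:

```toml
[[inputs.ceph]]
  ## Directory that the plugin scans for admin socket files
  socket_dir = "/var/run/ceph"

  ## Filename prefixes used to classify each socket by daemon type
  mon_prefix = "ceph-mon"
  osd_prefix = "ceph-osd"
  socket_suffix = "asok"

  ## Client binary, user, and config used for cluster-wide commands
  ceph_binary = "/usr/bin/ceph"
  ceph_user = "client.admin"
  ceph_config = "/etc/ceph/ceph.conf"

  ## Which gatherers to run
  gather_admin_socket_stats = true
  gather_cluster_stats = false
```

Note that gather_admin_socket_stats must run on the Ceph nodes themselves (it reads local socket files), while gather_cluster_stats can run anywhere with cluster access.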

Admin Socket Stats

This gatherer works by scanning the configured SocketDir for OSD, MON, MDS and RGW socket files. When it finds a MON socket, it runs ceph --admin-daemon $file perfcounters_dump. For OSDs it runs ceph --admin-daemon $file perf dump.

The resulting JSON is parsed and grouped based on a top-level key. Top-level keys are used as collection tags, and all sub-keys are flattened. For example:

   "paxos": {
     "refresh": 9363435,
     "refresh_latency": {
       "avgcount": 9363435,
       "sum": 5378.794002000
     }
   }
Would be parsed into the following metrics, all of which would be tagged with collection=paxos:

  • refresh = 9363435
  • refresh_latency.avgcount = 9363435
  • refresh_latency.sum = 5378.794002000
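The flattening described above can be sketched in Python. This is a simplified illustration of the idea, not the plugin's actual Go implementation:

```python
import json

def flatten(prefix, value, out):
    """Recursively flatten nested dicts into dotted keys."""
    if isinstance(value, dict):
        for key, sub in value.items():
            flatten(f"{prefix}.{key}" if prefix else key, sub, out)
    else:
        out[prefix] = value

# Sample perf-counter dump, as in the example above
dump = json.loads("""
{
  "paxos": {
    "refresh": 9363435,
    "refresh_latency": {
      "avgcount": 9363435,
      "sum": 5378.794002
    }
  }
}
""")

# Each top-level key becomes the "collection" tag; sub-keys are flattened.
for collection, counters in dump.items():
    fields = {}
    flatten("", counters, fields)
    print(collection, fields)
```

Running this prints one line per collection, with the dotted field names shown in the bullet list above.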

Cluster Stats

This gatherer works by invoking Ceph commands against the cluster, so it only requires the ceph client, a valid Ceph configuration, and an access key (the ceph_config and ceph_user configuration variables work in conjunction to specify these prerequisites). It may be run on any server that has access to the cluster. The currently supported commands are:

  • ceph status
  • ceph df
  • ceph osd pool stats
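The same approach can be sketched in Python: run the command with JSON output and pull out the health, quorum, and OSD counts. The JSON below is an abridged, hypothetical sample; real `ceph status` output has many more fields and its layout can vary between Ceph releases:

```python
import json
import subprocess

def cluster_status(sample_json=None):
    """Fetch `ceph status` as JSON, or parse a provided sample string."""
    if sample_json is None:
        # Requires the ceph client plus a valid ceph.conf and keyring,
        # just like the Telegraf cluster-stats gatherer.
        sample_json = subprocess.check_output(
            ["ceph", "status", "--format", "json"]
        )
    return json.loads(sample_json)

# Abridged sample output (an assumption for illustration only)
sample = """
{
  "health": {"status": "HEALTH_OK"},
  "quorum_names": ["mon.a", "mon.b", "mon.c"],
  "osdmap": {"num_osds": 12, "num_up_osds": 12, "num_in_osds": 12}
}
"""

status = cluster_status(sample)
print(status["health"]["status"])               # overall health
print(len(status["quorum_names"]))              # monitors in quorum
osds = status["osdmap"]
print(osds["num_osds"] - osds["num_up_osds"])   # OSDs currently down
```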

Key Ceph Storage metrics to use for monitoring

Some of the important Ceph Storage metrics that you should proactively monitor include:

  • Ceph cluster health status
  • Quorum of online monitor nodes
  • Status of OSD nodes (for example, down but still in)
  • Reaching capacity status of whole cluster or some nodes

Recommended ways to install Ceph

There are several different ways to install Ceph. The officially recommended options are Cephadm, which deploys and manages a Ceph cluster by connecting to hosts from the manager daemon via SSH, and Rook. Rook is a set of storage operators for Kubernetes. It deploys and manages Ceph clusters running in Kubernetes, while also enabling management and provisioning of storage resources via Kubernetes APIs.

When it comes to monitoring the cluster, Ceph can deploy a complete monitoring stack, including Prometheus, Prometheus exporters, Alertmanager, etc. In that case, using the Prometheus input plugin in Telegraf is more appropriate for connecting to and collecting all Prometheus metrics from the Ceph Manager module's service endpoint running in Kubernetes.

For more information, please check out the documentation.

