We often get asked which metrics you should collect and act on for the various products that our Telegraf plugins support. Of course, we tell people that it depends on the product itself: is it a web server, where you could collect metrics about throughput and latency, or a database, where you want to track capacity or latency? But responding to a customer’s question with a question isn’t really that useful, so I am happy to let you know that our friends at Graylog wrote a nice blog post on monitoring your Graylog server with InfluxData to keep it operational. This is especially important because the whole point of keeping a Graylog server operational is to keep your log data available and accessible, so it can help you discover and resolve issues!
In particular, they recommend that you review five core metrics (disk I/O utilization, available disk space, CPU usage, memory usage, and available file descriptors) to ensure your system is in an acceptable state. They also mention that the most important of the five is available disk space, since you could run into some serious trouble if you run out.
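To make the idea of "an acceptable state" a bit more concrete, here is a minimal Python sketch of how readings for those five metrics might be compared against limits. It is only an illustration: the metric names, the threshold values, and the `find_problems` and `disk_free_percent` helpers are all assumptions of ours, not something taken from the Graylog post or from Telegraf, so tune them for your own system.

```python
import shutil

# Hypothetical limits, chosen only for illustration.
# "min" means the value must stay at or above the limit,
# "max" means it must stay at or below the limit.
# Disk space is listed first because it is called out as the most important.
THRESHOLDS = {
    "disk_free_percent": ("min", 15.0),
    "disk_io_util_percent": ("max", 90.0),
    "cpu_usage_percent": ("max", 85.0),
    "memory_usage_percent": ("max", 90.0),
    "free_file_descriptors": ("min", 1024),
}


def disk_free_percent(path="/"):
    """Return the percentage of free disk space for the given path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def find_problems(readings, thresholds=THRESHOLDS):
    """Return the names of the metrics whose readings violate their limit."""
    problems = []
    for name, (kind, limit) in thresholds.items():
        if name not in readings:
            continue  # no reading for this metric, nothing to judge
        value = readings[name]
        too_low = kind == "min" and value < limit
        too_high = kind == "max" and value > limit
        if too_low or too_high:
            problems.append(name)
    return problems


# Made-up sample readings: low disk space and a busy CPU.
sample = {
    "disk_free_percent": 8.0,
    "disk_io_util_percent": 40.0,
    "cpu_usage_percent": 95.0,
    "memory_usage_percent": 60.0,
    "free_file_descriptors": 5000,
}
print(find_problems(sample))  # ['disk_free_percent', 'cpu_usage_percent']
```

In practice you would feed in values gathered by Telegraf and stored in InfluxDB rather than hard-coded numbers; `disk_free_percent()` is included just to show one way a single reading could be taken locally.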
Take a look at their blog post and try gathering metrics with InfluxDB and the Graylog Telegraf plugin for yourself! We think you will be pleased with the results!