When it comes to infrastructure monitoring, many people monitor processor temperature but neglect hard drive temperature. Drive temperature matters because an overheated hard drive can cause data loss, data corruption, system crashes, or outright drive failure. A malfunctioning processor or graphics card can be replaced, but a failed hard drive can be far more costly, because the important files stored on it may no longer be recoverable.
What is Hddtemp?
Hddtemp is a Linux utility that reports the temperature of a hard drive by reading its Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) data. Please note that not all drives support this tool.
S.M.A.R.T. collects metrics such as temperature, the number of reallocated sectors, and seek errors to measure drive health, predict possible failures, and provide notifications on unsafe values.
Why use the Hddtemp Telegraf Plugin?
The hddtemp utility works with many PATA, SATA, and SCSI hard drives to monitor and report hard drive temperature, so the Hddtemp Telegraf Plugin is an easy way to capture hard drive temperature metrics as part of your infrastructure monitoring stack. You can select which drives to collect these metrics from and compare them with metrics about the drive's performance from other Telegraf plugins, giving you a holistic picture of the drive's overall health.
How to monitor Hddtemp using the Telegraf plugin
The Hddtemp Telegraf Input Plugin reads data from the hddtemp daemon. By default, the plugin gathers temperature data from all disks detected by hddtemp, but you can configure it to collect temperatures only from selected disks. Because the plugin depends on the daemon, hddtemp must be installed and its daemon must be running. Hddtemp requires root privileges, and the hddtemp command must be followed by the location of at least one drive. To monitor several drives, list them separated by spaces.
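To make the daemon's role more concrete, here is a minimal Python sketch of how a client might read and parse what a running hddtemp daemon reports. It rests on a few assumptions that are not stated in this article: that the daemon listens on TCP port 7634 on localhost (hddtemp's usual default), that it replies with pipe-delimited records of the form `|device|model|temperature|unit|`, and that any non-numeric temperature (for example, for a sleeping drive) is a status code rather than a reading. The function names `read_hddtemp` and `parse_hddtemp` are hypothetical helpers for illustration only; this is not the plugin's actual implementation.

```python
import socket


def parse_hddtemp(payload):
    """Parse a hddtemp daemon reply into a list of per-disk dicts.

    Assumed format: '|/dev/sda|MODEL|35|C||/dev/sdb|MODEL|SLP|*|',
    i.e. each disk contributes five '|'-separated fields, the first empty.
    """
    fields = payload.strip().split("|")
    disks = []
    for index in range(len(fields) // 5):
        offset = index * 5
        _, device, model, temp, unit = fields[offset:offset + 5]
        try:
            temperature = int(temp)
        except ValueError:
            # Assumption: non-numeric values are status codes, not readings.
            temperature = None
        disks.append(
            {
                "device": device,
                "model": model,
                "temperature": temperature,
                "unit": unit,
                "raw": temp,
            }
        )
    return disks


def read_hddtemp(host="127.0.0.1", port=7634, timeout=5.0):
    """Fetch the raw reply from a hddtemp daemon (assumed host and port)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the daemon closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Illustrative payload only; call read_hddtemp() against a live daemon.
    sample = "|/dev/sda|ST3500418AS|35|C||/dev/sdb|WDC WD10EZEX|SLP|*|"
    for disk in parse_hddtemp(sample):
        print(disk["device"], disk["temperature"], disk["unit"])
```

With a daemon running (typically started as root with something like `hddtemp -d /dev/sda /dev/sdb`), calling `parse_hddtemp(read_hddtemp())` should return one entry per monitored drive. The device names and models in the sample payload are made up for the example.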
Key Hddtemp metrics to use for monitoring
The plugin collects the temperature of each specified device and reports the following metrics, which will help you identify potential issues: