Customer Success Story: NetApp
NetApp uses InfluxDB for real-time resource trending, SLO/SLI calculations, and alerting. The SRE team relies heavily on identifying trends in resource consumption for critical Linux servers within their infrastructure, as well as for database monitoring and custom resource monitoring. The company has been using TICKscript for downsampling and alerting and is now starting to evaluate Flux.
The company has found that InfluxDB offers high ingest throughput, integrates well with other tools, and is extremely performant. The team can monitor multiple systems efficiently and integrate with Grafana, their preferred tool for displaying dashboards. They have also found the Slack integration very useful, since Slack is what the team uses to communicate across the globe. When an alert is triggered by data stored in InfluxDB, team members in India see it at the same time as their colleagues in the US, which lets the company coordinate quick responses.
Lead Site Reliability Engineer Dustin Sorge likes that InfluxDB is highly effective for storing and processing time series data. Time series data has allowed the SRE team to efficiently detect trends that can lead to failure conditions within their environment. System data collected via Telegraf (for example, trends in memory and CPU usage) is also useful when investigating failure conditions, which is key to the SRE postmortem process.
Sorge recommends checking out the Slack integrations. NetApp currently uses Kapacitor to send alerts to Slack via webhooks. This lets their globally distributed team work seamlessly from a shared foundation of time series data stored in InfluxDB. Sorge is also looking forward to trying out the Starlark processor in Telegraf.