Reduce SRE Stress: Minimizing Service Downtime with Grafana, InfluxDB and Telegraf
Session date: 2022-07-19 08:00:00 (Pacific Time)
NetApp is a global cloud-led, data-centric software company and an industry leader in hybrid cloud data services and data management solutions. Their platform enables customers to store and share large quantities of digital data across physical and hybrid cloud environments. NetApp Engineering’s Site Reliability Engineering team is tasked with supporting the company’s internal build, test, and automation infrastructure. After collecting their time-stamped data in InfluxDB, they use Kapacitor to push alerts directly to Slack via webhooks, so their globally distributed SRE team is able to collaborate and troubleshoot seamlessly. Discover how NetApp uses a time series platform to detect, in real time, trends that can lead to failures within their environments, and to provide key metrics used in SRE postmortems.
Join this webinar as Dustin Sorge dives into:
- NetApp’s approach to monitoring their SRE team’s metrics, including SLOs and SLIs
- Their best practices and techniques for monitoring memory usage and CPU usage
- How they use InfluxDB and Telegraf to detect trends and coordinate fixes faster
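The alerting flow described above (a threshold check on CPU or memory samples, followed by a Slack webhook notification) can be sketched roughly as follows. This is not NetApp's setup and not Kapacitor itself, which would normally express this in a TICKscript; it is a plain Python stand-in meant only to illustrate the idea. The function names, host name, sample values, and the 90% threshold are all hypothetical. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request


def sustained_breach(samples, threshold, min_consecutive=3):
    """Return True if the last `min_consecutive` samples all exceed `threshold`.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    if len(samples) < min_consecutive:
        return False
    return all(s > threshold for s in samples[-min_consecutive:])


def build_slack_alert(host, metric, value, threshold):
    """Build the JSON payload for a Slack incoming webhook (uses the `text` field)."""
    return {
        "text": f"[ALERT] {host}: {metric} at {value:.1f}% (threshold {threshold:.1f}%)"
    }


def post_to_slack(webhook_url, payload, timeout=5):
    """POST the payload as JSON to a Slack incoming webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Made-up CPU usage percentages, standing in for values queried from InfluxDB.
    cpu_samples = [72.0, 91.5, 93.2, 95.8]
    if sustained_breach(cpu_samples, threshold=90.0):
        payload = build_slack_alert("build-host-01", "cpu_usage", cpu_samples[-1], 90.0)
        print(json.dumps(payload))
        # To actually send it, call post_to_slack(webhook_url, payload)
        # with your own incoming-webhook URL.
```

Checking several consecutive samples rather than a single reading is one simple way to alert on a trend instead of a momentary spike; a real deployment would likely tune the window and threshold per service.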
Dustin Sorge
Lead Site Reliability Engineer, NetApp
Dustin lives in Pittsburgh, Pennsylvania and is the Site Reliability Engineering Technical Lead for NetApp’s ONTAP Engineering organization. His team has been using InfluxDB for 4+ years and continues to rely on it to support critical services. He is a proud alumnus of both the University of Pittsburgh and Carnegie Mellon University. Prior to joining NetApp, he was a High Performance Computing Operations Engineer and Software Engineer at the Pittsburgh Supercomputing Center.