Intel Leverages Telegraf to Deliver Platform Visibility
Jul 14, 2023
This article was written by Paweł Żak, Telegraf Maintainer and Project Technical Lead at Intel. Scroll down to view the author’s profile.
Since 2020, the Intel team has been contributing to Telegraf. These contributions cover both telemetry from Intel-specific platform features (such as Intel® Resource Director Technology, Intel® Dynamic Load Balancer, or power statistics from Intel-based platforms) and telemetry gathered from generic tools and frameworks, such as Data Plane Development Kit (DPDK), Libvirt, P4 Runtime, or Reliability, Availability, and Serviceability (RAS).
Indirectly (through Intel® Platform Telemetry Insights) or directly, companies utilizing the telemetry we provide use InfluxData products.
Intel® Platform Telemetry Insights is reporting software that uses Telegraf as its base source for metrics. It translates platform telemetry into networking and operational data and provides insights on platform reliability, utilization, congestion, and configuration issues. These insights can be used to notify NetOps and trigger remediation actions as part of an observability solution in closed-loop systems. You can find more details here.
Here is a brief summary of the Telegraf plugins we delivered:
Intel PowerStat Input Plugin
This plugin monitors power statistics on Intel-based platforms. These statistics are crucial for monitoring and analytics systems to take preventive/corrective actions based on platform busyness, CPU temperature, actual CPU utilization, and power statistics. You can also use these metrics to monitor power consumption to help make decisions on how best to save energy.
Main use cases for these systems are:
- Sustainability/power savings
- Workload migration and smart workload placement
- Overload/congestion detection
- Power consumption anomaly detection
Metrics exposed by the Intel PowerStat Input Plugin:
- Percentage of time that CPU Core spent in C0/C1/C6 Core residency states
- Current operational frequency of CPU Core
- Current temperature of CPU Core
- CPU Core Busy Frequency measured as frequency adjusted to CPU Core busy cycles
- Current power consumption of processor package and processor package DRAM subsystem
- Maximum reachable turbo frequency for number of cores active
- Minimum and maximum uncore frequency limits for die in processor package
- Current uncore frequency for die in processor package
- CPU Base Frequency (maximum non-turbo frequency) for the processor package
Intel® Resource Director Technology (Intel® RDT) Input Plugin
This plugin collects information provided by monitoring features of the Intel® Resource Director Technology. Intel® RDT provides a framework with several component features for cache and memory monitoring and allocation capabilities, including Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Code and Data Prioritization (CDP), Memory Bandwidth Monitoring (MBM), and Memory Bandwidth Allocation (MBA). These technologies enable tracking and control of shared resources, such as the Last Level Cache (LLC) and main memory (DRAM) bandwidth, in use by many applications, containers, or VMs running on the platform concurrently. Intel® RDT may aid “noisy neighbor” detection and help to reduce performance interference, ensuring the performance of key workloads in complex environments.
Metrics exposed by Intel® RDT Input Plugin:
- Memory bandwidth utilization by the relevant CPU core/process on the local NUMA memory channel
- Memory bandwidth utilization by the relevant CPU core/process on the remote NUMA memory channel
- Total memory bandwidth utilized by a CPU core/process on local and remote NUMA memory channels
- Total Last Level Cache occupancy by a CPU core/process
- Total Last Level Cache misses by a CPU core/process
- Total instructions per cycle executed by a CPU core/process
Intel Performance Monitoring Unit Plugin
This input plugin exposes Intel PMU (Performance Monitoring Unit) metrics available through the Linux Perf subsystem. PMU metrics provide insight into the performance and health of an IA processor’s internal components, including core and uncore units. With the number of cores increasing and processor topology getting more complex, insight into those metrics is vital to assure the best CPU performance and utilization.
Performance counters are CPU hardware registers that count hardware events, such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots. You can find a full list of events for specific architectures here.
This plugin can measure both core and uncore events. Each core event may be counted separately on every available CPU core. In contrast, uncore events may be placed in many PMUs within a given CPU package. The plugin lets you choose the core IDs (for core events) or socket IDs (for uncore events) on which counting is performed. Uncore events are activated separately on all of a socket’s PMUs and can be exposed as separate measurements or summed into one measurement.
The following counters are exposed for each activated core or uncore event:
- enabled - time counter, contains time the associated perf event was enabled
- running - time counter, contains time the event was actually counted
- raw - value counter, contains event count value during the time the event was actually counted
- scaled - value counter, contains approximated value of counter if the event was continuously counted, using a scaled = raw * (enabled / running) formula
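The scaling formula in the list above can be sketched in a few lines. This is a minimal illustration, not the plugin's own code: the function name `scale_count` is hypothetical, and returning zero when the event was never scheduled is an assumption made here for the sake of the example.

```python
def scale_count(raw: int, enabled: int, running: int) -> float:
    """Estimate the count an event would have reached if counted continuously.

    Applies scaled = raw * (enabled / running). `enabled` and `running`
    are time counters expressed in the same unit.
    """
    if running == 0:
        # Assumption: the event was never actually counted,
        # so there is nothing to extrapolate from.
        return 0.0
    return raw * (enabled / running)


# The event was counted for only half of the time it was enabled.
print(scale_count(raw=1_000, enabled=200, running=100))  # 2000.0
```

When `enabled` equals `running`, the event was counted the whole time and the scaled value equals the raw value.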
RAS Daemon Input Plugin
This plugin gathers and counts metrics for Machine Check Errors provided by RASDaemon. Platform Reliability, Availability, and Serviceability (part of Intel® Run Sure Technology) are key functionalities used in modern Data Centers to assure required Quality of Service for customer workloads. A basic RAS requirement is to offer platform telemetry related to RAS, which allows Orchestrator/Data Center administrators to monitor the health of the platform and take corrective/preventive actions when necessary.
Counters exposed by RAS Daemon Input Plugin:
- Corrected Errors count on all memory controllers during Read operation
- Uncorrectable Errors count on all memory controllers during Read operation (including Recoverable and Fatal errors)
- Corrected Errors count on all memory controllers during Write operation
- Uncorrectable Errors count on all memory controllers during Write operation (including Recoverable and Fatal errors)
- Errors count on Instruction & Data Cache Level0, Level1
- Errors count on TLB Level0, Level1
- Errors count related to Level 2 Cache
- Intel® UPI Errors
- Counter of base processor errors as reported by MCE (simple error codes)
- Counter of BUS Errors as reported by processor via MCE
- Counter of internal timer errors reported by MCE
- Counter of SMM Handler Code Access Violations reported by MCE when SMM handler attempts to execute outside the ranges specified by SMRR
- Counter of internal parity errors reported by MCE
- Counter of Functional Redundancy Check errors reported by MCE
- Counter of external errors as reported by MCE (caused by Bus Init from another processor)
- Counter of Parity errors in internal microcode ROM reported by MCE
- Counter of unclassified errors as returned by MCE
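Because these are counters, a monitoring system would typically react to increases between two samples, not to absolute values. The sketch below illustrates that idea under stated assumptions: the function `new_errors` is hypothetical, the field names and values are made up for illustration, and the counters are assumed to only grow between samples (a drop, for example after a restart, is ignored).

```python
from typing import Dict


def new_errors(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return the counters that increased between two samples, with their deltas."""
    deltas = {}
    for name, value in current.items():
        delta = value - previous.get(name, 0)
        if delta > 0:
            # Non-positive deltas (no change or a counter reset) are skipped.
            deltas[name] = delta
    return deltas


# Hypothetical field names and values, for illustration only.
before = {"memory_read_corrected_errors": 3, "memory_read_uncorrectable_errors": 0}
after = {"memory_read_corrected_errors": 5, "memory_read_uncorrectable_errors": 0}
print(new_errors(before, after))  # {'memory_read_corrected_errors': 2}
```

An alerting rule could then treat any non-empty result for an uncorrectable-error counter as a trigger for corrective action.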
Intel® Dynamic Load Balancer (Intel® DLB) Input Plugin
This plugin collects metrics exposed by applications built with Data Plane Development Kit (DPDK), an extensive set of open source libraries designed for accelerating packet processing workloads. More specifically, it targets applications that use Intel® DLB as eventdev devices accessed via a bifurcated driver (allowing access from both kernel and user space).
The Intel® Dynamic Load Balancer (Intel® DLB) is a PCIe device that provides load-balanced, prioritized scheduling of events (that is, packets) across CPU cores, enabling efficient core-to-core communication. It is a hardware accelerator located inside the latest Intel® Xeon® devices. It supports the event-driven programming model of DPDK’s Event Device Library. This library is used in packet processing pipelines for multi-core scalability, dynamic load-balancing, and a variety of packet distribution and synchronization schemes.
Some key metrics provided by this plugin include:
- statistics for an eventdev
- statistics for an eventdev port
- statistics for an eventdev queue
- list of queues linked with a specified eventdev port and a service priority associated with each link
Data Plane Development Kit (DPDK) Input Plugin
This plugin collects metrics exposed by applications built with Data Plane Development Kit (DPDK), which is an extensive set of open source libraries designed for accelerating packet processing workloads.
DPDK provides APIs that enable exposing various statistics from the devices used by DPDK applications and enable exposing KPI metrics directly from applications. Device statistics include common statistics available across NICs, like received and sent packets, received and sent bytes, etc. In addition to these generic statistics, an extended statistics API is available that provides more detailed, driver-specific metrics that are not available as generic statistics.
Some key metrics provided by this plugin include:
- basic device statistics
- extended device statistics
- up/down link status
- application-specific metrics
Intel Baseband Accelerator Input Plugin
This plugin collects metrics from both dedicated and integrated Intel devices that provide Wireless Baseband hardware acceleration. These devices play a key role in accelerating 5G and 4G Virtualized Radio Access Networks (vRAN) workloads, increasing the overall compute capacity of commercial, off-the-shelf platforms.
- Intel® vRAN Boost integrated accelerators:
  - 4th Gen Intel® Xeon® Scalable processor with Intel® vRAN Boost (also known as Sapphire Rapids Edge Enhanced / SPR-EE)
- External expansion cards connected to the PCI bus:
  - Intel® vRAN Dedicated Accelerator ACC100 SoC (code named Mount Bryce)
Intel Baseband devices integrate various features critical for 5G and LTE (Long Term Evolution) networks, including:
- Forward Error Correction (FEC) processing
- 4G Turbo FEC processing
- 5G Low Density Parity Check (LDPC)
- a Fast Fourier Transform (FFT) block providing DFT/iDFT processing offload for the 5G Sounding Reference Signal (SRS)
Exposed metrics contain information about:
- type of metric: “code_blocks”, “data_bytes”, “per_engine” and its value
- type of operation: “5GUL”, “5GDL”, “4GUL”, “4GDL”, “FFT”
- virtual function number
- engine number
Libvirt Input Plugin
This plugin collects statistics about virtualized guests on a system via the libvirt API, created by Red Hat’s Emerging Technology group. Metrics are gathered directly from the hypervisor on a host system, which means that Telegraf doesn’t have to be installed and configured on a guest system.
All available metrics for the following statistics groups can be exposed:
- vcpu_mapping - list of physical CPUs mapped to particular VCPU
P4 Runtime Input Plugin
This plugin gathers metrics about Counter values present in a P4 program loaded onto a networking device. Metrics are collected through a gRPC connection with the P4Runtime server.
P4 is a language for programming the data plane of network devices, such as Programmable Switches or Programmable Network Interface Cards. The P4Runtime API is a control plane specification for dynamically managing the data plane elements of those devices, as defined by a P4 program.
The following metrics are exposed for all counters in P4 programs:
- number of bytes gathered in counter
- number of packets gathered in counter
- index at which metrics are collected in P4 counter
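Byte and packet counters are usually most useful as rates. The sketch below derives per-second rates and a mean packet size from two successive samples of one counter index. It is an illustration only: the function `counter_rates`, the dictionary keys, and the sample values are hypothetical, and it assumes the counter values only grow between samples.

```python
from typing import Dict


def counter_rates(prev: Dict[str, int], cur: Dict[str, int], interval_s: float) -> Dict[str, float]:
    """Derive rates from two samples of a byte/packet counter pair."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    d_bytes = cur["bytes"] - prev["bytes"]
    d_packets = cur["packets"] - prev["packets"]
    return {
        "bytes_per_s": d_bytes / interval_s,
        "packets_per_s": d_packets / interval_s,
        # Avoid dividing by zero when no packets arrived in the interval.
        "mean_packet_bytes": d_bytes / d_packets if d_packets > 0 else 0.0,
    }


# Hypothetical samples taken 10 seconds apart.
prev = {"bytes": 1_000, "packets": 10}
cur = {"bytes": 16_000, "packets": 20}
print(counter_rates(prev, cur, interval_s=10.0))
# {'bytes_per_s': 1500.0, 'packets_per_s': 1.0, 'mean_packet_bytes': 1500.0}
```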
S.M.A.R.T. Input Plugin
(Telegraf 1.5.0+, metrics for Intel NVMe devices available since 1.16.0)
S.M.A.R.T. is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of anticipating hardware failures.
The plugin supports gathering metrics from the smartmontools tool. We added the possibility to also gather statistics from NVMe drives (using the nvme-cli tool), including vendor-specific attributes.
The following Intel-specific attributes are exposed for NVMe devices:
- Program Fail Count
- Erase Fail Count
- Wear Leveling Count
- End to End Error Detection Count
- CRC Error Count
- Timed Workload, Media Wear
- Timed Workload, Host Reads
- Timed Workload, Timer
- Thermal Throttle Status
- Retry Buffer Overflow Counter
- PLL Lock Loss Count
- NAND Bytes Written
- Host Bytes Written
About the author
Paweł Żak is a Telegraf Maintainer and Project Technical Lead at Intel Corporation where he is responsible for enabling Intel telemetry for easy consumption. He holds a Master of Science in Computer Science from the Gdansk University of Technology, Poland. When he is not behind a screen, you can find him running, riding a bike, or orienteering.