Intel Leverages Telegraf to Deliver Platform Visibility

By Community / Developer
Jul 14, 2023

This article was written by Paweł Żak, Telegraf Maintainer and Project Technical Lead at Intel. Scroll down to view the author’s profile.

Since 2020, the Intel team has been contributing plugins to Telegraf. They cover both telemetry from Intel-specific platform features (such as Intel® Resource Director Technology, Intel® Dynamic Load Balancer, or power statistics from Intel-based platforms) and telemetry gathered from generic tools and frameworks, such as Data Plane Development Kit (DPDK), Libvirt, P4 Runtime, or Reliability, Availability, and Serviceability (RAS).

Indirectly (through Intel® Platform Telemetry Insights) or directly, companies utilizing the telemetry we provide use InfluxData products.

Intel® Platform Telemetry Insights uses Telegraf as a base source for metrics, to report software, translate platform telemetry into networking and operational data, and provide insights on platform reliability, utilization, congestion, and configuration issues. These insights can be used to notify NetOps and trigger remediation actions as part of an observability solution in closed loop systems. You can find more details here.

Here is a brief summary of the Telegraf plugins we delivered:

Intel PowerStat Input Plugin

(Telegraf 1.17.0+)

This plugin monitors power statistics on Intel-based platforms. These statistics are crucial for monitoring and analytics systems to take preventive/corrective actions based on platform busyness, CPU temperature, actual CPU utilization, and power statistics. You can also use these metrics to monitor power consumption to help make decisions on how best to save energy.

Main use cases for these systems are:

Sustainability/power savings
Workload migration and smart workload placement
Overload/congestion detection
Power consumption anomaly detection

Metrics exposed by the Intel PowerStat Input Plugin:

Percentage of time that CPU Core spent in C0/C1/C6 Core residency states
Current operational frequency of CPU Core
Current temperature of CPU Core
CPU Core Busy Frequency measured as frequency adjusted to CPU Core busy cycles
Current power consumption of processor package and processor package DRAM subsystem
Maximum reachable turbo frequency for number of cores active
Minimum and maximum uncore frequency limits for die in processor package
Current uncore frequency for die in processor package
CPU Base Frequency (maximum non-turbo frequency) for the processor package

Intel® Resource Director Technology (Intel® RDT) Input Plugin

(Telegraf 1.16.0+)

This plugin collects information provided by monitoring features of the Intel® Resource Director Technology. Intel® RDT provides a framework with several component features for cache and memory monitoring and allocation capabilities, including Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Code and Data Prioritization (CDP), Memory Bandwidth Monitoring (MBM), and Memory Bandwidth Allocation (MBA). These technologies enable tracking and control of shared resources, such as the Last Level Cache (LLC) and main memory (DRAM) bandwidth, in use by many applications, containers, or VMs running on the platform concurrently. Intel® RDT may aid “noisy neighbor” detection and help to reduce performance interference, ensuring the performance of key workloads in complex environments.

Metrics exposed by Intel® RDT Input Plugin:

Memory bandwidth utilization by the relevant CPU core/process on the local NUMA memory channel
Memory bandwidth utilization by the relevant CPU core/process on the remote NUMA memory channel
Total memory bandwidth utilized by a CPU core/process on local and remote NUMA memory channels
Total Last Level Cache occupancy by a CPU core/process
Total Last Level Cache misses by a CPU core/process
Total instructions per cycle executed by a CPU core/process
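
Two of the metrics above are derived quantities: total bandwidth is the sum of the local and remote figures, and instructions per cycle is a ratio. The following is a minimal sketch of those relationships only, not the plugin's implementation. The class `RdtSample`, its field names, and the sample values are hypothetical, and no particular unit is implied.

```python
from dataclasses import dataclass


@dataclass
class RdtSample:
    """One hypothetical reading for a CPU core or process (names are illustrative)."""
    local_bandwidth: float   # bandwidth on the local NUMA memory channel
    remote_bandwidth: float  # bandwidth on the remote NUMA memory channel
    instructions: int        # instructions executed in the sampling window
    cycles: int              # CPU cycles elapsed in the sampling window

    @property
    def total_bandwidth(self) -> float:
        # Total = local + remote NUMA memory channel bandwidth.
        return self.local_bandwidth + self.remote_bandwidth

    @property
    def ipc(self) -> float:
        # Assumption: report 0.0 when no cycles were observed.
        return self.instructions / self.cycles if self.cycles else 0.0


sample = RdtSample(local_bandwidth=120.0, remote_bandwidth=30.0,
                   instructions=4_000, cycles=2_000)
print(sample.total_bandwidth, sample.ipc)
```

A process whose remote share of total bandwidth grows over time, or whose instructions per cycle drop while a neighbor's cache occupancy rises, is the kind of pattern a "noisy neighbor" check might look for.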

Intel Performance Monitoring Unit Plugin

(Telegraf 1.21.0+)

This input plugin exposes Intel PMU (Performance Monitoring Unit) metrics available through the Linux perf subsystem. PMU metrics provide insight into the performance and health of the internal components of an Intel Architecture (IA) processor, including core and uncore units. With the number of cores increasing and processor topology getting more complex, insight into those metrics is vital to assure the best CPU performance and utilization.

Performance counters are CPU hardware registers that count hardware events, such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots. You can find a full list of events for specific architectures here.

This plugin enables measuring both core and uncore events. Each core event can be counted separately on every available CPU core. In contrast, an uncore event can be placed in many PMUs within a specified CPU package. The plugin lets you choose the core IDs (for core events) or socket IDs (for uncore events) on which counting should run. Uncore events are activated separately on each of a socket’s PMUs, and can be exposed as separate measurements or summed up as one measurement.
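
The difference between exposing uncore events as separate measurements and summing them per socket can be sketched as follows. This is an illustration under stated assumptions, not the plugin's code: the function `aggregate_uncore`, the unit names, the event name `EVENT_A`, and all counts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (socket id, PMU unit name, event name, count); all values are made up.
Sample = Tuple[int, str, str, int]


def aggregate_uncore(samples: Iterable[Sample], sum_units: bool) -> Dict[tuple, int]:
    """Group uncore counts per PMU unit, or sum them into one value per socket."""
    totals: Dict[tuple, int] = defaultdict(int)
    for socket, unit, event, count in samples:
        # Dropping the unit from the key merges all PMUs of a socket.
        key = (socket, event) if sum_units else (socket, unit, event)
        totals[key] += count
    return dict(totals)


samples = [
    (0, "unit_0", "EVENT_A", 100),
    (0, "unit_1", "EVENT_A", 250),
    (1, "unit_0", "EVENT_A", 75),
]
print(aggregate_uncore(samples, sum_units=False))  # one entry per socket and unit
print(aggregate_uncore(samples, sum_units=True))   # one entry per socket
```

Summing keeps the number of series small, while separate measurements make it possible to spot a single unit that behaves differently from its peers.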

The following counters are exposed for each activated core or uncore event:

enabled - time counter, contains time the associated perf event was enabled
running - time counter, contains time the event was actually counted
raw - value counter, contains event count value during the time the event was actually counted
scaled - value counter, contains approximated value of counter if the event was continuously counted, using a scaled = raw * (enabled / running) formula
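
The scaled formula above can be written out in a few lines. This is a minimal sketch rather than the plugin's implementation; the function name `scale_counter` and the choice to return 0.0 when the event never ran are assumptions made for illustration.

```python
def scale_counter(raw: int, enabled: int, running: int) -> float:
    """Estimate the count an event would have reached had it been counted
    for the whole time it was enabled: scaled = raw * (enabled / running)."""
    if running == 0:
        # Assumption: the event was never counted, so there is nothing to extrapolate.
        return 0.0
    return raw * (enabled / running)


# An event that was counted for only half of the time it was enabled.
print(scale_counter(raw=1_000, enabled=2_000_000, running=1_000_000))
```

When running equals enabled, the event was counted the whole time and the scaled value equals the raw value. The further the two time counters diverge, the more the scaled value is an extrapolation.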

RAS Daemon Input Plugin

(Telegraf 1.16.0+)

This plugin gathers and counts metrics for Machine Check Errors provided by RASDaemon. Platform Reliability, Availability, and Serviceability (part of Intel® Run Sure Technology) are key functionalities used in modern Data Centers to assure required Quality of Service for customer workloads. A basic RAS requirement is to offer platform telemetry related to RAS, which allows Orchestrator/Data Center administrators to monitor the health of the platform and take corrective/preventive actions when necessary.

Counters exposed by RAS Daemon Input Plugin:

Corrected Errors count on all memory controllers during Read operation
Uncorrectable Errors count on all memory controllers during Read operation (including Recoverable and Fatal errors)
Corrected Errors count on all memory controllers during Write operation
Uncorrectable Errors count on all memory controllers during Write operation (including Recoverable and Fatal errors)
Errors count on Instruction & Data Cache Level0, Level1
Errors count on TLB Level0, Level1
Errors count related to Level 2 Cache
Intel® UPI Errors
Counter of base processor errors as reported by MCE (simple error codes)
Counter of BUS Errors as reported by processor via MCE
Counter of internal timer errors reported by MCE
Counter of SMM Handler Code Access Violations reported by MCE when SMM handler attempts to execute outside the ranges specified by SMRR
Counter of internal parity errors reported by MCE
Counter of Functional Redundancy Check errors reported by MCE
Counter of external errors as reported by MCE (caused by Bus Init from another processor)
Counter of Parity errors in internal microcode ROM reported by MCE
Counter of unclassified errors as returned by MCE

Intel® Dynamic Load Balancer (Intel® DLB) Input Plugin

(Telegraf 1.25.0+)

This plugin collects metrics exposed by applications built with Data Plane Development Kit (DPDK), an extensive set of open source libraries designed for accelerating packet processing workloads. More specifically, it targets applications that use Intel® DLB as eventdev devices accessed via a bifurcated driver (allowing access from both kernel and user space).

The Intel® Dynamic Load Balancer (Intel® DLB) is a PCIe device that provides load-balanced, prioritized scheduling of events (that is, packets) across CPU cores, enabling efficient core-to-core communication. It is a hardware accelerator located inside the latest Intel® Xeon® devices. It supports the event-driven programming model of DPDK’s Event Device Library. This library is used in packet processing pipelines for multi-core scalability, dynamic load-balancing, and a variety of packet distribution and synchronization schemes.

Some key metrics provided by this plugin include:

statistics for an eventdev
statistics for an eventdev port
statistics for an eventdev queue
list of queues linked with a specified eventdev port and a service priority associated with each link

Data Plane Development Kit (DPDK) Input Plugin

(Telegraf 1.19.0+)

DPDK provides APIs that enable exposing various statistics from the devices used by DPDK applications and enable exposing KPI metrics directly from applications. Device statistics include common statistics available across NICs, like received and sent packets, received and sent bytes, etc. In addition to these generic statistics, an extended statistics API is available that provides more detailed, driver-specific metrics that are not available as generic statistics.

Some key metrics provided by this plugin include:

basic device statistics
extended device statistics
up/down link status
application-specific metrics

Intel Baseband Accelerator Input Plugin

(Telegraf 1.27.0+)

This plugin collects metrics from both dedicated and integrated Intel devices that provide Wireless Baseband hardware acceleration. These devices play a key role in accelerating 5G and 4G Virtualized Radio Access Networks (vRAN) workloads, increasing the overall compute capacity of commercial, off-the-shelf platforms.

Supported hardware:

Intel® vRAN Boost integrated accelerators:
- 4th Gen Intel® Xeon® Scalable processor with Intel® vRAN Boost (also known as Sapphire Rapids Edge Enhanced / SPR-EE)
External expansion cards connected to the PCI bus:
- Intel® vRAN Dedicated Accelerator ACC100 SoC (code named Mount Bryce)

Intel Baseband devices integrate various features critical for 5G and LTE (Long Term Evolution) networks, including:

Forward Error Correction (FEC) processing
4G Turbo FEC processing
5G Low Density Parity Check (LDPC)
a Fast Fourier Transform (FFT) block providing DFT/iDFT processing offload for the 5G Sounding Reference Signal (SRS)

Exposed metrics contain information about:

type of metric: “code_blocks”, “data_bytes”, “per_engine” and its value
type of operation: “5GUL”, “5GDL”, “4GUL”, “4GDL”, “FFT”
Virtual Function number
engine number

Libvirt Input Plugin

(Telegraf 1.25.0+)

This plugin collects statistics about virtualized guests on a system via the libvirt API, created by Red Hat’s Emerging Technology group. Metrics are gathered directly from the hypervisor on a host system, which means that Telegraf doesn’t have to be installed and configured on a guest system.

All available metrics for the following statistics groups can be exposed:

state
cpu_total
balloon
vcpu
net
perf
block
iothread
memory
dirtyrate
vcpu_mapping - list of physical CPUs mapped to particular VCPU

P4 Runtime Input Plugin

(Telegraf 1.26.0+)

This plugin gathers metrics about Counter values present in a P4 program loaded onto a networking device. Metrics are collected through a gRPC connection with a P4Runtime server.

P4 is a language for programming the data plane of network devices, such as programmable switches or programmable network interface cards. The P4Runtime API is a control plane specification for managing the data plane elements of those devices as defined by a P4 program.

The following metrics are exposed for all counters in P4 programs:

number of bytes gathered in counter
number of packets gathered in counter
index at which metrics are collected in P4 counter

S.M.A.R.T. Input Plugin

(Telegraf 1.5.0+, metrics for Intel NVMe devices available since 1.16.0+)

S.M.A.R.T. is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of anticipating hardware failures.

The plugin gathers metrics using the smartmontools tool. We added the ability to also gather statistics from NVMe drives (using the nvme-cli tool), including vendor-specific attributes.

The following Intel-specific attributes are exposed for NVMe devices:

Program Fail Count
Erase Fail Count
Wear Leveling Count
End to End Error Detection Count
CRC Error Count
Timed Workload, Media Wear
Timed Workload, Host Reads
Timed Workload, Timer
Thermal Throttle Status
Retry Buffer Overflow Counter
PLL Lock Loss Count
NAND Bytes Written
Host Bytes Written

About the author

Paweł Żak is a Telegraf Maintainer and Project Technical Lead at Intel Corporation where he is responsible for enabling Intel telemetry for easy consumption. He holds a Master of Science in Computer Science from the Gdansk University of Technology, Poland. When he is not behind a screen, you can find him running, riding a bike, or orienteering.
