How to Parse Your XML Data with Telegraf

In March, we released Telegraf 1.18, which included a wide range of new input and output plugins. One exciting new addition was an XML Parser Plugin that added support for another input data format to parse into InfluxDB metrics.

What is XML?

XML stands for eXtensible Markup Language and is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XML is similar to HTML in being a markup language but is designed to be self-descriptive and to better store and transport data. For example, when you are trying to exchange data between incompatible systems and data needs to be converted, any data that is incompatible can be lost. XML aims to simplify that data sharing and transportation since it is stored in plain text format. This provides a software- and hardware-independent way of storing, transporting and sharing data.

Understanding your XML data

We will use the terms root, child, sub-child throughout this blog to help you understand which data points you’re trying to parse.

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

XML documents must contain exactly one root element that is the parent of all other elements.

This XML weather example from OpenWeather is a good basic example to help us understand XML data structure and how to parse it.

<current>
   <city id="5004223" name="Oakland">
      <coord lon="-83.3999" lat="42.6667" />
      <country>US</country>
      <timezone>-14400</timezone>
      <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" />
   </city>
   <temperature value="62.26" min="61" max="64.4" unit="fahrenheit" />
   <feels_like value="54.63" unit="fahrenheit" />
   <humidity value="59" unit="%" />
   <pressure value="1007" unit="hPa" />
   <wind>
      <speed value="12.66" unit="mph" name="Moderate breeze" />
      <gusts value="24.16" />
      <direction value="200" code="SSW" name="South-southwest" />
   </wind>
   <clouds value="75" name="broken clouds" />
   <visibility value="10000" />
   <precipitation mode="no" />
   <weather number="803" value="broken clouds" icon="04d" />
   <lastupdate value="2021-03-24T16:15:35" />
</current>

In our weather data, current is the root element, and city, temperature, wind and the other elements at that same level are its child elements.

An XML element is everything from the start tag <element> to the element’s end tag </element>, inclusive. Some tags can close themselves, as in <coord />. Elements themselves can contain:

  • Text - US in <country>US</country>
  • Attributes - lon="-83.3999" and lat="42.6667" in the <coord> element <coord lon="-83.3999" lat="42.6667"/>
    • Attributes are designed to contain data related to a specific element. This will be especially important when we are parsing our data values. They can be emitted in a way that comes off a little strange but are still valid, such as <foo _="dance"></foo>.
  • Child elements - <city> is a child element of <current>, and <coord> is a child element of <city>.

The relationships between elements are described by the terms parent, child, and sibling.

What is XPath?

The Telegraf XML Parser breaks down an XML string into metric fields using XPath expressions and supports most XPath 1.0 functionality. The parser will use XPath syntax to identify and navigate XPath nodes in your XML data. XPath supports over 200 functions, and the functions supported by Telegraf XML Parser are listed in the underlying library repository.

Note: Usually XPath expressions select a node or a node-set, and you have to call functions like string() or number() to access the node’s content. However, when we discuss the Telegraf XML Parser Plugin in more detail below, you’ll see that it handles this in the following way for convenience: both metric_selection and field_selection only select the node or node-set, so they are normal XPath expressions. All other queries return the node’s “string-value” according to the XPath specification. You can convert the types using functions as shown below.

I found this XPath tutorial particularly helpful in understanding XPath terminology and expressions. There is also this XPath cheat sheet that gives you a one page view of using XPath selectors, expressions, functions and more.

Before parsing any data, take a look at your XML and understand the nodes and node-sets of the data you want to parse. This XPath tester will come in really handy in testing out XPath functions and making sure you are querying the correct path to parse specific XML nodes.

Path: current
Description: Selects the child node(s) with the name of current relative to the current node. It will not descend in the node tree and only searches the children of the current node.
XML returned:
<current>
   <city id="5004223" name="Oakland">
      <coord lon="-83.3999" lat="42.6667" />
      <country>US</country>
      <timezone>-14400</timezone>
      <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" />
   </city>
   <temperature value="62.26" min="61" max="64.4" unit="fahrenheit" />
   <feels_like value="54.63" unit="fahrenheit" />
   <humidity value="59" unit="%" />
   <pressure value="1007" unit="hPa" />
   <wind>
      <speed value="12.66" unit="mph" name="Moderate breeze" />
      <gusts value="24.16" />
      <direction value="200" code="SSW" name="South-southwest" />
   </wind>
   <clouds value="75" name="broken clouds" />
   <visibility value="10000" />
   <precipitation mode="no" />
   <weather number="803" value="broken clouds" icon="04d" />
   <lastupdate value="2021-03-24T16:15:35" />
</current>

Path: /current
Description: Selects the root element current.
XML returned: (same as above)

Path: current/city
Description: Selects all city elements that are children of current.
XML returned:
<city id="5004223" name="Oakland">
   <coord lon="-83.3999" lat="42.6667" />
   <country>US</country>
   <timezone>-14400</timezone>
   <sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05" />
</city>

Path: //city
Description: Selects all city elements no matter where they are in the document.
XML returned: (same as above)

Path: current//country
Description: Selects all country elements within the current element, no matter where they are in the XML tree.
XML returned:
<country>US</country>

Path: current//@name
Description: Selects ALL attributes named name.
XML returned:
name="Oakland"
name="Moderate breeze"
name="South-southwest"
name="broken clouds"

Path: current/city/@name or //city/@name
Description: Selects attributes named name under the city element.
XML returned:
name="Oakland"

Path: current/city/*
Description: Selects all the child element nodes under the city element.
XML returned:
<coord lon="-83.3999" lat="42.6667"/>
<country>US</country>
<timezone>-14400</timezone>
<sun rise="2021-03-24T11:29:19" set="2021-03-24T23:50:05"/>

Path: current/city/@*
Description: Selects all attributes in the city element.
XML returned:
id="5004223"
name="Oakland"

W3Schools provides an extensive list of XPath syntax and dives deep into XPath axes with additional examples.

Configuring Telegraf to ingest XML

XML is currently one of the many supported input data formats for Telegraf. This means that any input plugin containing the data_format option can be set to xml and begin parsing your XML data, like this:

data_format = "xml"

Let’s discuss how to get your configuration just right to get that XML data into InfluxDB. As mentioned above, the XML parser breaks down an XML string into metric fields using XPath expressions. XPath expressions are what the parser uses to identify and navigate nodes in your XML data.

Here is the plugin’s default configuration for using the XML parser. As with other Telegraf configs, commented lines start with a pound sign (#).

[[inputs.tail]]
  files = ["example.xml"]

  ## Data format to consume.
  ## Each data format has its own unique set of configuration options, read
  ## more about them here:
  ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
  data_format = "xml"

  ## Multiple parsing sections are allowed
  [[inputs.tail.xml]]
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    #metric_selection = "/Bus/child::Sensor"

    ## Optional: XPath-query to set the metric (measurement) name.
    #metric_name = "string('example')"

    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    #timestamp = "/Gateway/Timestamp"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    #timestamp_format = "2006-01-02T15:04:05Z"

    ## Tag definitions using the given XPath queries.
    [inputs.tail.xml.tags]
      name   = "substring-after(Sensor/@name, ' ')"
      device = "string('the ultimate sensor')"

    ## Integer field definitions using XPath queries.
    [inputs.tail.xml.fields_int]
      consumers = "Variable/@consumers"

    ## Non-integer field definitions using XPath queries.
    ## The field type is defined using XPath expressions such as number(), boolean() or string(). If no conversion is performed the field will be of type string.
    [inputs.tail.xml.fields]
      temperature = "number(Variable/@temperature)"
      power       = "number(Variable/@power)"
      frequency   = "number(Variable/@frequency)"
      ok          = "Mode != 'ok'"

Let’s walk through all the steps and components that will make up your XML parser configuration. Whenever you are setting up an XPath query in your configuration, the specified path can be absolute (starting with /) or relative. Relative paths use the currently selected node as reference.

  1. Select subset of nodes you want to parse (optional) If you wish to parse only a subset of your XML data, you will use the metric_selection field to designate which part. In our weather example, if we only wanted to parse the data under the wind element, we would set this to current//wind. Let's go ahead and actually read the entire weather XML document, so I'm going to set metric_selection = "/current". There will be one metric per node selected by metric_selection. A benefit of setting this field is that in subsequent configuration fields, I won't need to add "current/" to my queries' paths.
  2. Set measurement name (optional) You can override the default measurement name (which will most likely be the plugin name) by setting the metric_name field. I'm going to set metric_name = "'weather'" to change the measurement name from http to weather. You can also set the XPath query for metric_name to derive the measurement name directly from a node in the XML document.
  3. Set the value you want as your timestamp and its format (optional) If your XML data contains a specific timestamp you want to assign to your metrics, you will need to set the XPath query of that value. Our weather data has a lastupdate value that indicates the exact time this weather data was recorded. I'll set timestamp = "lastupdate/@value" to read in that value as my timestamp. If the timestamp field isn't set, the current time will be used as the timestamp for all created metrics.
    From there, you can designate the format of the timestamp you just selected. This timestamp_format can be set to unix, unix_ms, unix_us, unix_ns, or an accepted Go "reference time". If timestamp_format isn't configured, Telegraf will assume your timestamp query is in unix format.
  4. Set the tags you want from your XML data To designate the values in your XML you want as your tags, you will need to configure a tags subsection such as [inputs.http.xml.tags]. In your subsection you will add a line for each tag in tag_name = query format, where query is the XPath query. For our weather data, I will add the city and country names as tags with city = "city/@name" and country = "city/country". Multiple tags can be set under one subsection.
  5. Configure the fields of integer type you want from your XML data For your XML data values that are integers that you want to read in as fields, you must configure the field names and XPath queries in a fields_int subsection such as [inputs.http.xml.fields_int]. This is because XML values are limited to a single type, string, so all your data will be of type string if not converted by an XPath function. This will follow the field_name = query format. In our weather data, values such as humidity and clouds are always integers so we will configure them in this subsection. Results of these fields_int queries will always be converted to int64.
    [inputs.http.xml.fields_int]
    humidity = "humidity/@value"
    clouds = "clouds/@value"
  6. Configure the rest of your fields. Be sure to indicate the data type in the XPath function. To add non-integer fields to the metrics, you will add the proper XPath query in a general fields subsection (ex: [inputs.http.xml.fields]) in the field_name = query format. It's crucial here to specify the data type of the field in your XPath query using the type conversion functions of XPath such as number(), boolean() or string(). If no conversion is performed in the query, the field will be of type string. In our weather data we have a combination of number and string values. For example, our wind speed is a number and will be specified as wind_speed = "number(wind/speed/@value)" whereas the wind description is text and will be formatted as a string in wind_desc = "string(wind/speed/@name)".
  7. Select a set of nodes from your XML data you want to parse as fields (optional) If you have a large XML file with a large number of fields that would otherwise need to be individually configured, you can select a subset of them by configuring field_selection with an XPath query to the selection of nodes. This setting will also be commonly used if the node names are not yet known (ex: value of precipitation is not populated unless it's actively raining). Each node that is selected by field_selection forms a new field within the metric.
    You can set the name and value of each field by using the optional field_name and field_value XPath queries. If these queries are not specified, the field's name defaults to the node name and the field's value defaults to the content of the selected field node. It is important to note that field_name and field_value queries are only used if field_selection is specified. You can also use these settings in combination with the other field specification subsections. Based on the multi-node London bicycle example below, to retrieve all the attributes in the info elements, your field_selection settings would be configured as
    field_selection = "child::info"
    field_name = "name(@*[1])"
    field_value = "number(@*[1])"
  8. Expand field names to a path relative to the selected node (optional) If you want your field names that have been selected with field_selection to be expanded to a path relative to the selected node, you will need to set field_name_expansion = true. This setting allows you to flatten out nodes with non-unique names in the subtree. This would be necessary if we selected all leaf nodes as fields and those leaf nodes did not have unique names. If field_name_expansion wasn't set, we would end up with duplicate names in the fields.
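
To make steps 7 and 8 more concrete for the weather data, here is a minimal sketch that collects every child of wind carrying a value attribute as a field. The selection query is my own illustration and is not part of the configuration used later in this post. It relies on the default field name (the node name) described in step 7. How exactly the expanded names are written out is an assumption, so check the result with telegraf --test before relying on it.

```toml
[[inputs.http]]
  urls = ["http://api.openweathermap.org/data/2.5/weather?q=Oakland&appid=$API_KEY&mode=xml&units=imperial"]
  data_format = "xml"

  [[inputs.http.xml]]
    metric_selection = "/current"
    metric_name = "'weather'"

    ## Illustrative selection: speed, gusts and direction all carry a value attribute.
    field_selection = "wind/*[@value]"
    ## Read the value attribute of each selected node as a number.
    field_value = "number(@value)"
    ## Assumption: expands each field name to its path relative to <current>.
    field_name_expansion = true
```

Without field_name_expansion, the three fields would simply be named after their nodes (speed, gusts, direction), which is fine here because those names are unique.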

Examples!

Basic Parsing example: OpenWeather XML data

I have been referencing the OpenWeatherMap XML API response so far in this blog when explaining XML concepts and the steps for configuring your XML parser. This configuration should help you understand how to parse somewhat simple XML data with Telegraf. There is also a 5 day OpenWeather forecast test case in the plugin’s testcases folder.

You can sign up for a free API key to retrieve this XML data over HTTP. Once you have your API key (this may take a few hours after signing up), you can set your URL to specify the location(s) of your weather. My configuration below retrieves Oakland, New York, and London current weather data in imperial units (blame us Americans for not knowing the metric system :)). If you want to test the example below, make sure you set your API_KEY as an environment variable to be read by the Telegraf config.

Weather configuration:

[[inputs.http]]
  ## OpenWeatherMap API, need to register for $API_KEY: https://openweathermap.org/api
  urls = [
    "http://api.openweathermap.org/data/2.5/weather?q=Oakland&appid=$API_KEY&mode=xml&units=imperial",
    "http://api.openweathermap.org/data/2.5/weather?q=New%20York&appid=$API_KEY&mode=xml&units=imperial",
    "http://api.openweathermap.org/data/2.5/weather?q=London&appid=$API_KEY&mode=xml&units=imperial"
  ]
  data_format = "xml"
  ## Drop url and hostname from list of tags
  tagexclude = ["url", "host"]

  ## Multiple parsing sections are allowed
  [[inputs.http.xml]]
    ## Optional: XPath-query to set the metric (measurement) name.
    metric_name = "'weather'"
    ## Optional: XPath-query to select a subset of nodes from the XML document.
    metric_selection = "/current"
    ## Optional: Query to extract metric timestamp.
    ## If not specified the time of execution is used.
    timestamp = "lastupdate/@value"
    ## Optional: Format of the timestamp determined by the query above.
    ## This can be any of "unix", "unix_ms", "unix_us", "unix_ns" or a valid Golang
    ## time format. If not specified, a "unix" timestamp (in seconds) is expected.
    timestamp_format = "2006-01-02T15:04:05"
    
    ## Tag definitions using the given XPath queries.
    [inputs.http.xml.tags]
      city = "city/@name"
      country = "city/country"

    ## Integer field definitions using XPath queries.
    [inputs.http.xml.fields_int]
      humidity = "humidity/@value"
      clouds = "clouds/@value"

    ## Non-integer field definitions using XPath queries.
    ## The field type is defined using XPath expressions such as number(), boolean() or string(). If no conversion is performed the field will be of type string.
    [inputs.http.xml.fields]
      temperature = "number(temperature/@value)"
      precipitation = "number(precipitation/@value)"
      wind_speed = "number(wind/speed/@value)"
      wind_desc = "string(wind/speed/@name)"
      clouds_desc = "string(clouds/@name)"
      lat = "number(city/coord/@lat)"
      lon = "number(city/coord/@lon)"
      ## If "precipitation/@mode" value returns "no", is_it_raining will return false
      is_it_raining = "precipitation/@mode = 'yes'"

Most of the settings for this weather configuration are explained above. The last field, is_it_raining, shows how you can use an XPath operator in your configuration. XPath expressions can return a node-set, a string, a Boolean, or a number; this comparison returns a Boolean:

is_it_raining = "precipitation/@mode = 'yes'"
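
Other XPath 1.0 operators and functions can be used the same way. The fragment below is a sketch that would sit inside the [[inputs.http]] block; the field names is_windy, temp_spread and wind_summary are hypothetical and not part of the configuration above, and I am assuming the parser's XPath library handles these expressions like any XPath 1.0 engine.

```toml
[[inputs.http.xml]]
  metric_selection = "/current"

  [inputs.http.xml.fields]
    ## Boolean: true when the reported wind speed is above 10.
    is_windy = "number(wind/speed/@value) > 10"
    ## Number: difference between the reported max and min temperature.
    temp_spread = "number(temperature/@max) - number(temperature/@min)"
    ## String: wind direction code and wind description joined by a space.
    wind_summary = "concat(wind/direction/@code, ' ', wind/speed/@name)"
```

For the sample document shown earlier, these would evaluate to true, roughly 3.4, and "SSW Moderate breeze" under standard XPath 1.0 rules.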

Weather output:

weather,city=New\ York,country=US clouds=1i,clouds_desc="clear sky",humidity=38i,is_it_raining=false,lat=40.7143,lon=-74.006,precipitation=0,temperature=58.15,wind_desc="Gentle Breeze",wind_speed=8.05 1617128228000000000
weather,city=London,country=GB clouds=0i,clouds_desc="clear sky",humidity=24i,is_it_raining=false,lat=51.5085,lon=-0.1257,precipitation=0,temperature=66.56,wind_desc="Light breeze",wind_speed=5.75 1617128914000000000
weather,city=Oakland,country=US clouds=90i,clouds_desc="overcast clouds",humidity=34i,is_it_raining=false,lat=42.6667,lon=-83.3999,precipitation=0,temperature=64.54,wind_desc="Moderate breeze",wind_speed=17.27 1617128758000000000

Multi-node selection example: COVID-19 Vaccine Distribution Allocations by Jurisdiction

Your XML data will commonly contain similar metrics for multiple sections (each section could be a different device; in this example, each section represents a different jurisdiction). You can use the XML Parser for multi-node selection to generate metrics for each chunk of data.

Considering this blog is being written during spring 2021, there is plenty of COVID-19 data out there. To stay somewhat optimistic, let’s take a look at some COVID-19 vaccine XML data provided by the Centers for Disease Control and Prevention (CDC). The CDC provides weekly allocations of vaccines by jurisdiction. There is an XML file available over HTTP for each vaccine manufacturer: Moderna, Pfizer or Janssen/Johnson & Johnson. Each vaccine has its own personality type too!

This COVID vaccine XML data is a good example of how to do multi-node selection with the XML parser.

<response>
   <row>
      <row _id="row-vuan~mg8h_vwjk" _uuid="00000000-0000-0000-9614-D811B3DD0141" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-vuan~mg8h_vwjk">
         <jurisdiction>Connecticut</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>50310</_1st_dose_allocations>
         <_2nd_dose_allocations>50310</_2nd_dose_allocations>
      </row>
      <row _id="row-suay.uwx5_hiiz" _uuid="00000000-0000-0000-C448-E7F5D3B8E3CA" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-suay.uwx5_hiiz">
         <jurisdiction>Maine</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>19890</_1st_dose_allocations>
         <_2nd_dose_allocations>19890</_2nd_dose_allocations>
      </row>
      <row _id="row-dhdq_gsf8~rzrd" _uuid="00000000-0000-0000-6882-622E1430CDFA" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-dhdq_gsf8~rzrd">
         <jurisdiction>Massachusetts</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>95940</_1st_dose_allocations>
         <_2nd_dose_allocations>95940</_2nd_dose_allocations>
      </row>
      <row _id="row-jehx-8sxy_8dma" _uuid="00000000-0000-0000-56CD-DCA4760B56BC" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-jehx-8sxy_8dma">
         <jurisdiction>New York</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>153270</_1st_dose_allocations>
         <_2nd_dose_allocations>153270</_2nd_dose_allocations>
      </row>
      <row _id="row-chrx-6f37~qbn9" _uuid="00000000-0000-0000-30C3-4B8A23B1DF14" _position="0" _address="https://data.cdc.gov/resource/saz5-9hgg/row-chrx-6f37~qbn9">
         <jurisdiction>New York City</jurisdiction>
         <week_of_allocations>2021-04-05T00:00:00</week_of_allocations>
         <_1st_dose_allocations>117000</_1st_dose_allocations>
         <_2nd_dose_allocations>117000</_2nd_dose_allocations>
      </row>
   </row>
</response>

The above snippet is taken from the CDC COVID-19 Vaccine Distribution Allocations by Jurisdiction - Pfizer dataset.

This multi-node dataset doesn’t have many child values for us to configure but has many parent subsections. We will use week_of_allocations as our timestamp, jurisdiction as a tag, and _1st_dose_allocations and _2nd_dose_allocations as fields. Even though the Janssen/Johnson & Johnson data doesn’t contain _2nd_dose_allocations (one and done), we do not need a separate configuration for it; the parser just won’t emit a field for it.

I added the processors.enum processor to my configuration. In the XML data itself there is no indicator besides the URL of which manufacturer the data belongs to. The enum processor I configured will add a tag with the manufacturer name for its corresponding URL.

Configuration:

[[inputs.http]]
  urls = [
    "https://data.cdc.gov/api/views/b7pe-5nws/rows.xml", # Moderna
    "https://data.cdc.gov/api/views/saz5-9hgg/rows.xml", # Pfizer
    "https://data.cdc.gov/api/views/w9zu-fywh/rows.xml" # Janssen/Johnson & Johnson

    ]
  data_format = "xml"
  ## Drop hostname from list of tags
  tagexclude = ["host"]

    [[inputs.http.xml]]
        metric_selection = "//row"
        metric_name = "'cdc-vaccines'"
        timestamp = "week_of_allocations"
        timestamp_format = "2006-01-02T15:04:05"

        [inputs.http.xml.tags]
            state   = "jurisdiction"

        [inputs.http.xml.fields_int]
            1st_dose_allocations = "_1st_dose_allocations"
            2nd_dose_allocations = "_2nd_dose_allocations"


[[processors.enum]]
  [[processors.enum.mapping]]
    ## Name of the tag to map. Globs accepted.
    tag = "url"

    ## Destination tag or field to be used for the mapped value.  By default the
    ## source tag or field is used, overwriting the original value.
    dest = "vaccine_type"

    ## Table of mappings
    [processors.enum.mapping.value_mappings]
      "https://data.cdc.gov/api/views/b7pe-5nws/rows.xml" = "Moderna"
      "https://data.cdc.gov/api/views/saz5-9hgg/rows.xml" = "Pfizer"
      "https://data.cdc.gov/api/views/w9zu-fywh/rows.xml" = "Janssen"

Output (snippet of output based on the sample of XML vaccine data above; the full configuration will produce a much larger output)

cdc-vaccines,state=Connecticut,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=60840i,2nd_dose_allocations=60840i 1617580800000000000
cdc-vaccines,state=Maine,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=23400i,2nd_dose_allocations=23400i 1617580800000000000
cdc-vaccines,state=Massachusetts,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=117000i,2nd_dose_allocations=117000i 1617580800000000000
cdc-vaccines,state=New\ York,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=188370i,2nd_dose_allocations=188370i 1617580800000000000
cdc-vaccines,state=New\ York\ City,url=https://data.cdc.gov/api/views/saz5-9hgg/rows.xml,vaccine_type=Pfizer 1st_dose_allocations=143910i,2nd_dose_allocations=143910i 1617580800000000000
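
If you only care about a few jurisdictions, an XPath predicate in metric_selection can narrow the node-set before any metrics are built. The fragment below is a sketch of that idea and not part of the configuration above. It also spells out the full path to the inner row elements, since in the sample they are wrapped in an outer row element that //row matches as well.

```toml
[[inputs.http.xml]]
  ## Inner <row> elements only, filtered to two jurisdictions by exact name.
  metric_selection = "/response/row/row[jurisdiction = 'Maine' or jurisdiction = 'New York']"
  metric_name = "'cdc-vaccines'"
  timestamp = "week_of_allocations"
  timestamp_format = "2006-01-02T15:04:05"

  [inputs.http.xml.tags]
    state = "jurisdiction"

  [inputs.http.xml.fields_int]
    1st_dose_allocations = "_1st_dose_allocations"
    2nd_dose_allocations = "_2nd_dose_allocations"
```

Because the comparison is an exact string match, the New York City rows are not selected by the 'New York' condition.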

Using field selectors for batch field processing (example: London bicycle data)

Your XML data will often contain metrics with so many fields that it would be tedious to configure each field in the [inputs.tail.xml.fields] sub-section. Also, your XML data might generate fields that are unknown during configuration. In these situations, you can use field selectors to parse these metrics.

For our example, we’ll use the London cycle hire data provided by Transport for London. The data contains the latest time the data was updated (lastUpdate) that we’ll use as our timestamp. The info nodes contain the bicycle station status information that we’ll use as our fields.

<stations lastUpdate="1617397861012" version="2.0">
</stations>
<response>
   <location id="1" name="River Street , Clerkenwell">
      <info terminalName="001023" />
      <info lat="51.52916347" />
      <info long="-0.109970527" />
      <info installDate="1278947280000" />
      <temporary>false</temporary>
      <info nbBikes="10" />
      <info nbEmptyDocks="9" />
      <info nbDocks="19" />
   </location>
   <location id="2" name="Phillimore Gardens, Kensington">
      <info terminalName="001018" />
      <info lat="51.49960695" />
      <info long="-0.197574246" />
      <info installDate="1278585780000" />
      <temporary>false</temporary>
      <info nbBikes="28" />
      <info nbEmptyDocks="9" />
      <info nbDocks="37" />
   </location>
   <location id="3" name="Christopher Street, Liverpool Street">
      <info terminalName="001012" />
      <info lat="51.52128377" />
      <info long="-0.084605692" />
      <info installDate="1278240360000" />
      <temporary>false</temporary>
      <info nbBikes="2" />
      <info nbEmptyDocks="30" />
      <info nbDocks="32" />
   </location>
</response>

In our configuration, we’ll still use the metric_selection option to select all location nodes. For each location we then use field_selection to select all info child nodes of the location as field-nodes. This field selection is relative to the selected nodes. For each selected field-node we configure field_name and field_value to determine the field’s name and value, respectively. The field_name query pulls the name of the first attribute of the node, while field_value pulls the value of the first attribute and converts the result to a number.

For our non-numerical fields, we can still use [inputs.tail.xml.fields] in conjunction with field_selection. Here we read the temporary node, which contains a string, as a field. Also, note that my timestamp is outside my metric_selection, so I had to make sure the XPath query to pull lastUpdate was an absolute path prefixed with /.

Configuration:

[[inputs.tail]]
  files = ["/pathname/london-cycle-for-hire.xml"]
  data_format = "xml"

  [[inputs.tail.xml]]
    metric_selection = "response/child::location"
    metric_name = "string('bikes')"

    timestamp = "/stations/@lastUpdate"
    timestamp_format = "unix_ms"

    field_selection = "child::info"
    field_name = "name(@*[1])"
    field_value = "number(@*[1])"

    [inputs.tail.xml.tags]
      address = "@name"
      id = "@id"

    [inputs.tail.xml.fields]
      placement = "string(temporary)"

Output:

bikes,address=River\ Street\ \,\ Clerkenwell,host=MBP15-SWANG.local,id=1 installDate=1278947280000,lat=51.52916347,long=-0.109970527,nbBikes=10,nbDocks=19,nbEmptyDocks=9,placement="false",terminalName=1023 1617397861000000000
bikes,address=Phillimore\ Gardens\,\ Kensington,host=MBP15-SWANG.local,id=2 installDate=1278585780000,lat=51.49960695,long=-0.197574246,nbBikes=28,nbDocks=37,nbEmptyDocks=9,placement="false",terminalName=1018 1617397861000000000
bikes,address=Christopher\ Street\,\ Liverpool\ Street,host=MBP15-SWANG.local,id=3 installDate=1278240360000,lat=51.52128377,long=-0.084605692,nbBikes=2,nbDocks=32,nbEmptyDocks=30,placement="false",terminalName=1012 1617397861000000000
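
Notice that terminalName lost its leading zeros in the output (001023 became 1023) because every info attribute is passed through number(). One possible variation, sketched here as my own suggestion and not as the configuration that produced the output above, is to exclude that node from field_selection with a predicate and keep it as a string tag. The tag name terminal is hypothetical.

```toml
[[inputs.tail]]
  files = ["/pathname/london-cycle-for-hire.xml"]
  data_format = "xml"

  [[inputs.tail.xml]]
    metric_selection = "response/child::location"
    metric_name = "string('bikes')"

    timestamp = "/stations/@lastUpdate"
    timestamp_format = "unix_ms"

    ## Skip the info node that holds terminalName so it is not converted to a number.
    field_selection = "child::info[not(@terminalName)]"
    field_name = "name(@*[1])"
    field_value = "number(@*[1])"

    [inputs.tail.xml.tags]
      address = "@name"
      id = "@id"
      ## Kept as a string, so an identifier such as 001023 is preserved as written.
      terminal = "info/@terminalName"

    [inputs.tail.xml.fields]
      placement = "string(temporary)"
```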

More examples

There is a folder of XML test cases in the Telegraf GitHub repository with more examples. If you think you have an example XML document + XML parser configuration that will be helpful to the community, please contribute a PR containing the documents.

Quick tips and other helpful resources

If you’re looking to do generic troubleshooting, be sure to set debug = true in your agent settings. For the *_selection settings, the parser will then walk up the nodes if the selection is empty and print how many children it found. This will help you see which part of the query could be causing the problem.
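
In the Telegraf configuration file, that agent setting looks like the fragment below. Note that debug is a boolean, so the value is written without quotes.

```toml
[agent]
  ## Log at debug level so the XML parser reports details about empty selections.
  debug = true
```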

An XPath tester like XPather or Code Beautify’s XPath Tester will be your best friend while configuring your XML parser to help you make sure you are selecting the proper XPath query for your data. It will make configuration a lot less frustrating when you can visibly see what nodes your XPath query is selecting.

A few syntax things to reiterate are that when you are setting up an XPath query in your configuration, the specified path can be absolute (starting with /) or relative. This will be important to remember if you are querying a node outside of your metric selection. If you don’t include the starting /, you’d end up querying a node in your selected metrics that may not exist.
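
The difference between absolute and relative paths can be sketched with the weather data. In this hypothetical variation, only the wind element is selected as the metric, so the queries for speed and direction are relative to it, while the timestamp and the city tag live outside the selection and need absolute paths. The metric name wind and the tag and field names are my own choices for illustration.

```toml
[[inputs.http]]
  urls = ["http://api.openweathermap.org/data/2.5/weather?q=Oakland&appid=$API_KEY&mode=xml&units=imperial"]
  data_format = "xml"

  [[inputs.http.xml]]
    ## Only the <wind> element becomes a metric.
    metric_selection = "/current/wind"
    metric_name = "'wind'"

    ## Absolute path: <lastupdate> is a sibling of <wind>, not a child.
    timestamp = "/current/lastupdate/@value"
    timestamp_format = "2006-01-02T15:04:05"

    [inputs.http.xml.tags]
      ## Absolute path: <city> is outside the selected node.
      city = "/current/city/@name"
      ## Relative path: <direction> is a child of <wind>.
      direction = "direction/@code"

    [inputs.http.xml.fields]
      ## Relative paths, resolved against the selected <wind> node.
      speed = "number(speed/@value)"
      gusts = "number(gusts/@value)"
```

Writing timestamp = "lastupdate/@value" here would look for lastupdate below wind and find nothing.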

Lastly, something I kept running into when querying an attribute (ex: lon or lat in <coord lon="-83.3999" lat="42.6667"/>) is forgetting to include the / before @. I would accidentally query current/city/coord@lat, which returns nothing, when the correct query is current/city/coord/@lat.

Here are some resources that will help you have a better understanding of the Telegraf XML Parser and XPath:

Incredibly massive shoutout to Sven Rebhan for building this plugin!

If you end up with any questions about parsing your XML data, please reach out to us (@Sven Rebhan if you’d like to chat with Sven specifically) in the #telegraf channel of our InfluxData Community Slack or post any questions on our Community Site.

Want to learn more about data acquisition through Telegraf? Register for free for InfluxDays EMEA to attend Jess Ingrassellino’s “Data Acquisition” talk covering Telegraf, CLI Integration to the cloud, and client libraries, on May 18, 2021.