Reading XML with Python

In an earlier post I wrote about reading CSV files with Python. This time let’s read some XML using only the Python standard library, specifically the ElementTree XML API.

For the demonstration I’ll use this short XML output from Nmap:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nmaprun>
<?xml-stylesheet href="file:///usr/bin/../share/nmap/nmap.xsl" type="text/xsl"?>
<!-- Nmap 7.70 scan initiated Sun Dec 12 15:47:50 2021 as: nmap -sn -PE -n -oX - 10.1.1.0/29 -->
<nmaprun scanner="nmap" args="nmap -sn -PE -n -oX - 10.1.1.0/29" start="1639316870" startstr="Sun Dec 12 15:47:50 2021" version="7.70" xmloutputversion="1.04">
    <verbose level="0"/>
    <debugging level="0"/>
    <host>
        <status state="up" reason="echo-reply" reason_ttl="64"/>
        <address addr="10.1.1.1" addrtype="ipv4"/>
        <hostnames>
        </hostnames>
        <times srtt="114" rttvar="5000" to="100000"/>
    </host>
    <host>
        <status state="up" reason="echo-reply" reason_ttl="64"/>
        <address addr="10.1.1.2" addrtype="ipv4"/>
        <hostnames>
        </hostnames>
        <times srtt="108" rttvar="5000" to="100000"/>
    </host>
    <host>
        <status state="up" reason="echo-reply" reason_ttl="64"/>
        <address addr="10.1.1.5" addrtype="ipv4"/>
        <hostnames>
        </hostnames>
        <times srtt="87" rttvar="5000" to="100000"/>
    </host>
    <runstats>
        <finished time="1639316870" timestr="Sun Dec 12 15:47:50 2021" elapsed="0.36" summary="Nmap done at Sun Dec 12 15:47:50 2021; 8 IP addresses (3 hosts up) scanned in 0.36 seconds" exit="success"/>
        <hosts up="3" down="5" total="8"/>
    </runstats>
</nmaprun>

Originally the Nmap output was not indented at all, but I copied it to VSCode and reformatted it to make it easier to read.

The structure of the data we are interested in:

  • At the top level there is the nmaprun element
  • Under nmaprun there are the host elements
  • Under each host element there are address and status elements
  • The address element has an addr attribute
  • The status element has a state attribute

In order to get the list of all addresses and states we need to:

  • Iterate over all the host elements
  • For each host element, get the addr attribute from the address element and the state attribute from the status element

First get the root element from the XML data:

>>> with open("nmapoutput.xml") as file:
...     xmldata = file.read()
...
>>> from xml.etree import ElementTree
>>> root = ElementTree.fromstring(xmldata)
>>> root
<Element 'nmaprun' at 0x7f10198b7458>
>>>

Instead of using ElementTree.fromstring() you can use ElementTree.parse() to read the data from the file:

>>> tree = ElementTree.parse("nmapoutput.xml")
>>> root = tree.getroot()
>>> root
<Element 'nmaprun' at 0x7f10198602c8>
>>>
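
To make the difference between the two entry points concrete, here is a small self-contained sketch. It assumes a shortened, made-up XML string (SAMPLE) instead of the nmapoutput.xml file, and it uses io.StringIO as a stand-in for a file on disk, since ElementTree.parse() accepts either a filename or a file object.

```python
import io
from xml.etree import ElementTree

# Shortened sample for illustration only (not the full Nmap output above).
SAMPLE = """<nmaprun scanner="nmap">
    <host>
        <status state="up"/>
        <address addr="10.1.1.1" addrtype="ipv4"/>
    </host>
</nmaprun>"""

# fromstring() takes the XML text and returns the root element directly.
root_from_string = ElementTree.fromstring(SAMPLE)

# parse() takes a filename or a file object and returns an ElementTree
# object, so the root element is fetched with getroot().
tree = ElementTree.parse(io.StringIO(SAMPLE))
root_from_file = tree.getroot()

print(root_from_string.tag, root_from_file.tag)
```

Either way the result is the same kind of root element, so everything that follows works with both approaches.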

The root of the data is the nmaprun element. The host elements are retrieved with the .findall() method of the root element, and then the address element is retrieved from the first host element:

>>> root.findall("host")
[<Element 'host' at 0x7f10198a3db8>, <Element 'host' at 0x7f1019853778>, <Element 'host' at 0x7f1019853688>]
>>> first_host = root.findall("host")[0]
>>> first_host.findall("address")
[<Element 'address' at 0x7f1019853868>]
>>> first_host.find("address")
<Element 'address' at 0x7f1019853868>
>>>

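.findall() always returns a list (empty when nothing matches), while .find() returns only the first matching element, or None when there is no match. Since each host here has exactly one address element and one status element, .find() is the convenient choice for retrieving them.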
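
The Nmap output above always has an address element under each host, but that may not hold for every XML file. The following is a minimal sketch of guarding against a missing element; the sample data, with one host lacking an address, is made up for illustration.

```python
from xml.etree import ElementTree

# Made-up sample: the second host has no <address> child.
SAMPLE = """<nmaprun>
    <host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>
    <host><status state="down"/></host>
</nmaprun>"""

root = ElementTree.fromstring(SAMPLE)

addresses = []
for host in root.findall("host"):
    address = host.find("address")
    # find() returns None when there is no matching child element,
    # so check before calling .get() on the result.
    if address is None:
        continue
    addresses.append(address.get("addr"))

# findall() returns an empty list (not None) when nothing matches.
no_matches = root.findall("port")

print(addresses, no_matches)
```

The check is written as "is None" on purpose: an element without child elements has historically evaluated as false, so a plain truth test on the result of .find() could skip elements that do exist.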

Now let’s retrieve the addr attribute using the .get() method:

>>> address = first_host.find("address")
>>> address.get("addr")
'10.1.1.1'
>>>

Similarly we can get the state attribute from the status element, or all the attributes as a dictionary with .attrib if needed:

>>> first_host.find("status").get("state")
'up'
>>> first_host.find("status").attrib
{'state': 'up', 'reason': 'echo-reply', 'reason_ttl': '64'}
>>>
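
A few details about .get() and .attrib are worth a short sketch. It parses just the status element from the output above; the attribute name "nosuch" and the fallback value "unknown" are made up for illustration.

```python
from xml.etree import ElementTree

status = ElementTree.fromstring(
    '<status state="up" reason="echo-reply" reason_ttl="64"/>'
)

state = status.get("state")                  # 'up'
missing = status.get("nosuch")               # None: no such attribute
fallback = status.get("nosuch", "unknown")   # 'unknown': explicit default

# Attribute values are always strings, so convert numbers explicitly.
ttl = int(status.get("reason_ttl"))

# .attrib is a regular dict; copy it if you intend to modify the result.
attributes = dict(status.attrib)

print(state, missing, fallback, ttl, attributes)
```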

To combine all of the above, we can print all the addresses and states like this:

>>> for host in root.findall("host"):
...     print(host.find("address").get("addr"), host.find("status").get("state"))
...
10.1.1.1 up
10.1.1.2 up
10.1.1.5 up
>>>

Usually there is a need to save the data in some Python structure for later work. Let’s do that in a Pythonic way:

>>> addrs = {host.find("address").get("addr"): host.find("status").get("state") for host in root.findall("host")}
>>> addrs
{'10.1.1.1': 'up', '10.1.1.2': 'up', '10.1.1.5': 'up'}
>>>

That was a dictionary comprehension. To make it easier to read, it can also be split over several lines:

>>> addrs = {
...     host.find("address").get("addr"): host.find("status").get("state")
...     for host in root.findall("host")
... }
>>> addrs
{'10.1.1.1': 'up', '10.1.1.2': 'up', '10.1.1.5': 'up'}
>>>

If only the list of “up” addresses is needed, a list comprehension can be used:

>>> [
...     host.find("address").get("addr")
...     for host in root.findall("host")
...     if host.find("status").get("state") == "up"
... ]
['10.1.1.1', '10.1.1.2', '10.1.1.5']
>>>
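
When no filtering by state is needed, the per-host .find() calls can be avoided, because .findall() also accepts a simple path and .iter() walks the whole tree. Here is a sketch under the assumption of a made-up sample in which one host is down, so that the filtered and unfiltered results differ:

```python
from xml.etree import ElementTree

# Made-up sample for illustration; the third host is marked as down.
SAMPLE = """<nmaprun>
    <host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>
    <host><status state="up"/><address addr="10.1.1.2" addrtype="ipv4"/></host>
    <host><status state="down"/><address addr="10.1.1.3" addrtype="ipv4"/></host>
</nmaprun>"""

root = ElementTree.fromstring(SAMPLE)

# "host/address" selects every address element that is a child of
# a host element directly under the root.
all_addrs = [address.get("addr") for address in root.findall("host/address")]

# iter() walks the whole tree and yields every element with the given tag.
all_addrs_iter = [address.get("addr") for address in root.iter("address")]

# Filtering on the sibling status element still uses the per-host loop.
up_addrs = [
    host.find("address").get("addr")
    for host in root.findall("host")
    if host.find("status").get("state") == "up"
]

print(all_addrs, up_addrs)
```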

To summarize reading XML data with the ElementTree XML API:

  • The root element is obtained with ElementTree.fromstring() or ElementTree.parse().getroot()
  • XML elements are retrieved using .findall() (when a list of elements is expected/possible) or .find() (when a single element is expected)
  • Attributes of the elements are retrieved using .get() or .attrib
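
The three points above can be put together in one small function. This is only a sketch: the function name host_states and the shortened sample data are my own choices for illustration, and hosts that lack an address or status element are simply skipped.

```python
from xml.etree import ElementTree


def host_states(xml_text):
    """Return an {address: state} dict for every <host> in the XML text."""
    root = ElementTree.fromstring(xml_text)
    result = {}
    for host in root.findall("host"):
        address = host.find("address")
        status = host.find("status")
        # Skip hosts that lack either element instead of failing on None.
        if address is None or status is None:
            continue
        result[address.get("addr")] = status.get("state")
    return result


# Shortened, made-up sample in the same shape as the Nmap output above.
SAMPLE = """<nmaprun>
    <host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>
    <host><status state="up"/><address addr="10.1.1.2" addrtype="ipv4"/></host>
    <host><status state="up"/></host>
</nmaprun>"""

states = host_states(SAMPLE)
print(states)
```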
