In an earlier post I wrote about reading CSV files with Python. This time let’s read some XML using just the Python standard library, specifically the ElementTree XML API.
For the demonstration I’ll use this short XML output from Nmap:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nmaprun>
<?xml-stylesheet href="file:///usr/bin/../share/nmap/nmap.xsl" type="text/xsl"?>
<!-- Nmap 7.70 scan initiated Sun Dec 12 15:47:50 2021 as: nmap -sn -PE -n -oX - 10.1.1.0/29 -->
<nmaprun scanner="nmap" args="nmap -sn -PE -n -oX - 10.1.1.0/29" start="1639316870" startstr="Sun Dec 12 15:47:50 2021" version="7.70" xmloutputversion="1.04">
  <verbose level="0"/>
  <debugging level="0"/>
  <host>
    <status state="up" reason="echo-reply" reason_ttl="64"/>
    <address addr="10.1.1.1" addrtype="ipv4"/>
    <hostnames>
    </hostnames>
    <times srtt="114" rttvar="5000" to="100000"/>
  </host>
  <host>
    <status state="up" reason="echo-reply" reason_ttl="64"/>
    <address addr="10.1.1.2" addrtype="ipv4"/>
    <hostnames>
    </hostnames>
    <times srtt="108" rttvar="5000" to="100000"/>
  </host>
  <host>
    <status state="up" reason="echo-reply" reason_ttl="64"/>
    <address addr="10.1.1.5" addrtype="ipv4"/>
    <hostnames>
    </hostnames>
    <times srtt="87" rttvar="5000" to="100000"/>
  </host>
  <runstats>
    <finished time="1639316870" timestr="Sun Dec 12 15:47:50 2021" elapsed="0.36" summary="Nmap done at Sun Dec 12 15:47:50 2021; 8 IP addresses (3 hosts up) scanned in 0.36 seconds" exit="success"/>
    <hosts up="3" down="5" total="8"/>
  </runstats>
</nmaprun>
Originally the Nmap output was not indented at all but I copied it to VSCode and reformatted it for better visualization.
The structure of the data we are interested in:
- At the top level there is the nmaprun element
- Under nmaprun there are the host elements
- Under each host element there are address and status elements
- The address element has an addr attribute
- The status element has a state attribute
In order to get the list of all addresses and states we need to:
- Iterate over all the host elements
- For each host element, get the addr attribute from the address element and the state attribute from the status element
First get the root element from the XML data:
>>> with open("nmapoutput.xml") as file:
...     xmldata = file.read()
...
>>> from xml.etree import ElementTree
>>> root = ElementTree.fromstring(xmldata)
>>> root
<Element 'nmaprun' at 0x7f10198b7458>
Instead of using ElementTree.fromstring() you can use ElementTree.parse() to read the data from the file:
>>> tree = ElementTree.parse("nmapoutput.xml")
>>> root = tree.getroot()
>>> root
<Element 'nmaprun' at 0x7f10198602c8>
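As a quick check that the two approaches end up in the same place, here is a minimal sketch that parses the same document both ways. The SAMPLE string is a made-up, shortened stand-in for the Nmap output, and io.StringIO is used instead of a real file only so that the snippet runs on its own (ElementTree.parse() accepts a file object as well as a filename).

```python
import io
from xml.etree import ElementTree

# Hypothetical, shortened stand-in for the Nmap output used in this post.
SAMPLE = """<nmaprun scanner="nmap">
  <host>
    <status state="up"/>
    <address addr="10.1.1.1" addrtype="ipv4"/>
  </host>
</nmaprun>"""

# fromstring() takes the XML text and returns the root element directly.
root_from_string = ElementTree.fromstring(SAMPLE)

# parse() takes a filename or file object and returns an ElementTree,
# so the root element needs one extra call to getroot().
root_from_file = ElementTree.parse(io.StringIO(SAMPLE)).getroot()

print(root_from_string.tag, root_from_file.tag)
```

Which one to use mostly depends on where the XML comes from: fromstring() is handy when the text is already in memory, parse() when it sits in a file.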
The root of the data is the nmaprun element. The host elements are retrieved with the .findall() method of the root element, and then the address element is retrieved from the first host element:
>>> root.findall("host")
[<Element 'host' at 0x7f10198a3db8>, <Element 'host' at 0x7f1019853778>, <Element 'host' at 0x7f1019853688>]
>>> first_host = root.findall("host")[0]
>>> first_host.findall("address")
[<Element 'address' at 0x7f1019853868>]
>>> first_host.find("address")
<Element 'address' at 0x7f1019853868>
.findall() always returns a list while .find() returns only the first matching element, so .find() is convenient when retrieving the address and status elements.
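One more difference is worth knowing when choosing between the two: when nothing matches, .findall() returns an empty list while .find() returns None. A minimal sketch, using a made-up host element and a child name ("ports") that does not exist in it:

```python
from xml.etree import ElementTree

# Hypothetical host element, shortened from the Nmap output above.
host = ElementTree.fromstring(
    '<host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>'
)

# No <ports> child exists here, so the two methods differ in what they return.
missing_all = host.findall("ports")  # empty list
missing_one = host.find("ports")     # None

# Check for None explicitly before calling .get() on the result.
address = host.find("address")
addr = address.get("addr") if address is not None else None

print(missing_all, missing_one, addr)
```

The explicit "is not None" comparison is deliberate: an element with no children has traditionally tested as false, so a plain "if address:" could skip a perfectly valid element.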
Now let’s retrieve the addr attribute using the .get() method:
>>> address = first_host.find("address")
>>> address.get("addr")
'10.1.1.1'
Similarly we can get the state attribute from the status element, or all the attributes as a dictionary with .attrib if needed:
>>> first_host.find("status").get("state")
'up'
>>> first_host.find("status").attrib
{'state': 'up', 'reason': 'echo-reply', 'reason_ttl': '64'}
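Two details are easy to miss here: attribute values always come back as strings, and .get() accepts a default value for attributes that are absent. A small sketch with a standalone copy of the status element (the attribute name "no_such_attribute", the "unknown" default and the variable names are only for illustration):

```python
from xml.etree import ElementTree

status = ElementTree.fromstring(
    '<status state="up" reason="echo-reply" reason_ttl="64"/>'
)

# Attribute values are strings, so convert explicitly when a number is needed.
ttl = int(status.get("reason_ttl"))

# .get() returns None for a missing attribute, or the given default.
missing = status.get("no_such_attribute")
with_default = status.get("no_such_attribute", "unknown")

# .attrib is a dict; copy it to keep the values independent of the tree.
attributes = dict(status.attrib)

print(ttl, missing, with_default, attributes)
```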
To combine all of the above, we can print all the addresses and states like this:
>>> for host in root.findall("host"):
...     print(host.find("address").get("addr"), host.find("status").get("state"))
...
10.1.1.1 up
10.1.1.2 up
10.1.1.5 up
Usually there is a need to save the data in some Python structure for later work. Let’s do that in a Pythonic way:
>>> addrs = {host.find("address").get("addr"): host.find("status").get("state") for host in root.findall("host")}
>>> addrs
{'10.1.1.1': 'up', '10.1.1.2': 'up', '10.1.1.5': 'up'}
That was a dictionary comprehension. It can also be split over several lines for readability:
>>> addrs = {
...     host.find("address").get("addr"): host.find("status").get("state")
...     for host in root.findall("host")
... }
>>> addrs
{'10.1.1.1': 'up', '10.1.1.2': 'up', '10.1.1.5': 'up'}
If only the list of “up” addresses is needed, a list comprehension can be used:
>>> [
...     host.find("address").get("addr")
...     for host in root.findall("host")
...     if host.find("status").get("state") == "up"
... ]
['10.1.1.1', '10.1.1.2', '10.1.1.5']
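The Nmap output also carries its own summary in the hosts element under runstats, which makes a handy sanity check for a list like this. .find() accepts a simple path such as "runstats/hosts", so the nested element can be reached in one call. A sketch, assuming the XML has the same shape as the sample in this post (the SAMPLE string is a trimmed copy of it):

```python
from xml.etree import ElementTree

# Trimmed copy of the sample output; only the parts used below are kept.
SAMPLE = """<nmaprun>
  <host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>
  <host><status state="up"/><address addr="10.1.1.2" addrtype="ipv4"/></host>
  <host><status state="up"/><address addr="10.1.1.5" addrtype="ipv4"/></host>
  <runstats><hosts up="3" down="5" total="8"/></runstats>
</nmaprun>"""

root = ElementTree.fromstring(SAMPLE)

up_addresses = [
    host.find("address").get("addr")
    for host in root.findall("host")
    if host.find("status").get("state") == "up"
]

# "runstats/hosts" is a path: the hosts child of the runstats child of root.
reported_up = int(root.find("runstats/hosts").get("up"))

print(len(up_addresses) == reported_up)
```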
To conclude, reading XML data with the ElementTree XML API goes like this:
- The element tree root is set with ElementTree.fromstring() or ElementTree.parse().getroot()
- XML elements are retrieved using .findall() (when a list of elements is expected/possible) or .find() (when a single element is expected)
- Attributes of the elements are retrieved using .get() or .attrib
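Put together, those three steps could be wrapped in a small helper. This is only a sketch: the function name host_states and the decision to skip hosts that lack an address or status element are my own assumptions, not something Nmap or ElementTree prescribes.

```python
from xml.etree import ElementTree


def host_states(xml_text):
    """Return {address: state} for every host that has both elements."""
    root = ElementTree.fromstring(xml_text)
    states = {}
    for host in root.findall("host"):
        address = host.find("address")
        status = host.find("status")
        # Skip hosts that lack either element instead of raising AttributeError.
        if address is None or status is None:
            continue
        states[address.get("addr")] = status.get("state")
    return states


# Hypothetical input: the second host has no address element.
SAMPLE = """<nmaprun>
  <host><status state="up"/><address addr="10.1.1.1" addrtype="ipv4"/></host>
  <host><status state="down"/></host>
</nmaprun>"""

print(host_states(SAMPLE))
```

Compared to the one-line comprehension, the explicit loop is longer but does not fail when an element is missing from the input.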
References to Python documentation:
- xml.etree.ElementTree
- Data Structures for more information about list comprehensions and dict comprehensions