I am trying to parse an XML file sequentially, considering only the XML tags that are of interest. A sample XML file is shown below (stored as file.xml). I am only interested in certain XML tags at known paths, as shown in the Python code snippet below (e.g. header/para/paratext, body/section/intro/text). Different XML files might order the tags differently, so I do not want to prescribe the order in which my known tags occur. Any suggestions on how to do this efficiently without having to loop through the whole XML file?
XML file
<data>
<header>
<para>
<paratext>0 - extract this</paratext>
</para>
</header>
<body>
<section>
<intro>
<text>1 - extract this</text>
</intro>
<para>
<paratext>2 - extract this</paratext>
</para>
<items>
<paratext>do not extract this</paratext>
<part>
<para>
<paratext>3 - extract this</paratext>
</para>
</part>
</items>
</section>
<section>
<text>do not extract this</text>
<intro>
<text>4 - extract this</text>
</intro>
<para>
<paratext>5 - extract this</paratext>
</para>
<para>
<paratext>6 - extract this</paratext>
</para>
</section>
</body>
</data>
Desired output: ['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']
Sample Python script:
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
### Paths I would like to extract (but sequentially)
[i.text for i in root.findall('header/para/paratext')]
# ['0 - extract this']
[i.text for i in root.findall('body/section/intro/text')]
# ['1 - extract this', '4 - extract this']
[i.text for i in root.findall('body/section/para/paratext')]
# ['2 - extract this', '5 - extract this', '6 - extract this']
[i.text for i in root.findall('body/section/items/part/para/paratext')]
# ['3 - extract this']
I think the best way to do this is to use the union operator ("|") in XPath. That will select the desired elements in document order.
Unfortunately, ElementTree has limited XPath support.
If you can use lxml, it has much better XPath support.
Example...
Python
from lxml import etree
tree = etree.parse("file.xml")
print([i.text for i in tree.xpath('header/para/paratext|'
'body/section/intro/text|'
'body/section/para/paratext|'
'body/section/items/part/para/paratext')])
Printed Output
['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']
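If lxml is not an option, a similar result can probably be had with the standard library alone. The sketch below is not part of the original answer: it runs each path through ElementTree's findall, then sorts the combined matches by their position in root.iter(), which walks the tree in document order. The helper name extract_in_document_order and the inlined copy of the sample XML are assumptions made for illustration. Note that it still walks the whole tree once to build the position index.

```python
import xml.etree.ElementTree as ET

# Inlined copy of the sample file.xml from the question, so the sketch runs on its own.
XML = """<data>
<header><para><paratext>0 - extract this</paratext></para></header>
<body>
<section>
<intro><text>1 - extract this</text></intro>
<para><paratext>2 - extract this</paratext></para>
<items>
<paratext>do not extract this</paratext>
<part><para><paratext>3 - extract this</paratext></para></part>
</items>
</section>
<section>
<text>do not extract this</text>
<intro><text>4 - extract this</text></intro>
<para><paratext>5 - extract this</paratext></para>
<para><paratext>6 - extract this</paratext></para>
</section>
</body>
</data>"""

PATHS = [
    "header/para/paratext",
    "body/section/intro/text",
    "body/section/para/paratext",
    "body/section/items/part/para/paratext",
]


def extract_in_document_order(root, paths):
    """Hypothetical helper: text of all elements matching any path, in document order."""
    # root.iter() yields elements in document order, so the index is the sort key.
    position = {elem: pos for pos, elem in enumerate(root.iter())}
    # A set removes duplicates in case two paths match the same element.
    matches = {elem for path in paths for elem in root.findall(path)}
    return [elem.text for elem in sorted(matches, key=position.__getitem__)]


root = ET.fromstring(XML)  # for a file: ET.parse("file.xml").getroot()
result = extract_in_document_order(root, PATHS)
print(result)
```

For the sample file this should give the same seven strings as the lxml version, in the same order.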