I want to iterate over the tags at one specific nesting level of my XML.
For example, I would like to iterate over the top-level elements:
<stage1tag>
<child1tag>bla</child1tag>
<child2tag>blabla</child2tag>
<child3tag><stage2tag>heyho</stage2tag></child3tag></stage1tag>
<stage1tag2>
<stage1tag>
<child1tag>…
...
I only want to iterate over the tags at level 1 (stage1tag and stage1tag2). In my real XML they are not called child...tag and stage...tag; those names are only for readability. How can I get the top-level tags? I am looking for something like:
elems = mytree.getlevel(0) #toplevel
for child in elems.iter():
    # do something with the children...
This is one possible solution. I have not tested it extensively, but it should give you an idea of how to approach this kind of problem.
import re
txt = \
'''
<stage1tag>
<child1tag>bla</child1tag>
<child2tag>blabla</child2tag>
<child3tag><stage2tag>heyho</stage2tag></child3tag></stage1tag>
<stage1tag2>
<stage1tag>
<child1tag>
'''
# 1: find all tags
rg = re.compile(r'(<[^>]+>)')  # anything between '<' and '>'
tags = rg.findall(txt)
# 2: determine the level of each tag
lvl = 1  # starting level
for t in tags:
    if '</' not in t:  # opening tag: report it, then go one level deeper
        k = t[1:-1]
        print(k, ':', lvl)
        lvl += 1
    else:  # closing tag: go back up one level
        lvl -= 1
It prints out:
stage1tag : 1
child1tag : 2
child2tag : 2
child3tag : 2
stage2tag : 3
stage1tag2 : 1
stage1tag : 2
child1tag : 3
That is correct given your XML.
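To get only the tags at one level, the same counting idea can be wrapped in a function. This is a minimal sketch rather than a robust parser: the name tags_at_level is hypothetical, and it assumes the text contains no CDATA sections and no '>' characters inside attribute values. Unlike the loop above, it also treats self-closing tags (such as <x/>) as not opening a new level.

```python
import re

def tags_at_level(text, level=1):
    """Return the names of the tags opened at the given nesting level."""
    names = []
    depth = 1  # same convention as above: the outermost tags are level 1
    for tag in re.findall(r'<[^>]+>', text):
        if tag.startswith('<?') or tag.startswith('<!'):
            continue  # skip declarations, comments and doctype
        if tag.startswith('</'):
            depth -= 1  # closing tag: leave the current level
        elif tag.endswith('/>'):
            if depth == level:
                names.append(tag[1:-2].split()[0])  # self-closing: depth unchanged
        else:
            if depth == level:
                names.append(tag[1:-1].split()[0])  # drop any attributes
            depth += 1  # opening tag: go one level deeper
    return names

sample = '''
<stage1tag>
<child1tag>bla</child1tag>
<child2tag>blabla</child2tag>
<child3tag><stage2tag>heyho</stage2tag></child3tag></stage1tag>
<stage1tag2>
<stage1tag>
<child1tag>
'''

print(tags_at_level(sample, 1))  # ['stage1tag', 'stage1tag2']
```

For well-formed input a real XML parser is the safer choice; the regex approach is mainly useful when the text is truncated or otherwise not parseable.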
I assume you have a root element - otherwise the parser will choke with something like "XMLSyntaxError: Extra content at the end of the document". If you lack a root element, just add one:
data = """<root>
<stage1tag id="1">
<child1tag>bla</child1tag>
<child2tag>blabla</child2tag>
<child3tag><stage2tag>heyho</stage2tag></child3tag>
</stage1tag>
<stage1tag id="2">
<child1tag>bla</child1tag>
<child2tag>blabla</child2tag>
<child3tag><stage2tag>heyho</stage2tag></child3tag>
</stage1tag>
</root>
"""
You can use lxml:
>>> import lxml.etree
>>> root = lxml.etree.fromstring(data)
>>> root.getchildren()
[<Element stage1tag at 0x3bf6530>, <Element stage1tag at 0x3bfb7d8>]
>>> for tag in root.getchildren():
...     print(tag.attrib.get('id'))
...
1
2
If your document lacks a root element, I don't think you can call it XML; you have something resembling XML (see "Do you always have to have a root node with xml/xsd?").
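If lxml is not available, the same thing can be sketched with the standard library's xml.etree.ElementTree, where iterating over an element yields only its direct children. The helper name top_level_elements and the wrapper tag "root" are assumptions made for illustration, and the sketch assumes the fragment has no XML declaration of its own (wrapping one inside another element would not parse).

```python
import xml.etree.ElementTree as ET

# A fragment without a single root element, similar to the one in the question.
fragment = """
<stage1tag id="1">
  <child1tag>bla</child1tag>
  <child3tag><stage2tag>heyho</stage2tag></child3tag>
</stage1tag>
<stage1tag2 id="2">
  <child1tag>bla</child1tag>
</stage1tag2>
"""

def top_level_elements(xml_fragment, wrapper="root"):
    """Wrap the fragment in a synthetic root and return its direct children."""
    root = ET.fromstring("<{0}>{1}</{0}>".format(wrapper, xml_fragment))
    return list(root)  # iterating an Element yields direct children only

for elem in top_level_elements(fragment):
    print(elem.tag, elem.attrib.get("id"))
# stage1tag 1
# stage1tag2 2
```

Nested elements such as stage2tag are not returned, because only the children of the synthetic root are listed; calling iter() on the root instead would walk the whole tree.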