How to recursively iterate over XML tags in Python using ElementTree?
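I am trying to iterate over all the nodes in a tree using ElementTree.
I do something like:
import xml.etree.ElementTree as ET

tree = ET.parse("/tmp/test.xml")
root = tree.getroot()
for child in root:
    pass  # do something with child
The problem is that child is an Element object, not an ElementTree object, so I can't look further into it and recurse over its elements. Is there a way to iterate differently over the "root" so that it iterates over the top-level nodes in the tree (the immediate children) and returns the same class as the root itself?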
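To iterate over all nodes, use the iter method on the ElementTree, not the root element.
The root is an Element like any other element in the tree: it only has the context of its own attributes and children. The ElementTree has the context for all elements.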
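As a side note, Element objects are themselves iterable (yielding their direct children) and also provide an iter() method that walks their whole subtree, so recursion from the root is possible as well. A minimal sketch, assuming a small made-up document; the XML string and the walk helper are illustrative and not part of the answer:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
XML = "<data><country name='A'><rank>1</rank></country><country name='B'/></data>"

root = ET.fromstring(XML)  # fromstring returns an Element, not an ElementTree

# Direct children only: iterating an Element yields its immediate children.
children = [child.tag for child in root]

# Whole subtree: Element.iter() yields the element itself and all descendants.
all_tags = [elem.tag for elem in root.iter()]

def walk(elem, depth=0):
    """Recursively yield (depth, element) pairs; every child is itself an Element."""
    yield depth, elem
    for child in elem:
        yield from walk(child, depth + 1)

for depth, elem in walk(root):
    print("  " * depth + elem.tag)
```

Because every child is the same Element class as the root, the same function can be applied at any level of the tree.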
For example, given this XML:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
you can do the following:
>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> for elem in tree.iter():
...     print(elem)
...
<Element 'data' at 0x10b2d7b50>
<Element 'country' at 0x10b2d7b90>
<Element 'rank' at 0x10b2d7bd0>
<Element 'year' at 0x10b2d7c50>
<Element 'gdppc' at 0x10b2d7d10>
<Element 'neighbor' at 0x10b2d7e90>
<Element 'neighbor' at 0x10b2d7ed0>
<Element 'country' at 0x10b2d7f10>
<Element 'rank' at 0x10b2d7f50>
<Element 'year' at 0x10b2d7f90>
<Element 'gdppc' at 0x10b2d7fd0>
<Element 'neighbor' at 0x10b2db050>
<Element 'country' at 0x10b2db090>
<Element 'rank' at 0x10b2db0d0>
<Element 'year' at 0x10b2db110>
<Element 'gdppc' at 0x10b2db150>
<Element 'neighbor' at 0x10b2db190>
<Element 'neighbor' at 0x10b2db1d0>
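If only some elements are of interest, iter() also accepts a tag name and then yields just the matching elements at any depth. A minimal sketch, assuming a shortened copy of the sample document inlined as a string so the snippet runs on its own:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document, inlined so the snippet is standalone.
XML = """<data>
  <country name="Liechtenstein">
    <neighbor name="Austria" direction="E"/>
    <neighbor name="Switzerland" direction="W"/>
  </country>
  <country name="Singapore">
    <neighbor name="Malaysia" direction="N"/>
  </country>
</data>"""

tree = ET.ElementTree(ET.fromstring(XML))

# iter(tag) yields only elements with that tag, in document order.
neighbors = [(n.get("name"), n.get("direction")) for n in tree.iter("neighbor")]
print(neighbors)
```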
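Adding to Robert Christie's answer: when parsing with fromstring(), you can iterate over all nodes by wrapping the returned Element in an ElementTree:
import xml.etree.ElementTree as ET

# xml_string is assumed to hold the XML document as a str
e = ET.ElementTree(ET.fromstring(xml_string))
for elt in e.iter():
    print("%s: '%s'" % (elt.tag, elt.text))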
You can also access specific elements like this:
country = tree.findall('.//country')
Then loop over range(len(country)) and access each match by index.
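Since findall() returns a plain list of Element objects, it can also be looped over directly instead of by index. A minimal sketch, assuming a trimmed-down copy of the sample document; the ranks dictionary is just an illustration of what one might collect:

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the sample document, inlined for a standalone run.
XML = """<data>
  <country name="Liechtenstein"><rank>1</rank></country>
  <country name="Singapore"><rank>4</rank></country>
  <country name="Panama"><rank>68</rank></country>
</data>"""

tree = ET.ElementTree(ET.fromstring(XML))

# findall() returns a list of matching Element objects.
countries = tree.findall(".//country")

# Each match is an Element, so attributes and children are directly available.
ranks = {c.get("name"): int(c.findtext("rank")) for c in countries}
print(ranks)
```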
In addition to Robert Christie's accepted answer, printing the tags and values separately is easy:
tree = ET.parse('test.xml')
for elem in tree.iter():
    print(elem.tag, elem.text)
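One thing to watch for when printing elem.text: for an element that only contains child elements, text is usually just the whitespace between tags, and for an empty element it is None. A minimal sketch of normalizing this, assuming a small inlined document; the rows list is an illustrative name:

```python
import xml.etree.ElementTree as ET

# Small inlined document, assumed here only for illustration.
XML = """<data>
  <country name="Panama">
    <rank>68</rank>
    <neighbor name="Colombia" direction="E"/>
  </country>
</data>"""

root = ET.fromstring(XML)

rows = []
for elem in root.iter():
    # text may be None (empty element) or pure whitespace (container element)
    text = (elem.text or "").strip()
    rows.append((elem.tag, text))

print(rows)
```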
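While iter() is very nice, I needed a way to walk an XML hierarchy while keeping track of the nesting level, and iter() doesn't help with that at all. I wanted something like iterparse(), which emits start and end events at each level of the hierarchy, but I already had the ElementTree, so I didn't want the unnecessary step and overhead of converting to a string and re-parsing that using iterparse() would require.
Surprised that I couldn't find this, I had to write it myself: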
def iterwalk(root, events=None, tags=None):
    """Incrementally walks an XML structure (like iterparse, but for an existing ElementTree structure).

    Returns a generator function; calling it gives an iterator of (event, elem) pairs.
    Events are "start" and "end".
    events is a list of events to emit; defaults to ["start", "end"].
    tags is a single tag or a list of tags to emit events for; if empty/None, events are generated for all tags.
    """
    # Each stack entry is a two-item list: the XML element and a second item that is initially None.
    # If the second item is None, a start is emitted and all children of the current element are put into the second item.
    # If the second item is a non-empty list, the first item in it is popped and a new stack entry is created for it.
    # Once the second item is an empty list, an end is generated and the stack is popped.
    stack = [[root, None]]
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start", "end"]

    def iterator():
        while stack:
            elnow, children = stack[-1]
            if children is None:
                # this is the start of elnow, so emit a start and put its children into the stack entry
                if (not tags or elnow.tag in tags) and "start" in events:
                    yield ("start", elnow)
                # put the children into the top stack entry
                stack[-1][1] = list(elnow)
            elif len(children) > 0:
                # take the next child and remove it from the list
                thischild = children.pop(0)
                # and now create a new stack entry for this child
                stack.append([thischild, None])
            else:
                # finished these children, so emit the end
                if (not tags or elnow.tag in tags) and "end" in events:
                    yield ("end", elnow)
                stack.pop()

    return iterator
# myxml is my parsed XML, which has nested Binding tags; I want to count the depth of nesting.
# Now explore the structure
it = iterwalk(myxml, tags='Binding')
level = 0
for event, el in it():
    if event == "start":
        level += 1
        print(f"{level} {el.tag=}")
    if event == "end":
        level -= 1
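The stack is used so that start events can be emitted on the way down the hierarchy and the walk can then backtrack correctly. The last entry on the stack is initially [el, None], so the start event for el is emitted and the entry is updated to [el, children]. Each child is removed from children as it is entered, until, after the last child has been processed, the entry is [el, []]; at that point the end event for el is emitted and the top entry is removed from the stack.
I did it this way with a stack because I don't like debugging recursive code, and in any case I wasn't sure how to write a recursive iterator function.
Here is a recursive version that is easier to understand, though it would be hard to debug if it weren't so simple and something went wrong. It uses yield from: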
def iterwalk1(root, events=None, tags=None):
    """Recursive version: incrementally walks an XML structure (like iterparse, but for an existing ElementTree structure).

    Returns a generator function; calling it gives an iterator of (event, elem) pairs.
    Events are "start" and "end".
    events is a list of events to emit; defaults to ["start", "end"].
    tags is a single tag or a list of tags to emit events for; if None or an empty list, events are generated for all tags.
    """
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start", "end"]

    def recursiveiterator(el, suppressyield=False):
        if not suppressyield and (not tags or el.tag in tags) and "start" in events:
            yield ("start", el)
        for child in list(el):
            yield from recursiveiterator(child)
        if not suppressyield and (not tags or el.tag in tags) and "end" in events:
            yield ("end", el)

    def iterator():
        # suppressyield=True: no events are emitted for root itself, only for its descendants
        yield from recursiveiterator(root, suppressyield=True)

    return iterator
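To see how start and end events translate into a nesting depth, here is a stripped-down sketch of the same idea. The walk_events generator and the tiny document are assumptions made for this illustration; the sketch leaves out the events and tags filtering of the versions above and emits events for the root as well:

```python
import xml.etree.ElementTree as ET

def walk_events(elem):
    """Yield ("start", elem) before an element's children and ("end", elem) after them."""
    yield "start", elem
    for child in elem:
        yield from walk_events(child)
    yield "end", elem

# Hypothetical document used only for this illustration.
root = ET.fromstring("<a><b><c/></b><d/></a>")

depth = 0
depths = {}
for event, el in walk_events(root):
    if event == "start":
        depth += 1            # entering an element: one level deeper
        depths[el.tag] = depth
    else:
        depth -= 1            # leaving an element: back up one level

print(depths)
```

Every start is matched by exactly one end, so the counter returns to zero once the walk is complete.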
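An excellent solution for converting XML to a dict: see https://stackoverflow.com/a/68082847/3505444
import xml.etree.ElementTree as ET

def etree_to_dict(t):
    if type(t) is ET.ElementTree:
        return etree_to_dict(t.getroot())
    # Note: child elements that share a tag overwrite one another in the result
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }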
And:
def nested_dict_pairs_iterator(dict_obj):
    ''' This function accepts a nested dictionary as argument
        and iterates over all values of the nested dictionaries
    '''
    # Iterate over all key-value pairs of the dict argument
    for key, value in dict_obj.items():
        # Check if value is of dict type
        if isinstance(value, dict):
            # If value is a dict then iterate over all its values
            for pair in nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            # If value is not a dict then yield the value
            yield (key, value)
Finally:
# myet is assumed to be a parsed ElementTree; it has no .root attribute, so use getroot()
root_dict = etree_to_dict(myet.getroot())
for pair in nested_dict_pairs_iterator(root_dict):
    print(pair)
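To make the result concrete, here is a small end-to-end sketch. The two functions are copied from the answer above in condensed form; the sample documents are made up for this illustration. It also shows one limitation worth keeping in mind: because the dict is keyed by tag, sibling elements with the same tag overwrite one another.

```python
import xml.etree.ElementTree as ET

def etree_to_dict(t):
    if type(t) is ET.ElementTree:
        return etree_to_dict(t.getroot())
    return {**t.attrib, 'text': t.text, **{e.tag: etree_to_dict(e) for e in t}}

def nested_dict_pairs_iterator(dict_obj):
    for key, value in dict_obj.items():
        if isinstance(value, dict):
            # Prefix every pair from the nested dict with the current key
            for pair in nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            yield (key, value)

# Hypothetical document with unique child tags.
root = ET.fromstring('<a x="1"><b>hi</b><c>yo</c></a>')
pairs = list(nested_dict_pairs_iterator(etree_to_dict(root)))
for pair in pairs:
    print(pair)

# Hypothetical document with repeated sibling tags: only the last <i> survives.
repeated = etree_to_dict(ET.fromstring('<r><i>1</i><i>2</i></r>'))
print(repeated)
```

For documents like the country example, where tags repeat, this means iter() or findall() is likely the safer route unless the conversion is adapted to collect repeated tags into lists.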