
How to recursively iterate over XML tags in Python using ElementTree?

I am trying to iterate over all nodes in a tree using ElementTree.

I do something like:

import xml.etree.ElementTree as ET

tree = ET.parse("/tmp/test.xml")
root = tree.getroot()

for child in root:
    # do something with child
    pass

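The problem is that child is an Element object and not an ElementTree object, so I can't look further into it and recurse to iterate over its elements. Is there a way to iterate differently over the "root" so that it iterates over the top-level nodes in the tree (the immediate children) and returns the same class as the root itself?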

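To iterate over all nodes, use the iter method on the ElementTree, not on the root Element.

The root is an Element, just like the other elements in the tree, and only really has the context of its own attributes and children. The ElementTree has the context for all Elements.

For example, given this XML: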

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

You can do the following:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> for elem in tree.iter():
...     print(elem)
... 
<Element 'data' at 0x10b2d7b50>
<Element 'country' at 0x10b2d7b90>
<Element 'rank' at 0x10b2d7bd0>
<Element 'year' at 0x10b2d7c50>
<Element 'gdppc' at 0x10b2d7d10>
<Element 'neighbor' at 0x10b2d7e90>
<Element 'neighbor' at 0x10b2d7ed0>
<Element 'country' at 0x10b2d7f10>
<Element 'rank' at 0x10b2d7f50>
<Element 'year' at 0x10b2d7f90>
<Element 'gdppc' at 0x10b2d7fd0>
<Element 'neighbor' at 0x10b2db050>
<Element 'country' at 0x10b2db090>
<Element 'rank' at 0x10b2db0d0>
<Element 'year' at 0x10b2db110>
<Element 'gdppc' at 0x10b2db150>
<Element 'neighbor' at 0x10b2db190>
<Element 'neighbor' at 0x10b2db1d0>
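As for the recursion the question asks about: an Element is itself iterable over its direct children, so a child can be walked the same way as the root. The following is a minimal sketch, not part of the original answer; `XML_TEXT` is a trimmed copy of the sample document and `walk` is a helper name introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, for illustration only.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)


def walk(elem, depth=0):
    """Recursively yield (depth, element) for elem and all of its descendants."""
    yield depth, elem
    for child in elem:  # iterating an Element yields its direct children
        yield from walk(child, depth + 1)


# The recursive walk visits elements in the same order as ElementTree.iter().
tags_recursive = [elem.tag for _, elem in walk(root)]
tags_iter = [elem.tag for elem in ET.ElementTree(root).iter()]

for depth, elem in walk(root):
    print("  " * depth + elem.tag)
```

The recursive form is mainly useful when the nesting depth matters, because iter() alone does not report it.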

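Adding to Robert Christie's answer: when parsing with fromstring(), you can iterate over all nodes by wrapping the resulting Element in an ElementTree:

import xml.etree.ElementTree as ET

# xml_string is assumed to hold the XML document as a str
e = ET.ElementTree(ET.fromstring(xml_string))
for elt in e.iter():
    print("%s: '%s'" % (elt.tag, elt.text))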
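For traversal alone the wrapper may not be needed: fromstring() returns an Element, and Element has its own iter() method, which also accepts a tag name to filter on. A minimal sketch, where `xml_string` holds a small made-up document for illustration:

```python
import xml.etree.ElementTree as ET

# Small stand-in document, for illustration only.
xml_string = """<data>
    <country name="Panama">
        <rank>68</rank>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(xml_string)  # an Element, not an ElementTree

# iter() with no argument visits root and every descendant in document order.
all_tags = [elt.tag for elt in root.iter()]

# iter("neighbor") yields only the elements whose tag is "neighbor".
neighbor_names = [elt.get("name") for elt in root.iter("neighbor")]

print(all_tags)
print(neighbor_names)
```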

You can also access specific elements like this:

country = tree.findall('.//country')

Then loop over range(len(country)) and access each entry by index.
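Index-based looping is not required, since findall() returns a list of Elements that can be iterated directly. A minimal sketch, assuming a trimmed copy of the sample document (`XML_TEXT` and `summary` are names introduced here for illustration):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document, for illustration only.
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

summary = []
for country in root.findall(".//country"):
    name = country.get("name")       # attribute lookup
    rank = country.findtext("rank")  # text of the first <rank> child
    summary.append((name, rank))

print(summary)
```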

In addition to Robert Christie's accepted answer, printing the tags and values separately is very easy:

tree = ET.parse('test.xml')
for elem in tree.iter():
    print(elem.tag, elem.text)

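While iter() is very good, I needed a way to walk an XML hierarchy while tracking the nesting level, and iter() does not help with that at all. I wanted something like iterparse(), which emits start and end events at each level of the hierarchy, but I already have an ElementTree, so I did not want the unnecessary step and overhead of converting it to a string and re-parsing it just to use iterparse().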
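For comparison, this is roughly what the iterparse() behaviour described above looks like when parsing from a source. It is only a sketch: `XML_BYTES` is a made-up document, and `trace` is a name introduced here to record what was seen.

```python
import io
import xml.etree.ElementTree as ET

# Made-up document, for illustration only.
XML_BYTES = b"<a><b><c/></b><b/></a>"

level = 0
trace = []
# iterparse() reads from a file name or file object and yields (event, elem) pairs.
for event, elem in ET.iterparse(io.BytesIO(XML_BYTES), events=("start", "end")):
    if event == "start":
        level += 1
    trace.append((level, event, elem.tag))
    if event == "end":
        level -= 1

for lvl, event, tag in trace:
    print(lvl, event, tag)
```

The start/end pairing is what makes the nesting level easy to track, but it requires parsing from a source rather than walking a tree that already exists in memory.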

I was surprised not to find such a walker for an already-parsed tree, so I had to write my own:

def iterwalk(root, events=None, tags=None):
    """Incrementally walks XML structure (like iterparse but for an existing ElementTree structure)
    Returns an iterator providing (event, elem) pairs.
    Events are start and end
    events is a list of events to emit - defaults to ["start","end"]
    tags is a single tag or a list of tags to emit events for - if empty/None events are generated for all tags
    """
    # each stack entry consists of a list of the xml element and a second entry initially None
    # if the second entry is None a start is emitted and all children of current element are put into the second entry
    # if the second entry is a non-empty list the first item in it is popped and then a new stack entry is created
    # once the second entry is an empty list, an end is generated and then the stack is popped
    stack = [[root,None]]
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start","end"]
    def iterator():
        while stack:
            elnow,children = stack[-1]
            if children is None:
                # this is the start of elnow so emit a start and put its children into the stack entry
                if ( not tags or elnow.tag in tags ) and "start" in events:
                    yield ("start",elnow)
                # put the children into the top stack entry
                stack[-1][1] = list(elnow)
            elif len(children)>0:
                # do a child and remove it
                thischild = children.pop(0)
                # and now create a new stack entry for this child
                stack.append([thischild,None])                
            else:
                # finished these children - emit the end
                if ( not tags or elnow.tag in tags ) and "end" in events:
                    yield ("end",elnow)
                stack.pop()
    return iterator

# myxml is my parsed XML which has nested Binding tags, I want to count the depth of nesting

# Now explore the structure
it = iterwalk(myxml, tags='Binding')
level = 0
for event,el in it():
    if event == "start":
        level += 1
        
    print( f"{level} {el.tag=}" )
    
    if event == "end":
        level -= 1

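This uses a stack so that start events can be emitted while moving down the hierarchy, and the walk can then backtrack correctly. The last entry on the stack is initially [el, None], so the start event for el is emitted and the entry is updated to [el, children]. Each child is removed from children as it is entered, until after the last child has been processed the entry is [el, []]; at that point the end event for el is emitted and the top entry is removed from the stack.

I did it this way with a stack because I am not fond of debugging recursive code, and in any case I was not sure how to write a recursive iterator function.

Here is a recursive version that is easier to understand, though it would be hard to debug if it were less simple and something went wrong. Along the way I learned about yield from. Note that this version suppresses the events for the root element itself (suppressyield=True), whereas iterwalk above emits them.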

def iterwalk1(root, events=None, tags=None):
    """Recursive version - Incrementally walks XML structure (like iterparse but for an existing ElementTree structure)
    Returns an iterator providing (event, elem) pairs.
    Events are start and end
    events is a list of events to emit - defaults to ["start","end"]
    tags is a single tag or a list of tags to emit events for - if None or empty list then events are generated for all tags
    """
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start","end"]
    
    def recursiveiterator(el,suppressyield=False):
        if not suppressyield and ( not tags or el.tag in tags ) and "start" in events:
            yield ("start",el)
        for child in list(el):
            yield from recursiveiterator(child)
        if not suppressyield and  ( not tags or el.tag in tags ) and "end" in events:
            yield ("end",el)
            
    def iterator():
        yield from recursiveiterator( root, suppressyield=True )
        
    return iterator

An excellent solution for converting XML to a dict: see https://stackoverflow.com/a/68082847/3505444

def etree_to_dict(t):
    if type(t) is ET.ElementTree: return etree_to_dict(t.getroot())
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }

and:

def nested_dict_pairs_iterator(dict_obj):
    ''' This function accepts a nested dictionary as argument
        and iterate over all values of nested dictionaries
    '''
    # Iterate over all key-value pairs of dict argument
    for key, value in dict_obj.items():
        # Check if value is of dict type
        if isinstance(value, dict):
            # If value is dict then iterate over all its values
            for pair in  nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            # If value is not dict type then yield the value
            yield (key, value)

Finally:

# myet is assumed to be an already-parsed ElementTree
root_dict = etree_to_dict(myet.getroot())
for pair in nested_dict_pairs_iterator(root_dict):
    print(pair)
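One caveat with etree_to_dict as written: the inner dict comprehension is keyed by tag, so siblings that share a tag overwrite one another and only the last one survives (for the sample document above, only the last country). A possible variant that collects repeated tags into lists is sketched below; `etree_to_dict_lists` is a hypothetical helper name, not part of the linked answer.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def etree_to_dict_lists(elem):
    """Hypothetical variant: children are grouped by tag into lists, so repeats are kept."""
    children = defaultdict(list)
    for child in elem:
        children[child.tag].append(etree_to_dict_lists(child))
    return {**elem.attrib, "text": elem.text, **children}


# Made-up document with two siblings sharing the tag "country".
root = ET.fromstring(
    '<data>'
    '<country name="A"><rank>1</rank></country>'
    '<country name="B"><rank>4</rank></country>'
    '</data>'
)

result = etree_to_dict_lists(root)
names = [country["name"] for country in result["country"]]
print(names)
```

The trade-off is that every child tag now maps to a list, even when it occurs only once, so consumers have to index into it.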

