How to recursively iterate over XML tags in Python using ElementTree?

I am trying to iterate over all nodes in a tree using ElementTree.

I do something like:

import xml.etree.ElementTree as ET

tree = ET.parse("/tmp/test.xml")
root = tree.getroot()

for child in root:
    pass  # do something with child

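The problem is that child is an Element object and not an ElementTree object, so I can't look further into it and recurse over its elements. Is there a way to iterate over the "root" differently, so that it iterates over the top-level nodes in the tree (the immediate children) and returns the same class as the root itself?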

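To iterate over all nodes, use the iter method on the ElementTree, not on the root Element.

The root is an Element, just like the other elements in the tree, and only has the context of its own attributes and children. The ElementTree has the context for all Elements.

For example, given this XML: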

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

You can do the following:

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.xml')
>>> for elem in tree.iter():
...     print(elem)
... 
<Element 'data' at 0x10b2d7b50>
<Element 'country' at 0x10b2d7b90>
<Element 'rank' at 0x10b2d7bd0>
<Element 'year' at 0x10b2d7c50>
<Element 'gdppc' at 0x10b2d7d10>
<Element 'neighbor' at 0x10b2d7e90>
<Element 'neighbor' at 0x10b2d7ed0>
<Element 'country' at 0x10b2d7f10>
<Element 'rank' at 0x10b2d7f50>
<Element 'year' at 0x10b2d7f90>
<Element 'gdppc' at 0x10b2d7fd0>
<Element 'neighbor' at 0x10b2db050>
<Element 'country' at 0x10b2db090>
<Element 'rank' at 0x10b2db0d0>
<Element 'year' at 0x10b2db110>
<Element 'gdppc' at 0x10b2db150>
<Element 'neighbor' at 0x10b2db190>
<Element 'neighbor' at 0x10b2db1d0>

Adding to Robert Christie's answer, it is possible to iterate over all nodes when using fromstring() by converting the Element to an ElementTree:

import xml.etree.ElementTree as ET

e = ET.ElementTree(ET.fromstring(xml_string))
for elt in e.iter():
    print("%s: '%s'" % (elt.tag, elt.text))
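The ElementTree wrapper is probably optional here, since the Element returned by fromstring() has an iter() method of its own, and that method also accepts a tag name to filter on. A small sketch; the xml_string below is a made-up stand-in for your own document.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; replace with your own XML text.
xml_string = """<data>
    <country name="Panama">
        <rank>68</rank>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

root = ET.fromstring(xml_string)  # returns the root Element

# Element.iter() visits the element itself and every descendant.
all_tags = [elt.tag for elt in root.iter()]

# Passing a tag name restricts the walk to matching elements.
neighbor_names = [elt.get("name") for elt in root.iter("neighbor")]

print(all_tags)
print(neighbor_names)
```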

You can also access specific elements like this:

country = tree.findall('.//country')

Then loop over range(len(country)) and access each item by index.
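Indexing through range(len(...)) is not strictly required, because findall() returns a list that can be looped over directly. The sketch below assumes each country has a name attribute and a rank child, as in the sample XML; the inline xml_string and the ranks dictionary are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical input modelled on the sample document.
xml_string = """<data>
    <country name="Liechtenstein"><rank>1</rank></country>
    <country name="Singapore"><rank>4</rank></country>
    <country name="Panama"><rank>68</rank></country>
</data>"""

tree = ET.ElementTree(ET.fromstring(xml_string))

ranks = {}
for country in tree.findall(".//country"):
    name = country.get("name")       # attribute lookup
    rank = country.findtext("rank")  # text of the first <rank> child
    ranks[name] = int(rank)

print(ranks)
```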

In addition to Robert Christie's accepted answer, printing the tags and values separately is very easy:

tree = ET.parse('test.xml')
for elem in tree.iter():
    print(elem.tag, elem.text)

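While iter() is very good, I needed a way to walk an XML hierarchy while tracking the nesting level, and iter() does not help with that at all. I wanted something like iterparse(), which emits start and end events at each level of the hierarchy, but I already had the ElementTree, so I did not want the unnecessary step and overhead of converting it to a string and re-parsing it that using iterparse() would require.

I was surprised that I could not find this, so I had to write it myself: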

def iterwalk(root, events=None, tags=None):
    """Incrementally walks XML structure (like iterparse but for an existing ElementTree structure)
    Returns an iterator providing (event, elem) pairs.
    Events are start and end
    events is a list of events to emit - defaults to ["start","end"]
    tags is a single tag or a list of tags to emit events for - if empty/None events are generated for all tags
    """
    # each stack entry consists of a list of the xml element and a second entry initially None
    # if the second entry is None a start is emitted and all children of current element are put into the second entry
    # if the second entry is a non-empty list the first item in it is popped and then a new stack entry is created
    # once the second entry is an empty list, an end is generated and then the stack is popped
    stack = [[root,None]]
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start","end"]
    def iterator():
        while stack:
            elnow,children = stack[-1]
            if children is None:
                # this is the start of elnow so emit a start and put its children into the stack entry
                if ( not tags or elnow.tag in tags ) and "start" in events:
                    yield ("start",elnow)
                # put the children into the top stack entry
                stack[-1][1] = list(elnow)
            elif len(children)>0:
                # do a child and remove it
                thischild = children.pop(0)
                # and now create a new stack entry for this child
                stack.append([thischild,None])                
            else:
                # finished these children - emit the end
                if ( not tags or elnow.tag in tags ) and "end" in events:
                    yield ("end",elnow)
                stack.pop()
    return iterator

# myxml is my parsed XML which has nested Binding tags, I want to count the depth of nesting

# Now explore the structure
it = iterwalk(myxml, tags='Binding')
level = 0
for event,el in it():
    if event == "start":
        level += 1
        
    print( f"{level} {el.tag=}" )
    
    if event == "end":
        level -= 1

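A stack is used so that start events can be emitted while moving down the hierarchy, and the walk can then backtrack correctly. The last entry on the stack is initially [el, None], so the start event for el is emitted and the entry is updated to [el, children]. Each child is removed from children as it is entered. Once the last child has been processed the entry is [el, []], at which point the end event for el is emitted and the top entry is removed from the stack.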
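The same stack idea can be sketched more compactly by storing an iterator over each element's children instead of a list that is consumed from the front. This is an illustrative variant rather than the function above: walk_events is a made-up name, and the events and tags filters are left out to keep it short.

```python
import xml.etree.ElementTree as ET


def walk_events(root):
    """Yield ("start", elem) and ("end", elem) pairs for root and its descendants."""
    yield "start", root
    # Each stack entry pairs an element with an iterator over its remaining children.
    stack = [(root, iter(root))]
    while stack:
        elem, children = stack[-1]
        child = next(children, None)
        if child is None:
            # No children left: this element is finished.
            yield "end", elem
            stack.pop()
        else:
            # Descend into the next child.
            yield "start", child
            stack.append((child, iter(child)))


root = ET.fromstring("<a><b><c/></b><d/></a>")

level = 0
depths = []
for event, elem in walk_events(root):
    if event == "start":
        level += 1
        depths.append((elem.tag, level))
    else:
        level -= 1

print(depths)
```

Because every start is matched by an end, the level counter returns to zero when the walk completes.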

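I did it this way with a stack because I am not fond of debugging recursive code, and in any case I was not sure how to write a recursive iterator function.

Here is a recursive version that is easier to understand, although it would be hard to debug if it were less simple and something went wrong. It relies on yield from: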

def iterwalk1(root, events=None, tags=None):
    """Recursive version - Incrementally walks XML structure (like iterparse but for an existing ElementTree structure)
    Returns an iterator providing (event, elem) pairs.
    Events are start and end
    events is a list of events to emit - defaults to ["start","end"]
    tags is a single tag or a list of tags to emit events for - if None or empty list then events are generated for all tags
    """
    tags = [] if tags is None else tags if isinstance(tags, list) else [tags]
    events = events or ["start","end"]
    
    def recursiveiterator(el,suppressyield=False):
        if not suppressyield and ( not tags or el.tag in tags ) and "start" in events:
            yield ("start",el)
        for child in list(el):
            yield from recursiveiterator(child)
        if not suppressyield and  ( not tags or el.tag in tags ) and "end" in events:
            yield ("end",el)
            
    def iterator():
        # suppressyield=True means no events are emitted for root itself, only for its descendants
        yield from recursiveiterator( root, suppressyield=True )
        
    return iterator

An excellent solution for converting XML to a dict: see https://stackoverflow.com/a/68082847/3505444

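def etree_to_dict(t):
    # Accept either an ElementTree or an Element.
    if isinstance(t, ET.ElementTree):
        return etree_to_dict(t.getroot())
    # Note: child tags that repeat under one parent overwrite each other here.
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }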

and:

def nested_dict_pairs_iterator(dict_obj):
    ''' This function accepts a nested dictionary as argument
        and iterate over all values of nested dictionaries
    '''
    # Iterate over all key-value pairs of dict argument
    for key, value in dict_obj.items():
        # Check if value is of dict type
        if isinstance(value, dict):
            # If value is dict then iterate over all its values
            for pair in  nested_dict_pairs_iterator(value):
                yield (key, *pair)
        else:
            # If value is not dict type then yield the value
            yield (key, value)

Finally:

root_dict = etree_to_dict(myet.getroot())  # myet is a parsed ElementTree
for pair in nested_dict_pairs_iterator(root_dict):
    print(pair)
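One caveat with a dict built from a comprehension keyed on the tag: only the last child is kept for each tag, so repeated siblings such as several country elements collapse into a single entry. A possible variant that collects repeated tags into lists is sketched below. etree_to_dict_multi is a hypothetical name and not part of the linked answer, and a pair iterator that only descends into dicts would need extending to handle the lists.

```python
import xml.etree.ElementTree as ET


def etree_to_dict_multi(elem):
    """Convert an Element to a dict, keeping repeated child tags as lists.

    Limitation of this sketch: a child tag that has the same name as an
    attribute (or as the 'text' key) is merged into a list with that value.
    """
    result = {**elem.attrib, "text": elem.text}
    for child in elem:
        converted = etree_to_dict_multi(child)
        if child.tag not in result:
            result[child.tag] = converted            # first occurrence
        elif isinstance(result[child.tag], list):
            result[child.tag].append(converted)      # third and later occurrences
        else:
            result[child.tag] = [result[child.tag], converted]  # second occurrence
    return result


root = ET.fromstring(
    '<data>'
    '<country name="A"><rank>1</rank></country>'
    '<country name="B"><rank>4</rank></country>'
    '</data>'
)
d = etree_to_dict_multi(root)
print(d)
```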

