How to iterate over child of child elements in XML Python?

I have an XML document structured like this:

<pages>
 <page>
  <textbox>
    <new_line>
     <text>
     </text>
    </new_line>
  </textbox>
 </page>
</pages>

I'm iterating over the text elements that are children of each new_line element in order to merge adjacent tags that share the same size attribute. However, I want to restrict this to new_line elements that are inside a textbox element. I tried adding a for loop to my code, but it doesn't work. Here is the code:

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line element in the document
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over all of them with the current and previous ones
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get size elements
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Add them to current element
            current_text.text = pt + ct
            # Remove previous element
            previous_text.getparent().remove(previous_text)



newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

EDIT:

Sample string:

"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>
"""

Expected output:

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>

You can define a recursive function to handle the nested XML in your case. I wrote a short script for this problem.

import xml.etree.ElementTree as etree


def add_sub_element(parent, tag, attrib, text=None):
    # Create a child of `parent`; set its text only when there is some
    new_feed = etree.SubElement(parent, tag, attrib)

    if text:
        new_feed.text = text

    return new_feed


def my_tree_mapper(parent_tag, current, element):
    # Copy the children of `element` under `current`, merging adjacent
    # 'text' children of equal size when inside textbox/new_line

    if current.tag == 'new_line' and parent_tag == 'textbox':

        current_size = None
        current_text = ""

        for child in element:
            child_tag = child.tag
            child_attrib = child.attrib
            child_text = child.text or ""

            if child_tag == 'text' and 'size' in child_attrib:
                if child_attrib['size'] == current_size:
                    # For 'text' children with the same size
                    # Append text until we get a different size
                    current_text = current_text + child_text
                else:
                    if current_size is not None:
                        # Add sub element into the tree when we get a different size
                        add_sub_element(
                            current, 'text', {'size': current_size}, current_text)

                    current_size = child_attrib['size']
                    current_text = child_text

            else:
                if current_size is not None:
                    # Or flush the pending group when we get a child without a size
                    add_sub_element(
                        current, 'text', {'size': current_size}, current_text)

                # No merge logic for other children: copy them as they are
                sub_element = add_sub_element(
                    current, child_tag, child_attrib, child.text)
                my_tree_mapper(current.tag, sub_element, child)

                current_size = None
                current_text = ""

        if current_size is not None:
            # Flush the last group when new_line ends with sized 'text' children
            add_sub_element(
                current, 'text', {'size': current_size}, current_text)
    else:
        # Outside textbox/new_line, copy the children unchanged
        for child in element:
            child_tag = child.tag
            child_attrib = child.attrib
            child_text = child.text

            sub_element = add_sub_element(
                current, child_tag, child_attrib, child_text)
            my_tree_mapper(current.tag, sub_element, child)


the_input = """<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
          </new_line>
        </textbox>
    </page>
</pages>
"""

tree = etree.ElementTree(etree.fromstring(the_input))
root = tree.getroot()
new_root = etree.Element(root.tag, root.attrib)

my_tree_mapper('', new_root, root)
# encoding='unicode' returns a str, so print shows XML instead of a bytes repr
print(etree.tostring(new_root, encoding='unicode'))
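
For comparison, here is a minimal sketch of a non-recursive variant that restricts the merge to new_line elements sitting directly inside a textbox by using a path expression. It uses only the standard library's ElementTree rather than lxml; with lxml, the same restriction could presumably be written as tree.xpath('//textbox/new_line'). The names SAMPLE and merge_same_size are hypothetical, and the extra new_line outside textbox is added to the sample only to show that it is left untouched.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

# Shortened sample; the second new_line (outside textbox) is an assumption
# added to show that it is not merged.
SAMPLE = """<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
            </new_line>
        </textbox>
        <new_line>
            <text size="10">x</text>
            <text size="10">y</text>
        </new_line>
    </page>
</pages>"""


def merge_same_size(root):
    # Hypothetical helper: only new_line elements that are direct
    # children of a textbox match this path.
    for new_line in root.findall('.//textbox/new_line'):
        children = list(new_line)  # snapshot, since we remove while looping
        # groupby only groups adjacent items, which is what we want here
        for (tag, size), run in groupby(
                children, key=lambda c: (c.tag, c.get('size'))):
            run = list(run)
            if tag != 'text' or size is None or len(run) < 2:
                continue  # unsized, single, or non-text children stay as they are
            # Keep the first element of the run and fold the others into it
            run[0].text = ''.join(c.text or '' for c in run)
            for extra in run[1:]:
                new_line.remove(extra)
    return root


root = merge_same_size(ET.fromstring(SAMPLE))
merged = [t.text for t in root.findall('.//textbox/new_line/text')]
outside = [t.text for t in root.findall('./page/new_line/text')]
print(merged)
print(outside)
```

Keeping the first element of each run (rather than the last, as in the question's code) is only a design choice for the sketch; the merged text is the same either way.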

Hope this can help you, or at least give you some idea.

(In case you want to read more, see the documentation and examples on recursive functions, and the documentation for the XML etree methods.)
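
As a small illustration of the recursive idea, the sketch below walks an element tree and passes each element's tag down to its children as the parent tag, which is the same mechanism the mapper above relies on to detect a new_line inside a textbox. The helper name walk and the tiny inline document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def walk(element, parent_tag=None, depth=0, out=None):
    # Hypothetical helper: visit every element, remembering its parent's tag
    if out is None:
        out = []
    out.append((depth, element.tag, parent_tag))
    for child in element:
        # The current element's tag becomes the parent tag one level down
        walk(child, element.tag, depth + 1, out)
    return out


doc = ET.fromstring(
    "<pages><page><textbox><new_line><text/></new_line></textbox>"
    "<new_line/></page></pages>"
)
visited = walk(doc)
# Keep only new_line elements whose parent is a textbox
inside_textbox = [v for v in visited if v[1] == 'new_line' and v[2] == 'textbox']
print(inside_textbox)
```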
