
Parse certain XML tags sequentially using ElementTree

I am trying to parse an XML file sequentially, considering only the XML tags that are of interest. A sample XML file is shown below (stored as file.xml). I am only interested in certain XML tags at known paths, as shown in the Python code snippet below (e.g. header/para/paratext, body/section/intro/text). Different XML files might have the tags in a different order, so I do not want to prescribe the order in which my known XML tags occur. Is there an efficient way to do this without having to loop through the whole XML file?

XML file:

<data>
  <header>
    <para>
      <paratext>0 - extract this</paratext>
    </para>
  </header>
  <body>
    <section>
      <intro>
        <text>1 - extract this</text>
      </intro>
      <para>
        <paratext>2 - extract this</paratext>
      </para>
      <items>
        <paratext>do not extract this</paratext>
        <part>
          <para>
            <paratext>3 - extract this</paratext>
          </para>
        </part>
      </items>
    </section>
    <section>
      <text>do not extract this</text>
      <intro>
        <text>4 - extract this</text>
      </intro>
      <para>
        <paratext>5 - extract this</paratext>
      </para>
      <para>
        <paratext>6 - extract this</paratext>
      </para>
    </section>
  </body>
</data>

Desired output: ['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']

Sample Python script:

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

### Paths I would like to extract (but sequentially)
[i.text for i in root.findall('header/para/paratext')]
# ['0 - extract this']
[i.text for i in root.findall('body/section/intro/text')]
# ['1 - extract this', '4 - extract this']
[i.text for i in root.findall('body/section/para/paratext')]
# ['2 - extract this', '5 - extract this', '6 - extract this']
[i.text for i in root.findall('body/section/items/part/para/paratext')]
# ['3 - extract this']

I think the best way to do this is to use the union operator ("|") in XPath. That will select the desired elements in document order.

Unfortunately, ElementTree has limited XPath support.

If you can use lxml, it has much better XPath support.

Example:

Python:

from lxml import etree

tree = etree.parse("file.xml")

print([i.text for i in tree.xpath('header/para/paratext|'
                                  'body/section/intro/text|'
                                  'body/section/para/paratext|'
                                  'body/section/items/part/para/paratext')])

Printed output:

['0 - extract this', '1 - extract this', '2 - extract this', '3 - extract this', '4 - extract this', '5 - extract this', '6 - extract this']
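
If lxml is not an option, one possible workaround with the standard library alone is to walk the tree once in document order and keep the text of any element whose path from the root is in a set of wanted paths. The following is only a minimal sketch, not part of the original answer: the names WANTED and extract_in_order are hypothetical, and the sample XML is inlined as a string so the snippet is self-contained. It descends only into subtrees that can still lead to a wanted path, so unrelated branches are skipped.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined for a self-contained example.
XML = """<data>
  <header><para><paratext>0 - extract this</paratext></para></header>
  <body>
    <section>
      <intro><text>1 - extract this</text></intro>
      <para><paratext>2 - extract this</paratext></para>
      <items>
        <paratext>do not extract this</paratext>
        <part><para><paratext>3 - extract this</paratext></para></part>
      </items>
    </section>
    <section>
      <text>do not extract this</text>
      <intro><text>4 - extract this</text></intro>
      <para><paratext>5 - extract this</paratext></para>
      <para><paratext>6 - extract this</paratext></para>
    </section>
  </body>
</data>"""

# Paths of interest, relative to the root element (hypothetical name).
WANTED = {
    "header/para/paratext",
    "body/section/intro/text",
    "body/section/para/paratext",
    "body/section/items/part/para/paratext",
}


def extract_in_order(root, wanted):
    """Return the text of elements at the wanted paths, in document order."""
    # Collect every proper prefix of a wanted path, e.g. "body", "body/section".
    prefixes = set()
    for path in wanted:
        parts = path.split("/")
        for i in range(1, len(parts)):
            prefixes.add("/".join(parts[:i]))

    results = []

    def walk(elem, path):
        # Iterating over an element yields its children in document order.
        for child in elem:
            child_path = f"{path}/{child.tag}" if path else child.tag
            if child_path in wanted:
                results.append(child.text)
            # Descend only where a wanted path can still be reached.
            if child_path in prefixes:
                walk(child, child_path)

    walk(root, "")
    return results


root = ET.fromstring(XML)
texts = extract_in_order(root, WANTED)
print(texts)
```

For a file on disk, ET.parse('file.xml').getroot() could be passed in place of the inlined string. Note that this still parses the whole file into memory; only the traversal is pruned.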
