Parsing XML with XPath in Python 3

I have the following XML:

<document>
  <internal-code code="201">
    <internal-desc>Biscuits Wrapped</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits (Wrapped)</web-sub-category>
  </internal-code>
  <internal-code code="202">
    <internal-desc>Biscuits Sweet</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits (Sweets)</web-sub-category>
  </internal-code>
  <internal-code code="221">
    <internal-desc>Biscuits Savoury</internal-desc>
    <top-grouping>Finished</top-grouping>
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits For Cheese</web-sub-category>
  </internal-code>
  ....
</document>

I have loaded it into a tree using this code:

try:
  groups = etree.parse(PRODUCT_GROUPS_XML_FILEPATH)
  root = groups.getroot()
  internalGroup = root.findall("./internal-code")
  LOG.append("[INFO] product groupings file loaded and parsed ok")
except Exception as e:
  LOG.append("[ERROR] PRODUCT GROUPINGS XML FILE ACCESS PROBLEM")
  LOG.append("[***TERMINATED***]")
  writelog()
  exit()

I would like to use XPath to find the correct internal-code element and then access the child nodes of that group. So if I am searching for internal-code 221 and want web-category, I would do something like:

internalGroup.find("internal-code", 221).get("web-category").text

I am not experienced with XML and Python, and I have been staring at this for ages. All help very gratefully received. Thanks

According to the xml.etree.ElementTree documentation:

XPath support

This module provides limited support for XPath expressions for locating elements in a tree. The goal is to support a small subset of the abbreviated syntax; a full XPath engine is outside the scope of the module.
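To give a rough idea of what that "small subset" covers, here is a minimal sketch using only the standard library. The trimmed-down XML string is an assumption made for illustration (it keeps only two groups from the question's data); the two predicate forms shown, [@attrib='value'] and [tag='text'], are among those the ElementTree documentation lists as supported.

```python
import xml.etree.ElementTree as ET

# Illustrative, cut-down version of the question's document (assumption).
root = ET.fromstring(
    '<document>'
    '<internal-code code="201">'
    '<web-sub-category>Biscuits (Wrapped)</web-sub-category>'
    '</internal-code>'
    '<internal-code code="221">'
    '<web-sub-category>Biscuits For Cheese</web-sub-category>'
    '</internal-code>'
    '</document>'
)

# Attribute predicate: select elements whose "code" attribute equals 221.
by_code = root.findall("./internal-code[@code='221']")

# Child-text predicate: select elements that have a child with this exact text.
by_child = root.findall("./internal-code[web-sub-category='Biscuits For Cheese']")

print([e.get("code") for e in by_code])
print([e.get("code") for e in by_child])
```

Both queries select the same internal-code element here. Anything beyond this subset, such as the text() step used in the lxml example, needs a full XPath engine.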

Use lxml:

>>> import lxml.etree as ET
>>>
>>> s = '''
... <document>
...   <internal-code code="201">
...     <internal-desc>Biscuits Wrapped</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits (Wrapped)</web-sub-category>
...   </internal-code>
...   <internal-code code="202">
...     <internal-desc>Biscuits Sweet</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits (Sweets)</web-sub-category>
...   </internal-code>
...   <internal-code code="221">
...     <internal-desc>Biscuits Savoury</internal-desc>
...     <top-grouping>Finished</top-grouping>
...     <web-category>Biscuits</web-category>
...     <web-sub-category>Biscuits For Cheese</web-sub-category>
...   </internal-code>
... </document>
... '''
>>>
>>> root = ET.fromstring(s)
>>> for text in root.xpath('.//internal-code[@code="221"]/web-category/text()'):
...     print(text)
...
Biscuits

While I'm a big fan of lxml (see falsetru's answer), which you would need for full XPath support, the standard library's ElementTree implementation supports enough to get what you need:

root.findtext('.//internal-code[@code="221"]/web-category')

This returns the text of the first matching element, which is enough if you are sure that code 221 occurs only once. If there could be more matches and you need a list:

[i.text for i in root.findall('.//internal-code[@code="221"]/web-category')]

(Note that these examples would also work in lxml.)
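For completeness, here is a self-contained sketch of the standard-library approach wrapped in a small helper. The helper name child_text and the shortened XML string are hypothetical, introduced only for illustration; the helper assumes the code value contains no quote characters, since it is pasted directly into the path.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's document (illustrative assumption).
XML = """
<document>
  <internal-code code="201">
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits (Wrapped)</web-sub-category>
  </internal-code>
  <internal-code code="221">
    <web-category>Biscuits</web-category>
    <web-sub-category>Biscuits For Cheese</web-sub-category>
  </internal-code>
</document>
"""


def child_text(root, code, child):
    """Return the text of `child` under the first internal-code element
    whose code attribute equals `code`, or None if nothing matches."""
    # Hypothetical helper: builds the same path used in the answer above.
    path = './/internal-code[@code="{}"]/{}'.format(code, child)
    return root.findtext(path)


root = ET.fromstring(XML)
print(child_text(root, 221, "web-sub-category"))
print(child_text(root, 999, "web-category"))
```

findtext returns None when no element matches, so the caller can tell a missing code apart from a real value without catching an exception.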
