Parsing XML with lxml and elementtree

Question

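I am trying to parse an XML document and return the <input> nodes that have a ref attribute. A toy example works, but the real document returns an empty list when it should show matches.

Toy example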

from lxml import etree
tree = etree.XML('<body><input ref="blabla"><label>Cats</label></input><input ref="blabla"><label>Dogs</label></input><input ref="blabla"><label>Birds</label></input></body>')
# I can return the relevant input nodes with:
print(len(tree.findall(".//input[@ref]")))
3

But for some reason it fails with the following (stripped-down) file:

example.xml

<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <h:head>
    <h:title>A title</h:title>
  </h:head>
  <h:body>
    <group ref="blabla">
      <label>Group 1</label>
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>

Script

from lxml import etree
# Read the file as bytes so the parser handles the encoding itself
with open("example.xml", "rb") as myfile:
    xml = myfile.read()
tree = etree.XML(xml)
print(len(tree.findall(".//input[@ref]")))
0

Any idea why it fails and how to fix it? I think it might have something to do with the XML header. Thanks very much for your help.

Answer 1

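I think the problem is that the elements in the real document are in a namespace, so the unqualified expression .findall(".//input[@ref]") does not match the input element in the document. That element is actually a namespaced input element in the http://www.w3.org/2002/xforms namespace.

So maybe try this: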

.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")

Updated after my original answer to use the xforms namespace instead of the xhtml namespace (as pointed out in the other answer).
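If writing the full URI inside the path feels clumsy, a prefix mapping can be passed to findall instead. The following is only a minimal sketch using the standard-library xml.etree.ElementTree on a shortened copy of the question's document; the prefix "xf" and the names XML and NS are arbitrary choices made for this illustration, not anything the document defines.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question
XML = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <label>Group 1</label>
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

# "xf" is a prefix picked for this sketch; only the URI has to match the document
NS = {"xf": "http://www.w3.org/2002/xforms"}

root = ET.fromstring(XML)
matches = root.findall(".//xf:input[@ref]", NS)
print(len(matches))
```

The prefix used in the query does not have to match any prefix in the file, because the lookup is done on the namespace URI.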

Answer 2

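As you can see from your XML, the namespace of the unprefixed elements is "http://www.w3.org/2002/xforms". That is because it is declared as the default namespace (an xmlns attribute without any prefix) on the parent element h:html. Only the elements prefixed with h: are in the "http://www.w3.org/1999/xhtml" namespace.

So you also need to use that namespace in your query. Example: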
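One way to check which namespace each element ends up in is to print the tag names the parser produces. This is a small sketch with the standard-library xml.etree.ElementTree on a shortened copy of the question's document; the variable names XML and tags are just for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question
XML = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

root = ET.fromstring(XML)

# Each tag is stored as "{namespace-uri}localname"
tags = [el.tag for el in root.iter()]
for tag in tags:
    print(tag)
```

The h:-prefixed elements show up with the xhtml URI in braces, while group, input and label show up with the xforms URI.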

root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")

Example/demo:

>>> s = """<?xml version="1.0"?>
... <h:html xmlns="http://www.w3.org/2002/xforms" xmlns:ev="http://www.w3.org/2001/xml-events" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
...   <h:head>
...     <h:title>A title</h:title>
...   </h:head>
...   <h:body>
...     <group ref="blabla">
...       <label>Group 1</label>
...       <input ref="blabla">
...         <label>Field 1</label>
...       </input>
...     </group>
...   </h:body>
... </h:html>"""
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring(s)
>>> root.findall(".//{http://www.w3.org/1999/xhtml}input[@ref]")
[]
>>> root.findall(".//{http://www.w3.org/2002/xforms}input[@ref]")
[<Element '{http://www.w3.org/2002/xforms}input' at 0x02288EA0>]
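If you do not care which namespace the input element is in, a namespace wildcard may also work. This sketch assumes Python 3.8 or newer, where the standard-library xml.etree.ElementTree accepts the {*} form in findall paths; the names XML and any_ns are made up for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of example.xml from the question
XML = """<?xml version="1.0"?>
<h:html xmlns="http://www.w3.org/2002/xforms" xmlns:h="http://www.w3.org/1999/xhtml">
  <h:body>
    <group ref="blabla">
      <input ref="blabla">
        <label>Field 1</label>
      </input>
    </group>
  </h:body>
</h:html>"""

root = ET.fromstring(XML)

# {*} matches the local name "input" in any namespace (assumes Python 3.8+)
any_ns = root.findall(".//{*}input[@ref]")
print(len(any_ns))
```

This is less strict than naming the xforms namespace explicitly, so it would also pick up an input element from another vocabulary if the document had one.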
