Python XPath SyntaxError: invalid predicate
I am trying to parse an XML document like this:
<document>
<pages>
<page>
<paragraph>XBV</paragraph>
<paragraph>GHF</paragraph>
</page>
<page>
<paragraph>ash</paragraph>
<paragraph>lplp</paragraph>
</page>
</pages>
</document>
And here is my code:
import xml.etree.ElementTree as ET
tree = ET.parse("../../xml/test.xml")
root = tree.getroot()
path="./pages/page/paragraph[text()='GHF']"
print root.findall(path)
But I get an error:
print root.findall(path)
File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall
return ElementPath.findall(self, path, namespaces)
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall
return list(iterfind(elem, path, namespaces))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind
selector.append(ops[token[0]](next, token))
File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate
raise SyntaxError("invalid predicate")
SyntaxError: invalid predicate
What is wrong with my XPath?
Follow up
Thanks falsetru, your solution worked. I have a follow up. Now I want to get all the paragraph elements that come before the paragraph with text GHF. So in this case I only need the XBV element. I want to ignore the ash and lplp elements. I guess one way to do this would be:
result = []
for para in root.findall('./pages/page/'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)
But is there a better way to do this?
ElementTree's XPath support is limited. Use another library such as lxml:
import lxml.etree
root = lxml.etree.parse('test.xml')
path="./pages/page/paragraph[text()='GHF']"
print root.xpath(path)
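For the follow-up, lxml's full XPath 1.0 support should let the "paragraphs before GHF" selection be written as a single expression with the preceding-sibling axis. The following is a minimal sketch, assuming lxml is installed and Python 3 is used; the XML is inlined from the question so the snippet runs on its own, and the names XML and before_ghf are only illustrative.

```python
import lxml.etree

# Sample document from the question, inlined so the example is self-contained.
XML = """<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>"""

root = lxml.etree.fromstring(XML)

# Select the GHF paragraph, then step back over its earlier siblings
# (paragraphs inside the same <page> that appear before it).
path = "./pages/page/paragraph[text()='GHF']/preceding-sibling::paragraph"
before_ghf = root.xpath(path)

print([p.text for p in before_ghf])
```

Note that preceding-sibling only looks inside the same page. If paragraphs from earlier pages should be included as well, the preceding::paragraph axis could be used instead.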
As @falsetru mentioned, ElementTree doesn't support the text() predicate, but it does support matching a child element by its text. So in this example it is possible to search for a page that has a paragraph with specific text, using the path ./pages/page[paragraph='GHF']. The problem here is that there are multiple paragraph tags in a page, so one would still have to iterate over them to find the specific paragraph.
In my case, I needed to find the version of a dependency in a Maven pom.xml, and there is only a single version child, so the following worked:
In [1]: import xml.etree.ElementTree as ET
In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}
In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'
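Applied back to the question's document, the same child-text predicate can narrow the search to the right page, and a short loop then collects the paragraphs that come before GHF. This is only a sketch using the standard library under Python 3: paragraphs_before is a hypothetical helper, and it assumes the search text contains no single quotes, since it is pasted straight into the path. The last part relies on the [.='text'] predicate, which ElementTree accepts from Python 3.7 on; older versions raise the same "invalid predicate" error for it.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the example is self-contained.
XML = """<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>"""

root = ET.fromstring(XML)


def paragraphs_before(root, text):
    """Hypothetical helper: paragraphs preceding the one with `text` in its page."""
    # [paragraph='...'] matches the first page that has a paragraph child with that text.
    # Assumes `text` contains no single quotes.
    page = root.find("./pages/page[paragraph='%s']" % text)
    if page is None:
        return []
    result = []
    for para in page.findall("paragraph"):
        if para.text == text:
            break
        result.append(para)
    return result


print([p.text for p in paragraphs_before(root, "GHF")])

# Python 3.7+ only: [.='text'] selects the paragraph itself by its text,
# which is what the original text() predicate was meant to do.
matches = root.findall("./pages/page/paragraph[.='GHF']")
print([p.text for p in matches])
```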