Python lxml XPath: preceding keyword does not give expected result
I am trying to parse an XML document as follows:
import re
from lxml.html.soupparser import fromstring
inString = """
<doc>
<q></q>
<p1>
<p2 dd="ert" ji="pp">
<p3>1</p3>
<p3>2</p3>
<p3>ABC</p3>
<p3>3</p3>
</p2>
<p2 dd="ert" ji="pp">
<p3>4</p3>
<p3>5</p3>
<p3>ABC</p3>
<p3>6</p3>
</p2>
</p1>
<r></r>
<p1>
<p2 dd="ert" ji="pp">
<p3>7</p3>
<p3>8</p3>
<p3>ABC</p3>
<p3>9</p3>
</p2>
<p2 dd="ert" ji="pp">
<p3>10</p3>
<p3>11</p3>
<p3>ABC</p3>
<p3>12</p3>
</p2>
</p1>
</doc>
"""
root = fromstring(inString)
nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]//preceding::p2//p3")
print(" ".join(re.sub(r"\s+", " ", para.text.strip()) for para in nodes))
So, for each <p1> tag, I want to get to the <p3> tags inside <p2>. Then I only want the <p3> tags up to the tag whose text contains ABC. However, if I run the above code, I get:
1 2 ABC 3 4 5 ABC 6 7 8 ABC 9
The desired output is:
1 2 4 5 7 8 10 11
Also, if I make this change:
nodes = root.xpath("./doc//p1/p2/p3[contains(text(),'ABC')]")
I get:
ABC ABC ABC ABC
So it looks like the second query is able to get all the matching <p3> nodes from the entire document as per the XPath, which is fine. Why doesn't my first query work?
How do I get the desired output?
Once you've located the p3 containing ABC, you don't need to go back up the tree. Just go "sideways" using the preceding-sibling axis:
./doc//p1/p2/p3[contains(text(),'ABC')]/preceding-sibling::p3
This prints 1 2 4 5 7 8 10 11.
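As for why the first query misbehaves: the preceding axis selects every node that comes before the context node in document order, excluding its ancestors. So preceding::p2 picks up all the earlier p2 blocks (but never the one containing the matched p3), and //p3 then returns every p3 inside them, ABC included. Here is a rough side-by-side sketch of the two queries. It assumes lxml.etree.fromstring instead of the soupparser from the question, so the root is the <doc> element and the path starts at //p1; the texts helper is just made up for the illustration.

```python
from lxml import etree

XML = """
<doc>
  <q></q>
  <p1>
    <p2 dd="ert" ji="pp"><p3>1</p3><p3>2</p3><p3>ABC</p3><p3>3</p3></p2>
    <p2 dd="ert" ji="pp"><p3>4</p3><p3>5</p3><p3>ABC</p3><p3>6</p3></p2>
  </p1>
  <r></r>
  <p1>
    <p2 dd="ert" ji="pp"><p3>7</p3><p3>8</p3><p3>ABC</p3><p3>9</p3></p2>
    <p2 dd="ert" ji="pp"><p3>10</p3><p3>11</p3><p3>ABC</p3><p3>12</p3></p2>
  </p1>
</doc>
"""

# Assumption: plain XML parser, so `root` is the <doc> element itself.
root = etree.fromstring(XML.strip())


def texts(query):
    """Hypothetical helper: join the text of all nodes matched by `query`."""
    return " ".join(node.text.strip() for node in root.xpath(query))


# The p3 elements whose text contains ABC (one per p2).
marker = "//p1/p2/p3[contains(text(),'ABC')]"

# preceding:: walks back through the whole document, so it returns
# every p3 of every earlier p2 and nothing from the matched p3's own p2.
with_preceding = texts(marker + "/preceding::p2//p3")

# preceding-sibling:: stays inside the same p2 and only returns
# the p3 elements that come before the ABC one.
with_sibling = texts(marker + "/preceding-sibling::p3")

print(with_preceding)  # 1 2 ABC 3 4 5 ABC 6 7 8 ABC 9
print(with_sibling)    # 1 2 4 5 7 8 10 11
```

Note that 10 and 11 never show up with preceding:: because the last p2 has no later ABC node to be "preceding" to, while its own ABC node cannot reach it since that p2 is an ancestor.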