lxml HtmlElement xpath parses more than it should be able to
Trying to parse HTML, I fail to loop through all li elements:
from lxml import html
page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.xpath("//li/text()"))
I expect this output:
b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
but I get this:
b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']
How is it possible that xpath can get both li elements' text from item in both iteration steps?
I can solve this using a counter as an index, of course, but I would like to understand what's going on.
item.xpath("//li/text()") searches for all li elements in the entire tree. A path that begins with // is absolute: it starts from the document root, no matter which element you call xpath() on. Since you want the text of the current node, you can just select its text() nodes: item.xpath("text()").
Or, even better, just get the text content:
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.text_content())
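The two suggestions above only give the same text when the li contains plain text. Here is a minimal sketch of where they differ, assuming a made-up li that holds a nested b element (this markup is not from the question):

```python
from lxml import html

# Hypothetical markup: an <li> with a nested <b> element.
page = "<ul><li>one <b>bold</b> tail</li></ul>"
tree = html.fromstring(page)
item = tree.xpath("//li")[0]

# text() selects only the text nodes that are direct children of the li.
direct = item.xpath("text()")
# .//text() selects every text node below the li, at any depth.
nested = item.xpath(".//text()")
# text_content() joins all descendant text into a single string.
joined = item.text_content()

print(direct)
print(nested)
print(joined)
```

So text_content() is probably the more convenient choice when the element may contain inline markup and you want one string rather than a list of pieces.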
From "Lxml html xpath context":
XPath expression //input will match all input elements, anywhere in your document, while .//input will match all inside current context.
The solution is to use:
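A small sketch of the quoted rule, using an invented page with two forms (the ids and input names here are assumptions for illustration only):

```python
from lxml import html

# Hypothetical page: two forms with differently named inputs.
page = """
<div>
  <form id="a"><input name="x"><input name="y"></form>
  <form id="b"><input name="z"></form>
</div>
"""
tree = html.fromstring(page)
form_b = tree.xpath("//form[@id='b']")[0]

# //input is absolute: it searches the whole document, even though
# xpath() is called on form_b.
everywhere = [el.get("name") for el in form_b.xpath("//input")]
# .//input is relative: it searches only below form_b.
inside = [el.get("name") for el in form_b.xpath(".//input")]

print(everywhere)
print(inside)
```

The element you call xpath() on only sets the context node; whether that context is used depends on whether the expression starts with a dot.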
from lxml import html
page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)
for item in tree.xpath("//li"):
    print(html.tostring(item))
    print(item.xpath(".//text()"))  # only changed line
Adding . before // prevents matching the entire document, and li/ needs to be removed since you are already "inside" the li tag.
The output is:
b'<li>one</li>'
['one']
b'<li>two</li>'
['two']