lxml HtmlElement xpath parses more than it should be able to

Question

I am trying to parse HTML, but I fail to loop through the li elements one at a time:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath("//li/text()"))

I expect this output:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']

but I get this:

b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']

How is it possible that xpath, called on item, returns the text of both li elements in both iteration steps?

I can of course work around this by using a counter as an index, but I would like to understand what is going on.

Answer 1

item.xpath("//li/text()") searches for all li elements in the entire tree, because an expression that starts with // is evaluated from the document root, not from item. Since you want the text of the current node, you can just select its text() nodes: item.xpath("text()").

Or, even better, just get the text content:

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.text_content())
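
The two suggestions above are not quite interchangeable once an li contains nested markup. The following is a minimal sketch of the difference; the nested <b> element is a hypothetical addition for illustration and is not part of the question's HTML.

```python
from lxml import html

# Hypothetical markup: the first li contains a nested <b> element.
page = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
tree = html.fromstring(page)

first = tree.xpath("//li")[0]

# Text nodes that are direct children of the li only.
direct_text = first.xpath("text()")
# Text nodes anywhere below the li, as a list.
all_text = first.xpath(".//text()")
# All descendant text joined into a single string.
joined = first.text_content()

print(direct_text)
print(all_text)
print(joined)
```

Here text() should skip the text inside <b>, while .//text() and text_content() include it, as a list and as one string respectively.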

Answer 2

From Lxml html xpath context:

XPath expression //input will match all input elements, anywhere in your document, while .//input will match all inside current context.

The solution is to use:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath(".//text()"))  # only changed line

Adding . before // prevents matching against the entire document, and li/ needs to be removed since you are already "inside" the li element.

The output is:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
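
To see the effect of the leading dot in isolation, here is a small sketch that counts matches per list. The HTML with two lists wrapped in a div is a made-up example, not the question's input, and the variable names are illustrative.

```python
from lxml import html

# Hypothetical markup: two separate lists inside one wrapper element.
page = (
    "<div>"
    "<ul><li>one</li><li>two</li></ul>"
    "<ul><li>three</li><li>four</li></ul>"
    "</div>"
)
tree = html.fromstring(page)

absolute_counts = []
relative_counts = []
for ul in tree.xpath("//ul"):
    # "//li" starts at the document root, whatever element xpath() is called on.
    absolute_counts.append(len(ul.xpath("//li")))
    # ".//li" starts at the current element and only looks at its descendants.
    relative_counts.append(len(ul.xpath(".//li")))

print(absolute_counts)
print(relative_counts)
```

Each ul should report all four li elements for //li but only its own two for .//li, which is the same behaviour the question ran into.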
