
lxml HtmlElement xpath parses more than it should be able to

I am trying to parse HTML, but I fail to loop through the li elements one at a time:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath("//li/text()"))

I expect this output:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']

but I get this:

b'<li>one</li>'
['one', 'two']
b'<li>two</li>'
['one', 'two']

How is it possible that xpath can get both li elements' text from item in both iteration steps?

I can solve this using a counter as an index of course, but I would like to understand what's going on.

item.xpath("//li/text()") searches for all li elements in the entire tree, because a path that starts with // is evaluated from the document root no matter which element it is called on. Since you want the text of the current node, you can just select its text(): item.xpath("text()").

Or, even better, just get the text content:

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.text_content())
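
The three ways of reading an item's text mentioned in this thread only differ once the li contains nested markup. The sketch below uses a made-up page with a b tag inside the first li (the markup is an assumption for illustration, not part of the question) to show what each call returns:

```python
from lxml import html

# Hypothetical markup with a nested <b> tag, not taken from the question.
page = "<ul><li>one <b>big</b> step</li><li>two</li></ul>"
tree = html.fromstring(page)

first = tree.xpath("//li")[0]

direct_text = first.xpath("text()")    # text nodes that are direct children of <li>
all_text = first.xpath(".//text()")    # every text node at or below <li>
joined = first.text_content()          # all descendant text joined into one string

print(direct_text)  # ['one ', ' step']
print(all_text)     # ['one ', 'big', ' step']
print(joined)       # one big step
```

Which one is preferable probably depends on whether you want a list of separate text nodes or a single string.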

From "Lxml html xpath context":

XPath expression //input will match all input elements, anywhere in your document, while .//input will match all inside current context.
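
As a rough illustration of the quoted rule, the sketch below evaluates both expressions from the same context element. The page with two forms and the input names are invented for this example:

```python
from lxml import html

# Hypothetical page with two forms, used only for illustration.
page = (
    "<div>"
    '<form id="a"><input name="x"><input name="y"></form>'
    '<form id="b"><input name="z"></form>'
    "</div>"
)
tree = html.fromstring(page)
form_b = tree.xpath("//form[@id='b']")[0]

# "//input" starts at the document root even though it is called on form_b.
absolute = [i.get("name") for i in form_b.xpath("//input")]
# ".//input" starts at form_b and only looks at its descendants.
relative = [i.get("name") for i in form_b.xpath(".//input")]

print(absolute)  # ['x', 'y', 'z']
print(relative)  # ['z']
```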

The solution is to use:

from lxml import html

page="<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)

for item in tree.xpath("//li"):
  print(html.tostring(item))
  print(item.xpath(".//text()")) #only changed line

Adding . before // prevents matching the entire document, and li/ needs to be removed since you are already "inside" the li element.

The output is:

b'<li>one</li>'
['one']
b'<li>two</li>'
['two']
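
To see why the li/ step has to go, here is a small check on the same page as in the question. It is only a sketch; the variable names are chosen for this example:

```python
from lxml import html

page = "<ul><li>one</li><li>two</li></ul>"
tree = html.fromstring(page)  # for this input, the returned element is the <ul>
first = tree.xpath("//li")[0]

# Relative to an <li>, there is no further <li> below it, so nothing matches.
inside_li = first.xpath(".//li/text()")
# Relative to the <ul>, both <li> elements are descendants.
inside_ul = tree.xpath(".//li/text()")

print(inside_li)  # []
print(inside_ul)  # ['one', 'two']
```

Keeping li/ in the relative path would therefore return an empty list for every item rather than the item's own text.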
