How to get text from an HTML element using lxml.html

Question

I've been trying to get the full text inside a <div> element from the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code

for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)

returns

<Element div at 0x15480d93ac8>

But when I try to get the full text itself with div.text, it returns None.
That seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would advice on a good source for learning the basics of HTML (I'm not a savvy programmer), so I can avoid asking such easy questions in the future.

Answer 1

This is one of those strange things that happen when XPath is handled by a host language and library. When you use the XPath expression

 .//div[contains(text(), "Арбитраж")]

the search is performed according to XPath rules, which consider the target text to be contained within the target div. When you go on to the next line:

print(div.text)

you are using lxml.html, which doesn't regard the target text as part of the div's text because it is preceded by the <i> tag: in lxml, .text holds only the text before an element's first child, and text that follows a child is stored in that child's .tail. To get to it with lxml.html, you have to use:

print(div.text_content())

or with XPath only:

print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
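To see the effect without fetching the page, here is a minimal sketch using the standard library's xml.etree.ElementTree, which follows the same .text / .tail model as lxml. The markup in snippet is an assumption (I'm guessing the real div starts with an empty <i> icon), and itertext() stands in here for lxml.html's text_content().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an empty <i> icon followed by the text of interest.
snippet = '<div><i class="fa"></i>Арбитраж: 3 дела</div>'
div = ET.fromstring(snippet)

# .text is only the text between <div> and its first child; here there is none.
print(div.text)

# The text after </i> is stored as the tail of the <i> element.
print(div[0].tail)

# Joining itertext() gathers every text piece inside the div, in document order.
full_text = "".join(div.itertext())
print(full_text)
```

If the div had some text before the <i> as well, div.text would hold only that leading part, so joining all the pieces is the safer way to get the whole string.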

It seems lxml.etree and beautifulsoup use different approaches. See this interesting discussion here.

Answer 1
1 Accepted 2020-05-10 10:50:25

如何使用 lxml.html 从 HTML 元素获取文本

问题描述

1 个解决方案

解决方案1 1 已采纳 2020-05-10 10:50:25

解决方案1
1 已采纳 2020-05-10 10:50:25