How to get text from an HTML element using lxml.html
I've been trying to get the full text inside a <div> element on the web page https://www.list-org.com/company/11665809.
The element should contain the substring "Арбитраж".
And it does, because my code
for div in tree.xpath('.//div[contains(text(), "Арбитраж")]'):
    print(div)
returns
<Element div at 0x15480d93ac8>
But when I try to get the text itself with div.text, it returns None.
That seems like a strange result to me. What should I do?
Any help would be greatly appreciated, as would advice on a good source for learning the basics of HTML (I'm not a savvy programmer), so I can avoid asking such easy questions in the future.
This is one of those strange things that happen when XPath is handled by a host language and library. When you use the XPath expression
.//div[contains(text(), "Арбитраж")]
the search is performed according to XPath rules, which consider the target text to be contained within the target div. When you go on to the next line:
print(div.text)
you are using lxml.html, which doesn't regard the target text as part of the div's text, because it's preceded by the <i> tag. In lxml, an element's .text holds only the text that comes before its first child; text that follows a child element is stored in that child's .tail. To get to it with lxml.html, you have to use:
print(div.text_content())
or, with XPath only:
print(tree.xpath('.//div[contains(text(), "Арбитраж")]/text()')[0])
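To make the difference concrete, here is a minimal sketch. The HTML fragment is hypothetical: it only assumes, as described above, that the div begins with an <i> child and that the wanted text follows it. The actual markup of the page may differ, and the variable names are illustrative only.

```python
from lxml import html

# Hypothetical markup (an assumption, not the real page): the div opens
# with an <i> child, and the text we want comes after that child.
page = '<html><body><div><i>Card</i> Арбитраж: 3 cases</div></body></html>'
tree = html.fromstring(page)

# Same XPath as in the question: it matches because the div's first
# text node contains the substring.
div = tree.xpath('.//div[contains(text(), "Арбитраж")]')[0]
first_child = div[0]  # the <i> element

print(div.text)            # None: no text precedes the <i> child
print(first_child.text)    # the text inside <i>
print(first_child.tail)    # the text after </i>, still inside the div
print(div.text_content())  # all text in the div, descendants included

# XPath equivalents: text() selects the div's own text nodes,
# string(.) concatenates the text of every descendant.
own_text = div.xpath('./text()')
full_text = div.xpath('string(.)')
print(own_text)
print(full_text)
```

Use text_content() or string(.) when the text inside child elements matters too; use ./text() or the child's .tail when only the div's own text is wanted.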
It seems lxml.etree and BeautifulSoup use different approaches. See this interesting discussion here.