How to extract all text from HTML excluding CSS and JavaScript with lxml in Python?
How can I extract all text from an HTML page, excluding any CSS and JavaScript?
I am trying the following code:
import requests
from lxml import html

r = requests.get(website)
tree = html.fromstring(r.text)
html_text = tree.xpath('//text()')
But it also retrieves all the CSS and JavaScript content from the website.
You can use the drop_tree() method to remove the elements that you are not interested in.
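tree = html.fromstring(r.text)
# Select every <script> and <style> element in the document
unwanted = tree.xpath('//script|//style')
for u in unwanted:
    # drop_tree() removes the element and its children but keeps its tail text
    u.drop_tree()
html_text = tree.xpath('//text()')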
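For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same idea. The function names visible_text and visible_text_xpath and the sample markup are made up for illustration; they are not part of lxml. The second function is an alternative that was not in the answer above: it leaves the tree untouched and filters the text nodes with an XPath predicate instead of calling drop_tree().

```python
from lxml import html


def visible_text(markup):
    """Hypothetical helper: text of an HTML string without <script>/<style> content."""
    tree = html.fromstring(markup)
    # Remove script and style elements; drop_tree() keeps any tail text.
    for element in tree.xpath('//script|//style'):
        element.drop_tree()
    # Collect the remaining text nodes and discard whitespace-only ones.
    chunks = (chunk.strip() for chunk in tree.xpath('//text()'))
    return [chunk for chunk in chunks if chunk]


def visible_text_xpath(markup):
    """Hypothetical alternative: filter in XPath instead of modifying the tree."""
    tree = html.fromstring(markup)
    # Keep only text nodes that have no <script> or <style> ancestor.
    nodes = tree.xpath('//text()[not(ancestor::script or ancestor::style)]')
    chunks = (chunk.strip() for chunk in nodes)
    return [chunk for chunk in chunks if chunk]


# Made-up sample page used only for this demonstration.
sample = """
<html>
  <head>
    <title>Demo</title>
    <style>body { color: red; }</style>
  </head>
  <body>
    <p>Hello</p>
    <script>var x = 1;</script>
    <p>World</p>
  </body>
</html>
"""

if __name__ == "__main__":
    print(visible_text(sample))
    print(visible_text_xpath(sample))
```

Both versions should give the same list of strings for this sample. The drop_tree() version is probably the better fit if you go on to process the cleaned tree; the XPath version may be handier if you still need the original tree afterwards.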