How to extract all text from html excluding css and javascript with lxml in Python?

How can I extract all text from an HTML page, excluding any CSS and JavaScript?

I am trying the following code:

import requests
from lxml import html

r = requests.get(website)
tree = html.fromstring(r.text)
html_text = tree.xpath('//text()')

But it also retrieves all the CSS and JavaScript content from the website.

You can use the drop_tree() method to remove the elements you are not interested in before collecting the text.

tree = html.fromstring(r.text)

# Remove every <script> and <style> element before collecting text
unwanted = tree.xpath('//script|//style')
for u in unwanted:
    u.drop_tree()

html_text = tree.xpath('//text()')
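
For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch of the same approach. The SAMPLE markup and the extract_visible_text helper are made up for illustration and are not part of lxml. The whitespace filtering at the end is an extra step I added, not something the approach requires.

```python
from lxml import html

# Hypothetical sample page, used only for illustration.
SAMPLE = """
<html>
  <head>
    <style>body { color: red; }</style>
    <script>var x = 1;</script>
  </head>
  <body>
    <p>Hello <script>alert("hi");</script>world</p>
  </body>
</html>
"""


def extract_visible_text(source):
    """Return the text nodes of `source`, skipping <script> and <style>."""
    tree = html.fromstring(source)
    # drop_tree() removes the element and its children but keeps the
    # element's tail text, so text that follows an inline <script>
    # is not lost.
    for element in tree.xpath('//script|//style'):
        element.drop_tree()
    # Added step: discard whitespace-only nodes left over from indentation.
    return [t.strip() for t in tree.xpath('//text()') if t.strip()]


texts = extract_visible_text(SAMPLE)
print(texts)
```

Note that getparent().remove() would also discard the tail text of the removed element, which is why drop_tree() is the safer choice when scripts sit in the middle of a paragraph.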
