Scrapy: How to remove the style attribute (class or id attribute) from tags in parsed HTML
How can I remove style attributes (class or id attributes) from tags in parsed HTML with the help of lxml? Something like lxml.html.clean.Cleaner, or something like:
for tag in html.xpath('//*[@class]'):
    tag.attrib.pop('class')
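For reference, the loop above can be turned into a self-contained sketch. It assumes lxml is installed; the helper name strip_attributes, its default attribute list, and the sample markup are hypothetical and only serve as illustration.

```python
import lxml.html


def strip_attributes(html_string, names=("class", "id", "style")):
    """Return html_string with the given attributes removed from every tag.

    Hypothetical helper: the name and the default attribute list are
    illustrative, not part of lxml.
    """
    # Parse the markup; a fragment with one top-level element yields that element.
    root = lxml.html.fromstring(html_string)
    for name in names:
        # Select every element in the tree that carries this attribute.
        for tag in root.xpath("//*[@{}]".format(name)):
            tag.attrib.pop(name)
    # Serialize back to a text string instead of bytes.
    return lxml.html.tostring(root, encoding="unicode")


sample = '<div class="box" id="main"><p class="note" title="greeting">Hello</p></div>'
stripped = strip_attributes(sample)
print(stripped)
```

Unlike a whitelist-based cleaner, this approach removes only the attributes that are named and leaves everything else in the tree untouched.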
You have to import another class in the spider file. Note that lxml.html.clean comes with the third-party lxml package; it is not a Python builtin:
import lxml.html.clean as clean
safe_attrs = set(['src', 'alt', 'href', 'title', 'width', 'height'])
kill_tags = ['object', 'iframe']
cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=safe_attrs, kill_tags=kill_tags)
html_string = "some html string with iframes, objects…"
and then use it like this:
cleaned_html = cleaner.clean_html(html_string)
You can customize safe_attrs (the attributes that are kept; every other attribute, including class, id and style, is stripped) and kill_tags (the tags that are removed together with their content) to suit your needs.
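As a rough end-to-end sketch of the answer, the snippet below runs the cleaner on a small sample. The sample markup is hypothetical. It also assumes that the cleaner is importable: to my knowledge, recent lxml releases moved it into the separate lxml_html_clean package, so the import tries that first and falls back to lxml.html.clean.

```python
# Assumption: newer lxml versions provide the cleaner through the separate
# lxml_html_clean package; older ones expose it as lxml.html.clean.
try:
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

# Attributes to keep; anything not listed here is dropped from every tag.
safe_attrs = {"src", "alt", "href", "title", "width", "height"}
# Tags to remove together with everything inside them.
kill_tags = ["object", "iframe"]

cleaner = Cleaner(safe_attrs_only=True, safe_attrs=safe_attrs, kill_tags=kill_tags)

# Hypothetical sample input, used only for illustration.
sample = (
    '<div class="box" id="main">'
    '<p style="color:red" title="greeting">Hello</p>'
    '<iframe src="http://example.com/"></iframe>'
    '</div>'
)

cleaned_html = cleaner.clean_html(sample)
print(cleaned_html)
```

Because safe_attrs is a whitelist, class, id and style disappear without being named explicitly, while title survives. Keep in mind that Cleaner also applies its other defaults (for example removing scripts and comments), so it does more than strip attributes.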