Scrapy. How to remove style attribute (class or id attribute) from tags in parsed HTML

With Scrapy, how can I remove style attributes (class or id attributes) from tags in parsed HTML with the help of lxml? Something like lxml.html.clean.Cleaner, or something like:

for tag in html.xpath('//*[@class]'):
    tag.attrib.pop('class')

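You have to import lxml's cleaner module in the spider file. It is part of lxml rather than the Python standard library, and recent lxml releases ship it separately as the lxml_html_clean package.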

import lxml.html.clean as clean
safe_attrs = set(['src', 'alt', 'href', 'title', 'width', 'height'])
kill_tags = ['object', 'iframe']
cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=safe_attrs, kill_tags=kill_tags)
html_string = "some html string with iframes, objects…"

and then use it like this:

cleaned_html = cleaner.clean_html(html_string)

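You can customize safe_attrs and kill_tags. With safe_attrs_only=True, safe_attrs is the whitelist of attributes to keep, so every attribute not listed (including class, id and style) is removed. kill_tags lists the tags to remove together with their content.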
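If you only want to drop a few named attributes and leave everything else untouched, the loop from the question can be generalized instead of using a whitelist. The following is a minimal sketch, assuming lxml is installed; the helper name strip_attributes and the sample markup are hypothetical and only serve as illustration.

```python
import lxml.html


def strip_attributes(html_string, names=("class", "id", "style")):
    """Return html_string with the named attributes removed from every element.

    Hypothetical helper for illustration; it is not part of lxml or Scrapy.
    """
    root = lxml.html.fromstring(html_string)
    # Match any element that carries at least one of the unwanted attributes.
    xpath = "//*[" + " or ".join("@" + name for name in names) + "]"
    for element in root.xpath(xpath):
        for name in names:
            # pop() with a default avoids a KeyError when the attribute is absent.
            element.attrib.pop(name, None)
    return lxml.html.tostring(root, encoding="unicode")


# Made-up sample markup, not taken from a real page.
sample = (
    '<div class="box" id="main">'
    '<p style="color:red">Hello <a href="/x" class="link">link</a></p>'
    '</div>'
)
cleaned = strip_attributes(sample)
print(cleaned)
```

Unlike the Cleaner approach, this removes only the attributes you name, so attributes such as href or src survive without having to be listed.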
