Scrapy: How to remove style attributes (class or id attributes) from tags in parsed HTML

Question

Scrapy. How can I remove style attributes (class or id attributes) from tags in parsed HTML with the help of lxml? Something like lxml.html.clean.Cleaner, or something like:

for tag in html.xpath('//*[@class]'):
    tag.attrib.pop('class')

Answer 1

You need to import lxml's cleaner module in the spider file. lxml is a third-party library rather than a Python builtin, but Scrapy already depends on it:

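import lxml.html.clean as clean

# Attributes to keep; with safe_attrs_only=True all others (class, id, style, ...) are stripped
safe_attrs = {'src', 'alt', 'href', 'title', 'width', 'height'}
# Tags that are removed together with their content
kill_tags = ['object', 'iframe']
cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=safe_attrs, kill_tags=kill_tags)
html_string = "some html string with iframes, objects…"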

and then use it like this:

cleaned_html = cleaner.clean_html(html_string)

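You can customize safe_attrs and kill_tags as needed. safe_attrs is the set of attributes to keep: with safe_attrs_only=True, every attribute not listed there (including class, id, and style) is removed. kill_tags lists the tags to remove together with their content.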
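If only a few attributes should go and everything else should stay, the loop from the question can be made self-contained without the Cleaner. The following is a minimal sketch, not part of the original answer: the function name strip_attributes, its attrs parameter, and the sample markup are made up for illustration, and it assumes the input is an HTML string (for example response.text in a spider). It may also be convenient on recent lxml releases, where lxml.html.clean is shipped as the separate lxml_html_clean package.

```python
import lxml.html


def strip_attributes(html_string, attrs=("class", "id", "style")):
    """Return html_string with the given attributes removed from every element.

    Hypothetical helper for illustration; `attrs` names the attributes to drop.
    """
    root = lxml.html.fromstring(html_string)
    for name in attrs:
        # descendant-or-self::* matches the root element and all elements below it
        for element in root.xpath("descendant-or-self::*[@%s]" % name):
            del element.attrib[name]
    # encoding="unicode" makes tostring() return str instead of bytes
    return lxml.html.tostring(root, encoding="unicode")


sample = '<div class="a" id="b"><p class="c">Hi <a href="/x" class="link">there</a></p></div>'
cleaned = strip_attributes(sample)
print(cleaned)
```

Unlike safe_attrs, which is an allow-list, this sketch works as a deny-list: attributes such as href stay untouched unless they are named in attrs.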

Answer 1 (2018-10-14 04:18:11)
