Remove contents of <style>…</style> tags using html5lib or bleach
I've been using the excellent bleach library for removing bad HTML.
I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:
<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>
Using bleach (with the style tag implicitly disallowed) leaves me with:
st1:*{behavior:url(#ieooui) }
Which isn't helpful. Bleach seems only to have options to escape the tags or to remove the tags but keep their contents.
I'm looking for a third option: remove the tags and their contents.
Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.
I was able to strip the contents of tags using a filter based on this approach: https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters . It does leave an empty <style></style> in the output, but that's harmless.
from bleach.sanitizer import Cleaner
from bleach.html5lib_shim import Filter


class StyleTagFilter(Filter):
    """
    https://bleach.readthedocs.io/en/latest/clean.html?highlight=strip#html5lib-filters-filters
    """

    def __iter__(self):
        in_style_tag = False
        for token in Filter.__iter__(self):
            if token["type"] == "StartTag" and token["name"] == "style":
                in_style_tag = True
            elif token["type"] == "EndTag":
                in_style_tag = False
            elif in_style_tag:
                # If we are in a style tag, strip the contents
                token["data"] = ""
            yield token


# You must include "style" in the tags list
cleaner = Cleaner(tags=["div", "style"], strip=True, filters=[StyleTagFilter])
cleaned = cleaner.clean("<div><style>.some_style { font-weight: bold; }</style>Some text</div>")
assert cleaned == "<div><style></style>Some text</div>"
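
If the empty <style></style> pair bothers you, the same idea can probably be pushed one step further by dropping the style tokens themselves instead of blanking their data. Below is a minimal sketch of just the token logic. The helper name drop_style_elements is my own (hypothetical, not part of bleach or html5lib), and it is exercised here on a hand-written list of token dicts shaped like the ones the filter above reads ("type", "name", "data"), so it runs without bleach installed. I have not checked it against any particular bleach version.

```python
def drop_style_elements(tokens):
    """Yield tokens, skipping each <style> element together with its contents.

    Hypothetical helper: it only assumes tokens are dicts with a "type" key
    and, for tags, a "name" key, like the ones the filter above inspects.
    """
    in_style = False
    for token in tokens:
        is_style = token.get("name") == "style"
        if token["type"] == "StartTag" and is_style:
            in_style = True   # swallow the opening tag
        elif token["type"] == "EndTag" and is_style:
            in_style = False  # swallow the closing tag
        elif not in_style:
            yield token       # everything outside <style> passes through


# Hand-written tokens standing in for the stream a filter would see for
# <div><style>.a { color: red }</style>Some text</div>
tokens = [
    {"type": "StartTag", "name": "div", "data": {}},
    {"type": "StartTag", "name": "style", "data": {}},
    {"type": "Characters", "data": ".a { color: red }"},
    {"type": "EndTag", "name": "style"},
    {"type": "Characters", "data": "Some text"},
    {"type": "EndTag", "name": "div"},
]
kept = list(drop_style_elements(tokens))
```

To try it inside the filter above, __iter__ could return drop_style_elements(Filter.__iter__(self)). As with the original filter, "style" would presumably still need to be in the tags list, so that the tags are still in the stream when the filter runs.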