How can I use Bleach to completely remove links from HTML?

Question

Bleach strips non-whitelisted tags from HTML but leaves their child nodes in place, e.g.

>>> import bleach
>>> bleach.clean('<a href="">stays</a>', strip=True, tags=[])
'stays'
>>>

How can the entire element, along with its children, be removed?

Answer 1

You should use lxml. Bleach is only meant for cleaning data and ensuring that the markup you store is safe.

You can use lxml to parse structured data such as HTML or XML.

Consider a simple HTML file, saved as hello_world.html:

<html>
<body>
<p>Hello, World!</p>
</body>
</html>

from lxml import html

root = html.parse("hello_world.html").getroot()

print(html.tostring(root))

# <html><body><p>Hello, World!</p></body></html>

p = root.find("body/p")

p.drop_tree()

print(html.tostring(root))

# <html><body></body></html>
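
Applied to the original question, the same drop_tree() call can remove every link together with its children. The following is a minimal sketch rather than a definitive implementation: the helper name remove_links and the <div> wrapper are assumptions made for illustration, and the input is assumed to be an HTML fragment held in a string rather than a file.

```python
from lxml import html


def remove_links(fragment: str) -> str:
    """Return `fragment` with every <a> element and its children removed.

    `remove_links` is a hypothetical helper name, not part of lxml or Bleach.
    """
    # Wrap the fragment in a <div> so there is always a single root element.
    root = html.fragment_fromstring(fragment, create_parent="div")

    # Collect the links first so the tree is not modified while iterating.
    for link in list(root.iter("a")):
        # drop_tree() removes the element, its text, and its children;
        # text that followed the element (its tail) is kept.
        link.drop_tree()

    # Serialize the wrapper's contents without the wrapper itself.
    return (root.text or "") + "".join(
        html.tostring(child, encoding="unicode") for child in root
    )


if __name__ == "__main__":
    print(remove_links('<a href="">gone</a>stays'))
```

Note that drop_tree() only removes elements; it does not sanitize anything. If the markup comes from an untrusted source, the result would presumably still need to go through bleach.clean() afterwards.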

On a related note, if you want to look into more advanced parsing with lxml, one of my oldest questions on here was about getting Python to parse XML and write Python code from it: "Writing a Python tool to convert XML to Python?"

Solution 1
0 2020-09-01 21:26:44
