
How to remove links from HTML completely with Bleach?

Bleach strips non-whitelisted tags from HTML but leaves their child nodes in place, e.g.:

>>> import bleach
>>> bleach.clean('<a href="">stays</a>', strip=True, tags=[])
'stays'

How can the entire element along with its children be removed?

You should use lxml. Bleach is only for cleaning data and ensuring the security/safety of the markup you store.

You can use lxml to parse structured data like HTML or XML.

Consider a simple HTML file:

<html>
<body>
<p>Hello, World!</p>
</body>
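</html>

Saved as hello_world.html, it can be parsed and the <p> element removed:

from lxml import html

# Parse the file and get the root <html> element
root = html.parse("hello_world.html").getroot()

# Serialized tree before removal; the <p> element is still present
print(html.tostring(root))

# Find the <p> element under <body> and remove it with its children and text
p = root.find("body/p")
p.drop_tree()

# Serialized tree after removal; <body> no longer contains the <p> element
print(html.tostring(root))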
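
The same drop_tree() call can be applied to the links from the question. What follows is only a minimal sketch, assuming the input is an HTML fragment held in a string; remove_links is a hypothetical helper name, not part of lxml or Bleach, and the wrapping <div> is added here only so that the fragment has a single root:

```python
from lxml import html


def remove_links(markup: str) -> str:
    """Return markup with every <a> element and its contents removed.

    Hypothetical helper for illustration; the result is wrapped in a <div>.
    """
    # create_parent="div" wraps the fragment so it always has one root element
    root = html.fragment_fromstring(markup, create_parent="div")

    # Collect the links first, because the tree is modified while we loop
    for link in list(root.iter("a")):
        # drop_tree() removes the element, its children and its text;
        # the tail text after the element is kept and joined to the parent
        link.drop_tree()

    return html.tostring(root, encoding="unicode")


cleaned = remove_links('<p>Keep <a href="">gone</a> this</p>')
print(cleaned)
```

The output of remove_links could then be passed to bleach.clean if the remaining markup still needs sanitizing. Note the difference from drop_tag(), which removes only the tag and keeps the children, much like Bleach with strip=True.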

On a related note, if you want to look into more advanced parsing with lxml, one of my oldest questions on here was about getting Python to parse XML and write Python code from it: Writing a Python tool to convert XML to Python?

