
How to strip (not remove) specified tags from an HTML string using Python?

What is the proper way to strip (not remove) specified tags from an HTML string using Python?

def strip_tags(html, tags=()):
    # return the html string with the listed tags stripped but their content kept
    ...

The question explains it all.

I need to write a Python function that takes an HTML string and a list of tags to strip, mimicking the removetags filter from Django templates, which is deprecated.

What's the simplest way of doing this?
The following approaches didn't work for me for the listed reasons:

  • Using regular expressions (for obvious reasons).

  • The clean() method of the Bleach library. Surprisingly, such a robust library is useless for this requirement: it follows a whitelist-first approach, while this problem is blacklist-first. Bleach is useful for keeping certain tags, but not for stripping certain tags (unless you are ready to maintain a huge list of all possible ALLOWED_TAGS).

  • lxml.html.Cleaner() combined with remove_tags or kill_tags. This is closer to what I was looking for, but it removes more than it is asked to, and there is no fine-grained control over its behaviour, such as asking the Cleaner() to keep the evil <script> tag.

  • BeautifulSoup. It has a method called clear(), but that removes the content of the tags, while I only need to strip the tags and keep their content.

Beautiful Soup has unwrap():

It replaces a tag with whatever's inside that tag.

You will have to iterate manually over all the tags you want to strip and call unwrap() on each one.
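The iteration described above could be sketched roughly as follows. This is only a minimal sketch: it assumes the third-party beautifulsoup4 package is installed, it uses the built-in "html.parser" backend, and the strip_tags name and signature are taken from the question rather than from any library.

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags=()):
    """Return html with every tag named in `tags` unwrapped (content kept)."""
    soup = BeautifulSoup(html, "html.parser")
    # Look up one tag name at a time so that an empty `tags` sequence
    # strips nothing at all.
    for name in tags:
        for element in soup.find_all(name):
            # unwrap() replaces the element with its own children.
            element.unwrap()
    return str(soup)


print(strip_tags("<p>Hello <b>bold</b> and <i>italic</i></p>", ["b"]))
# prints: <p>Hello bold and <i>italic</i></p>
```

Note that Beautiful Soup re-serializes the whole document, so markup it considers malformed may come back slightly normalized.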

You can extend Python's HTMLParser and create your own parser to skip specified tags.

Starting from the example in the linked documentation, I modified it to skip <h1></h1> tags but keep their data:

from html.parser import HTMLParser

NOT_ALLOWED_TAGS = ['h1']

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

That will print:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
# h1 start tag skipped here
Encountered some data  : Parse me!
# h1 end tag skipped here
Encountered an end tag : body
Encountered an end tag : html

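You can now maintain the NOT_ALLOWED_TAGS list to control which tags are stripped. To get an HTML string back, the handlers would need to collect their output instead of printing it.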
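To return a string instead of printing events, the same idea might be extended so that each handler appends to a buffer. The following is a rough sketch using only the standard library. TagStripper and strip_tags are illustrative names, not part of html.parser, and the output is an approximation of the input (end tags are re-emitted in lowercase, and unterminated entities gain a semicolon).

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Rebuilds the document while skipping the start/end tags in `tags`."""

    def __init__(self, tags):
        # Keep entities such as &amp; unconverted so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.blocked = {t.lower() for t in tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.blocked:
            # get_starttag_text() returns the tag exactly as written.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, unchanged.
        if tag not in self.blocked:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.blocked:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.parts.append(f"<!{decl}>")


def strip_tags(html, tags=()):
    parser = TagStripper(tags)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


print(strip_tags("<body><h1>Parse me!</h1></body>", ["h1"]))
# prints: <body>Parse me!</body>
```

Because the start tags are copied verbatim from the input, attributes and their quoting are preserved for the tags that are kept.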
