How to strip (not remove) specified tags from an HTML string using Python?

What is the proper way to strip (not remove) specified tags from an HTML string using Python?

def strip_tags(html, tags=()):
    # Return the HTML string with every tag listed in `tags` stripped,
    # keeping the content that was inside those tags.
    ...

The question title says it all.

I need to write a Python function that takes an HTML string and a list of tags to strip, mimicking the Django template filter removetags, which is deprecated.

What's the simplest way of doing this?
The following approaches didn't work for me, for the reasons listed:

  • Regular expressions (for obvious reasons).

  • The clean() method of the Bleach library. Surprisingly, such a robust library is useless for this requirement, because it follows a whitelist-first approach while the problem is blacklist-first. Bleach is only useful for keeping certain tags, not for removing certain tags (unless you are ready to maintain a huge list of all possible ALLOWED_TAGS).

  • lxml.html.Cleaner() combined with remove_tags or kill_tags. This is somewhat closer to what I was looking for, but it removes more than it is supposed to, and there is no fine-grained control over its behaviour, such as asking Cleaner() to keep the evil <script> tag.

  • BeautifulSoup. It has a method called clear() to remove the specified tags, but it also removes the content of the tags, whereas I only need to strip the tags and keep their content.

Beautiful Soup has unwrap():

It replaces a tag with whatever's inside that tag.

You will have to iterate manually over all the tags you want to replace.
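The iteration described above might look like the following. This is a minimal sketch, assuming the third-party bs4 package is installed and that the built-in "html.parser" backend is acceptable; the name strip_tags comes from the question's stub, and the early return for an empty tag list is my own addition.

```python
from bs4 import BeautifulSoup


def strip_tags(html, tags=()):
    """Return html with the listed tags unwrapped (their content is kept)."""
    tags = list(tags)
    if not tags:
        # Nothing to strip; also avoids passing an empty filter to find_all().
        return html
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and matches any of them.
    for tag in soup.find_all(tags):
        # unwrap() replaces the tag with whatever is inside it.
        tag.unwrap()
    return str(soup)


print(strip_tags("<p>Hello <b>bold</b> and <i>italic</i></p>", ["b"]))
```

Note that the result is serialized by Beautiful Soup, so markup that was malformed in the input may come back normalized rather than byte-for-byte identical.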

You can extend Python's HTMLParser and create your own parser that skips the specified tags.

Using the example provided in the given link, I will modify it to strip <h1></h1> tags but keep their data:

from html.parser import HTMLParser

NOT_ALLOWED_TAGS = ['h1']

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag not in NOT_ALLOWED_TAGS:
            print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

That will print:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body 
# h1 start tag here
Encountered some data  : Parse me!
# h1 close tag here
Encountered an end tag : body
Encountered an end tag : html

You can now maintain a NOT_ALLOWED_TAGS list to use for stripping those tags.
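The parser above only prints parse events. To get an HTML string back, as the question asks, the same handlers can collect output instead of printing. The following is a rough sketch under that assumption, using only the standard library; TagStripper is a hypothetical name, and strip_tags is borrowed from the question's stub.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Re-emit the input HTML, omitting the start/end tags named in `tags`."""

    def __init__(self, tags):
        # convert_charrefs=False routes entities such as &amp; to the
        # handle_entityref/handle_charref hooks so they can be copied as-is.
        super().__init__(convert_charrefs=False)
        self.blocked = {t.lower() for t in tags}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.blocked:
            # get_starttag_text() returns the start tag as written in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no separate end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag not in self.blocked:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is always kept, even inside a stripped tag.
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_tags(html, tags=()):
    parser = TagStripper(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags('<html><head><title>Test</title></head>'
                 '<body><h1>Parse me!</h1></body></html>', ['h1']))
```

End tags are rebuilt from the lowercased tag name that HTMLParser reports, so an input like </H1> would come back as </h1> when it is not stripped.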
