Remove a tag using BeautifulSoup but keep its contents

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

Except I don't want to throw away the contents inside the invalid tag. How do I get rid of the tag but keep its contents when calling soup.renderContents()?

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So, you could do something like this:

from BeautifulSoup import BeautifulSoup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags:
    for match in soup.findAll(tag):
        # Replace the tag with its own children, keeping the contents
        match.replaceWithChildren()
print soup

It behaves the way you want and the code is fairly straightforward. It does make one pass through the DOM per invalid tag, but that could easily be optimized.
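For readers on BeautifulSoup 4 and Python 3, the same idea can be sketched with unwrap(), which is the bs4 name for replaceWithChildren(). This is only a minimal sketch, not the code from the answer above: the helper name unwrap_tags and the choice of the built-in "html.parser" backend are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def unwrap_tags(html, invalid_tags):
    """Remove the named tags but keep their children in place.

    unwrap_tags is a hypothetical helper name used for this sketch.
    """
    # "html.parser" is assumed here so that no <html>/<body> wrapper is added.
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list to find_all() matches any of the names in a single pass.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(unwrap_tags(html, ["b", "i", "u"]))
# prints: <p>Good, bad, and ugly</p>
```

Because find_all() accepts a list of names, this version walks the tree once instead of once per tag.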

The strategy I used is to replace each invalid tag with its contents: children of type NavigableString are kept as they are, and any other child is processed recursively until only NavigableString content remains. Try this:

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Although this has already been mentioned by other people in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is a lot nicer than using BeautifulSoup for this.

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

I have a simpler solution but I don't know if there's a drawback to it.

UPDATE: there's a drawback, see Jesse Dhillon's comment. Another solution is to use Mozilla's Bleach instead of BeautifulSoup.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.

You can use soup.text.

.text removes all tags and concatenates all the text. Note that this drops every tag, including the valid ones.
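As a rough illustration of what text extraction does, here is a short sketch assuming BeautifulSoup 4 (bs4) on Python 3 with the built-in "html.parser" backend. get_text() is the method form of .text; the sample strings are made up for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>Hello <b>there</b> my friend!</p></div>", "html.parser"
)

# Every tag is dropped, the valid <div> and <p> included.
plain = soup.get_text()
print(plain)  # prints: Hello there my friend!

# A separator keeps text from neighbouring tags from running together.
soup2 = BeautifulSoup("<h2>test</h2><p>again</p>", "html.parser")
spaced = soup2.get_text(" ", strip=True)
print(spaced)  # prints: test again
```

This is only useful when no markup at all should survive; to keep some tags, one of the tag-by-tag approaches is needed.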

You'll presumably have to move the tag's children to be children of the tag's parent before you remove the tag. Is that what you mean?

If so, then, while inserting the contents in the right place is tricky, something like this should work:

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
            if x == tag:
                break
        else:
            print "Can't find", tag, "in", tag.parent
            continue
        for r in reversed(tag.contents):
            tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.
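In BeautifulSoup 4, the move-the-children-then-extract loop above is what Tag.unwrap() does for you. The following is a hedged sketch of the whitelist variant on Python 3; the function name keep_only_valid and the "html.parser" backend are assumptions for illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup

VALID_TAGS = {"div", "p"}


def keep_only_valid(html, valid_tags=VALID_TAGS):
    """Unwrap every tag whose name is not whitelisted, keeping its contents.

    keep_only_valid is a hypothetical helper name used for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of all tags, so modifying the tree
    # while looping over that list is safe.
    for tag in soup.find_all(True):
        if tag.name not in valid_tags:
            tag.unwrap()
    return str(soup)


value = "<div><p>Hello <b>there</b> my friend!</p></div>"
print(keep_only_valid(value))
# prints: <div><p>Hello there my friend!</p></div>
```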

None of the proposed answers seemed to work with BeautifulSoup for me. Here's a version that works with BeautifulSoup 3.2.1, and also inserts a space when joining content from different tags instead of concatenating words.

import re

from BeautifulSoup import BeautifulSoup

def strip_tags(html, whitelist=()):
    """
    Strip all HTML tags except for a list of whitelisted tags.
    """
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name not in whitelist:
            tag.append(' ')
            tag.replaceWithChildren()

    result = unicode(soup)

    # Clean up any repeated spaces and spaces like this: '<a>test </a> '
    result = re.sub(' +', ' ', result)
    result = re.sub(r' (<[^>]*> )', r'\1', result)
    return result.strip()

Example:

strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
# result: u'<a>test</a> testing again'

Use unwrap().

unwrap() removes a single tag (soup.nobr is the first nobr tag in the document) and keeps its contents in place. It returns the tag that was removed.

Example:

>>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>>> soup.nobr.unwrap()
<nobr></nobr>
>>> soup
<html><body><p>Hi. This is a  nobr </p></body></html>

Here is a simpler solution, without boilerplate code, for filtering out the tags while keeping the content. Say you want to remove any child tags within a parent tag and keep only the contents/text; you can simply do:

for p_tags in div_tags.find_all("p"):
    print(p_tags.get_text())

That's it: any br, i, or b tags inside the parent tags are dropped and you get the clean text.

This is an old question, but there are now better ways to do it. First of all, BeautifulSoup 3.x is no longer being developed, so you should use BeautifulSoup 4.x instead, known as bs4.

Also, lxml has just the function you need: the Cleaner class has a remove_tags attribute, which you can set to the tags that should be removed while their content is pulled up into the parent tag.

Here is a Python 3 friendly version of this function:

from bs4 import BeautifulSoup, NavigableString
invalidTags = ['br','b','font']
def stripTags(html, invalid_tags):
    soup = BeautifulSoup(html, "lxml")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replaceWith(s)
    return soup
