Remove a tag using BeautifulSoup but keep its contents

Question

Currently I have code that does something like this:

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.extract()
soup.renderContents()

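The problem is that I don't want to throw away the contents inside the invalid tags. How do I get rid of a tag but keep its contents when calling soup.renderContents()?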

Answer 1

Current versions of the BeautifulSoup library have an undocumented method on Tag objects called replaceWithChildren(). So you could do something like this:

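from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
soup = BeautifulSoup(html)
for tag in invalid_tags:
    # Replace every matching tag with its own children
    for match in soup.findAll(tag):
        match.replaceWithChildren()
print soup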

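It looks like it behaves the way you want, and the code is fairly straightforward (it does make one pass through the DOM per tag name, but that could easily be optimized).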
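As a rough sketch of that optimization, assuming BeautifulSoup 4 (the bs4 package) rather than the BeautifulSoup 3 used above: in bs4 the method is called unwrap(), and find_all() accepts a list of tag names, so a single search is enough. The function name unwrap_tags is made up for this illustration.

```python
from bs4 import BeautifulSoup

def unwrap_tags(html, invalid_tags):
    """Drop the named tags but keep their children, using one search."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() takes a list of names and returns every match up front,
    # so unwrapping a tag inside the loop does not disturb the iteration.
    for match in soup.find_all(invalid_tags):
        match.unwrap()
    return str(soup)

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(unwrap_tags(html, ["b", "i", "u"]))  # <p>Good, bad, and ugly</p>
```

The "html.parser" backend is used here so that the fragment is not wrapped in extra html and body tags.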

Answer 2

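The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString, and so on. Try this: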

from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.

Answer 3

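Although others have already mentioned this in the comments, I thought I'd post a full answer showing how to do it with Mozilla's Bleach. Personally, I think this is much nicer than using BeautifulSoup for this.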

import bleach
html = "<b>Bad</b> <strong>Ugly</strong> <script>Evil()</script>"
clean = bleach.clean(html, tags=[], strip=True)
print clean # Should print: "Bad Ugly Evil()"

Answer 4

I have a simpler solution, but I don't know whether it has a drawback.

UPDATE: there is a drawback; see Jesse Dhillon's comment. Another option is to use Mozilla's Bleach instead of BeautifulSoup.

from BeautifulSoup import BeautifulSoup

VALID_TAGS = ['div', 'p']

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        tag.replaceWith(tag.renderContents())

print soup.renderContents()

This will also print <div><p>Hello there my friend!</p></div> as desired.

Answer 5

You can use soup.text

.text removes all tags and concatenates all the text.
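A small illustration, assuming BeautifulSoup 4 (bs4), where .text is shorthand for get_text(). Note that this strips every tag, including the ones the question wants to keep:

```python
from bs4 import BeautifulSoup

value = "<div><p>Hello <b>there</b> my friend!</p></div>"
soup = BeautifulSoup(value, "html.parser")

# get_text() walks the whole tree and joins every text node;
# the div and p tags are dropped along with the b tag.
text = soup.get_text()
print(text)  # Hello there my friend!
```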

Answer 6

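Presumably you have to move the tag's children up to become children of the tag's parent before you remove the tag. Is that what you mean?

If so, then, although inserting the contents in the right place is tricky, something like this should work: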

from BeautifulSoup import BeautifulSoup

VALID_TAGS = 'div', 'p'

value = '<div><p>Hello <b>there</b> my friend!</p></div>'

soup = BeautifulSoup(value)

for tag in soup.findAll(True):
    if tag.name not in VALID_TAGS:
        for i, x in enumerate(tag.parent.contents):
            if x == tag:
                break
        else:
            print "Can't find", tag, "in", tag.parent
            continue
        for r in reversed(tag.contents):
            tag.parent.insert(i, r)
        tag.extract()
print soup.renderContents()

With the example value, this prints <div><p>Hello there my friend!</p></div> as desired.

Answer 7

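None of the proposed answers seemed to work with my version of BeautifulSoup. Here is a version that works with BeautifulSoup 3.2.1 and that also inserts a space when joining content from different tags, instead of running the words together.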

import re
from BeautifulSoup import BeautifulSoup

def strip_tags(html, whitelist=()):
    """
    Strip all HTML tags except for a list of whitelisted tags.
    """
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name not in whitelist:
            tag.append(' ')
            tag.replaceWithChildren()

    result = unicode(soup)

    # Clean up any repeated spaces and spaces like this: '<a>test </a> '
    result = re.sub(' +', ' ', result)
    result = re.sub(r' (<[^>]*> )', r'\1', result)
    return result.strip()

Example:

strip_tags('<h2><a><span>test</span></a> testing</h2><p>again</p>', ['a'])
# result: u'<a>test</a> testing again'
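For readers on BeautifulSoup 4, the same idea might be ported roughly as follows. This is only a sketch under assumptions: it uses bs4 with the "html.parser" backend, unwrap() in place of replaceWithChildren(), and a made-up function name strip_tags_bs4.

```python
import re
from bs4 import BeautifulSoup

def strip_tags_bs4(html, whitelist=()):
    """Strip all tags except the whitelisted ones, separating text with spaces."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        if tag.name not in whitelist:
            tag.append(" ")  # keeps words from adjacent tags apart
            tag.unwrap()     # remove the tag, keep its children
    result = str(soup)
    # Collapse repeated spaces, then drop the space left before a tag
    # in cases such as '<a>test </a> '
    result = re.sub(" +", " ", result)
    result = re.sub(r" (<[^>]*> )", r"\1", result)
    return result.strip()

print(strip_tags_bs4("<h2><a><span>test</span></a> testing</h2><p>again</p>", ["a"]))
# <a>test</a> testing again
```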

Answer 8

Use unwrap.

unwrap() removes a single occurrence of the tag, the one it is called on, and keeps its contents.

Example:

>>> soup = BeautifulSoup('Hi. This is a <nobr> nobr </nobr>')
>>> soup
<html><body><p>Hi. This is a <nobr> nobr </nobr></p></body></html>
>>> soup.nobr.unwrap()
<nobr></nobr>
>>> soup
<html><body><p>Hi. This is a  nobr </p></body></html>

Answer 9

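Here is a better solution, with no hassle or boilerplate code, for filtering out tags while keeping their content. Say you want to remove any child tags inside a parent tag and keep only the content/text; then you can simply do: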

# div_tags is assumed to be a Tag already located in the soup
for p_tags in div_tags.find_all("p"):
    print(p_tags.get_text())

That's it: any br, i, or b tags inside the parent tags are gone and you get clean text.

Answer 10

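This is an old question, but I just want to point out better ways. First of all, BeautifulSoup 3 is no longer being developed, so you should use BeautifulSoup 4, known as bs4.

Also, lxml has exactly the feature you need: the Cleaner class has a remove_tags attribute, which you can set to the tags to be removed while their content is pulled up into the parent tag.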

Answer 11

Here is a Python 3 friendly version of this function:

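from bs4 import BeautifulSoup, NavigableString

invalidTags = ['br', 'b', 'font']

def stripTags(html, invalid_tags):
    # html.parser keeps fragments as-is; the "lxml" parser would wrap each
    # recursive fragment in <html><body>, which then leaks into the output
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Recurse so nested invalid tags are stripped too
                    c = stripTags(str(c), invalid_tags)
                s += str(c)
            tag.replaceWith(s)
    return soup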

11 answers

Answer 1: 67, 2011-12-09 00:47:21
Answer 2: 55, accepted, 2010-07-12 03:25:02
Answer 3: 17, 2012-10-20 15:22:36
Answer 4: 10, 2009-11-20 03:43:13
Answer 5: 8, 2013-12-23 06:08:05
Answer 6: 6, 2009-11-19 19:42:02
Answer 7: 2, 2013-04-22 10:04:54
Answer 8: 2, 2016-12-26 09:11:30
Answer 9: 1, 2016-09-25 17:13:35
Answer 10: 0, 2015-03-12 01:51:11
Answer 11: 0, 2019-06-01 14:04:25
