Python: remove HTML tags, including HTML entities, but not plain text prefixed with '&'

Question

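I want to remove HTML tags, including HTML entities such as &amp;, in Python 2.7. However, my input also contains plain text that starts with the character &, and I don't want that text removed. I'm trying the top-voted answer from this post: Strip HTML from strings in Python. The only difference is that I replace HTML tags with a space.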

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ' '.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('html tags<p>will be&amp;replaced</p>with space. NOT this &abc')
# Now the output is:  "html tags will be replaced with space. NOT this  "
# The wanted output is:  "html tags will be replaced with space. NOT this &abc"

How can I get the correct output?

Answer 1

You could try BeautifulSoup:

>>> html = '<div><p>&abc is <b>my</b> input text</p></div>'
>>> print strip_tags(html)
 is  my  input text

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> print soup.text
&abc is my input text
>>> soup = BeautifulSoup('=&abc= is my input text')
>>> soup.text
u'=&abc= is my input text'

Note that your strip_tags() does not properly strip the nested <b> tag that I added to the test string.

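If you want to stay with the standard HTMLParser, you might do better with another answer to the question you linked to. For my test string it outputs "&abc; is my input text", i.e. the stand-alone &abc comes out with a trailing semicolon added. I'm not sure which output you want.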

Update

This works:

import re
from HTMLParser import HTMLParser
from htmlentitydefs import entitydefs

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
        self.entityref = re.compile('&[a-zA-Z][-.a-zA-Z0-9]*[^a-zA-Z0-9]')

    def handle_data(self, d):
        self.fed.append(d)

    def handle_starttag(self, tag, attrs):
        self.fed.append(' ')

    def handle_endtag(self, tag):
        self.fed.append(' ')

    def handle_entityref(self, name):
        if entitydefs.get(name) is None:
            m = self.entityref.match(self.rawdata.splitlines()[self.lineno-1][self.offset:])
            entity = m.group()
            # semicolon is consumed, other chars are not.
            if entity[-1] != ';':
                entity = entity[:-1]
            self.fed.append(entity)
        else:
            self.fed.append(' ')

    def get_data(self):
        self.close()    # N.B. ensure all buffered data has been processed
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

print strip_tags('html &zzz; tags<p>&zzz &zz: will be&amp;replaced</p>with space. NOT this &abc')

Output

html &zzz; tags &zzz &zz: will be replaced with space. NOT this &abc

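This code adds handlers for start and end tags, each of which is replaced with a single space. Entity references are handled too: known valid references are replaced with a space, and unknown references are left unchanged.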
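The entity rule on its own can be sketched without the parser. This is only an illustration, not the answer's method: it assumes Python 3 (where htmlentitydefs lives in html.entities), swaps the parser callback for a plain regex substitution, and the names ENTITY_RE and blank_known_entities are made up for this example. It does not touch tags or numeric references such as &#39;.

```python
import re
from html.entities import name2codepoint  # Python 3 counterpart of htmlentitydefs

# Same character classes as the answer's pattern, but the entity name is
# captured and the terminating semicolon is optional.
ENTITY_RE = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*);?')


def blank_known_entities(text):
    """Replace known named entities with a space; leave everything else as-is."""
    def replace(match):
        name = match.group(1)
        # Known entity (e.g. amp, lt, nbsp): blank it out.
        # Unknown name (e.g. abc): keep the original text untouched.
        return ' ' if name in name2codepoint else match.group(0)
    return ENTITY_RE.sub(replace, text)


print(blank_known_entities('will be&amp;replaced. NOT this &abc, &zzz; or &zz:'))
```

Because the name match is greedy, a word such as &amplifier is treated as the unknown name "amplifier" and kept, which matches the intent of preserving plain text that merely starts with &.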

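Another important point is to call close() on the parser before calling get_data(). I put the call inside the get_data() method, although you could add it to the strip_tags() function instead. I don't think it matters if close() is called more than once, so you can call get_data() and then feed more data to the parser.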
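A minimal sketch of why close() matters, assuming Python 3's html.parser rather than the Python 2.7 module used above (so the details may differ from the answer's environment). The class name TextCollector is hypothetical. Text that ends in something resembling an unfinished entity may be held back by feed() until close() signals that no more input is coming.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        self.collected = []

    def handle_data(self, data):
        self.collected.append(data)


parser = TextCollector()
parser.feed('keep this &abc')
# The parser cannot yet tell whether '&abc' is the start of a longer
# reference, so some or all of the text may still be buffered here.
print(repr(''.join(parser.collected)))

parser.close()  # forces any buffered input through handle_data
text_after_close = ''.join(parser.collected)
print(repr(text_after_close))
```

Note that Python 3's parser decodes character references by default instead of reporting them through handle_entityref, so this sketch shows only the buffering behaviour, not the entity handling.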
