Remove all HTML tags except one with BeautifulSoup

Question

I need to extract all the text and the <a> tags from a page, but I don't know how to do it. This is what I have so far:

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

# single quotes outside, so the double quotes around the href need no escaping
testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'
cleaned = cleanMe(testhtml)
print(cleaned)

Output:

THIS IS AN EXAMPLE I need this text with this link captured.

Desired output:

THIS IS AN EXAMPLE I need this text with this <a href="http://example.com/">link</a> captured.

Answer 1

Consider using a library other than BeautifulSoup. I use bleach for this:

from bleach import clean

def strip_html(src, allowed=("a",)):
    # keep only the tags named in `allowed`; strip every other tag and all comments
    return clean(src, tags=set(allowed), strip=True, strip_comments=True)

Answer 2

Consider the following:

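import re
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, 'html.parser') # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    for link in soup.find_all('a'):
        repl = link.get_text()
        if 'href' in link.attrs and repl:
            # rebuild the anchor so that it carries only its href and its text
            href = link.attrs['href']
            link.clear()
            link.attrs = {'href': href}
            link.append(repl)
            anchor = str(link)
            # put the anchor back in place of the first occurrence of its text
            # that is not already followed by </a>
            pattern = re.escape(repl) + r'(?!\s*</a>)'
            text = re.sub(pattern, lambda m: anchor, text, count=1)

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

What is new compared with the code in the question is this loop, shown here in simplified form:

    for link in soup.find_all('a'):
        pattern = re.escape(link.get_text()) + r'(?!\s*</a>)'
        text = re.sub(pattern, lambda m: str(link), text, count=1)

For each anchor tag, the anchor's text in the extracted output is replaced with the whole anchor itself. Note that only one replacement is made per anchor (count=1), on the first occurrence that qualifies.

The pattern re.escape(link.get_text()) + r'(?!\s*</a>)' ensures that we only replace link text that has not been replaced already. re.escape makes the link text match literally even if it contains regex metacharacters.

(?!\s*</a>) is a negative lookahead: it rejects any occurrence of the link text that is already followed by </a>.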
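The effect of the lookahead is easier to see in isolation. Below is a minimal sketch that uses only the standard `re` module; `wrap_first` is a hypothetical helper written for illustration, the sample text and the `/manual` URL are made up, and the lookahead is written as `(?!\s*</a>)`.

```python
import re

def wrap_first(text, label, anchor):
    """Replace the first occurrence of `label` that is not already followed by </a>."""
    # re.escape makes the label match literally; the negative lookahead
    # skips occurrences that already sit directly before a closing </a>
    pattern = re.escape(label) + r"(?!\s*</a>)"
    # a function replacement avoids backslash interpretation in `anchor`
    return re.sub(pattern, lambda _m: anchor, text, count=1)

text = "see docs and docs"
anchor = '<a href="/manual">docs</a>'

once = wrap_first(text, "docs", anchor)
twice = wrap_first(once, "docs", anchor)

print(once)   # see <a href="/manual">docs</a> and docs
print(twice)  # see <a href="/manual">docs</a> and <a href="/manual">docs</a>
```

On the second call the first "docs" is skipped because it is already followed by `</a>`, so the replacement lands on the remaining plain occurrence.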

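But this is not the most foolproof method: the substitution is purely textual, so link text that also occurs earlier in the page can be wrapped in the wrong place. The simplest approach is to iterate over every tag and take out its text.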
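One way to sketch that tag-by-tag traversal with BeautifulSoup is shown below. It is not taken from the answer: `keep_only_links` and its `keep` parameter are hypothetical names, and reducing the surviving tags to their `href` attribute is an assumption carried over from the code in Answer 2. Instead of extracting text and patching the links back in, it unwraps every tag that should not survive, so the links never leave their original position.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype

def keep_only_links(html, keep=("a",)):
    """Strip every tag except those named in `keep` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # script and style are removed together with their contents
    for tag in soup(["script", "style"]):
        tag.decompose()

    # comments and the doctype are not tags, so they are removed separately
    for node in list(soup.descendants):
        if isinstance(node, (Comment, Doctype)):
            node.extract()

    for tag in soup.find_all(True):
        if tag.name in keep:
            # assumption: only href is worth keeping on the surviving tags
            tag.attrs = {"href": tag["href"]} if tag.has_attr("href") else {}
        else:
            # replace the tag with its own children, keeping text and nested links
            tag.unwrap()

    return str(soup).strip()

testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'
print(keep_only_links(testhtml))
```

For multi-line pages, the line-by-line whitespace cleanup from the question can be applied to the returned string in the same way as before.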



Solution 1
7 2018-05-19 17:56:10

Solution 2
0 Accepted 2017-10-14 04:41:25
