Remove all HTML tags except one tag with BeautifulSoup
I need to extract all of the text and the <a> tags from a page, but I don't know how to do it. This is what I have so far:
from bs4 import BeautifulSoup

def cleanMe(html):
    # create a new bs4 object from the html data, naming the parser explicitly
    soup = BeautifulSoup(html, "html.parser")
    # remove all javascript and stylesheet code
    for script in soup(["script", "style"]):
        script.decompose()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on runs of two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'
cleaned = cleanMe(testhtml)
print(cleaned)
Output:
THIS IS AN EXAMPLE I need this text with this link captured.
The output I want:
THIS IS AN EXAMPLE I need this text with this <a href="http://example.com/">link</a> captured.
Consider using a library other than BeautifulSoup. I use this one:
from bleach import clean

def strip_html(self, src, allowed=['a']):
    # strip=True removes disallowed tags instead of escaping them
    return clean(src, tags=allowed, strip=True, strip_comments=True)
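As a usage sketch of the bleach approach, written as a plain function rather than a method: this assumes bleach is installed and that its default attribute allowlist keeps href on <a> tags. The name strip_html_keep_links and the sample string are illustrative and do not come from the answer.

```python
from bleach import clean

def strip_html_keep_links(src, allowed=("a",)):
    # strip=True removes disallowed tags instead of escaping them;
    # the text inside those tags is kept.
    return clean(src, tags=set(allowed), strip=True, strip_comments=True)

sample = 'I need this <b>text</b> with this <a href="http://example.com/">link</a> captured.'
print(strip_html_keep_links(sample))
```

Because strip=True drops only the tags and keeps their inner text, the contents of <script> and <style> elements would likely survive as plain text, so they may need to be removed separately beforehand.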
Consider the following:
import re
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, 'html.parser')  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            repl = link.get_text()
            href = link.attrs['href']
            # rebuild the anchor so it carries only its href and its text
            link.clear()
            link.attrs = {}
            link.attrs['href'] = href
            link.append(repl)
            # replace the first occurrence of the link text that is not already
            # followed by </a>; re.escape keeps special characters in the text literal
            # (no '=' after '(?!': it would be matched as a literal character)
            text = re.sub(re.escape(repl) + '(?! *?</a>)', lambda m: str(link), text, count=1)
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on runs of two spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
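The new part we added is the following:
for link in soup.find_all('a'):
    text = re.sub(re.escape(link.get_text()) + '(?! *?</a>)', lambda m: str(link), text, count=1)
For each anchor tag, we replace the anchor's text (link.get_text()) in the extracted text with the entire anchor itself. Note that only one replacement is made per anchor, on the first matching occurrence of the link text.
The regular expression, the link text followed by '(?! *?</a>)', ensures that we only replace link text that has not been replaced already. (?! *?</a>) is a negative lookahead: it rejects any occurrence of the link text that is immediately followed by </a>, that is, one that already sits inside an anchor. An '=' placed directly after '(?!' would be matched as a literal character, so it must not appear there.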
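The effect of the negative lookahead can be checked in isolation with the standard re module. The helper name wrap_once, its pattern_suffix parameter, and the sample strings below are hypothetical and used only for illustration; the second loop shows what happens when a literal '=' ends up inside the lookahead.

```python
import re

def wrap_once(text, label, anchor_html, pattern_suffix=r"(?!\s*</a>)"):
    """Replace the first occurrence of label that the lookahead does not reject."""
    pattern = re.escape(label) + pattern_suffix
    # A function replacement avoids backslash handling in the replacement string.
    return re.sub(pattern, lambda m: anchor_html, text, count=1)

anchors = ('<a href="/a">link</a>', '<a href="/b">link</a>')

# Correct lookahead: the second pass skips the text that is already wrapped.
text = "link and link"
for anchor in anchors:
    text = wrap_once(text, "link", anchor)
print(text)  # <a href="/a">link</a> and <a href="/b">link</a>

# Lookahead with a literal '=': already wrapped text is matched again,
# so the second anchor ends up nested inside the first.
broken = "link and link"
for anchor in anchors:
    broken = wrap_once(broken, "link", anchor, pattern_suffix=r"(?!=\s*</a>)")
print(broken)
```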
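However, this is not the most foolproof approach. The simplest way is to iterate over each tag and extract its text.
See the working code here.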
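The tag-by-tag idea might look roughly like the sketch below. It is an assumption about what is meant, not the code the answer links to: it uses BeautifulSoup's unwrap() to replace every tag except <a> with its contents, and the function name keep_only_links is hypothetical.

```python
from bs4 import BeautifulSoup, Comment, Doctype

def keep_only_links(html):
    soup = BeautifulSoup(html, "html.parser")
    # Drop script/style elements together with their contents.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Drop the doctype and comments, which would otherwise be serialized.
    for node in list(soup.descendants):
        if isinstance(node, (Comment, Doctype)):
            node.extract()
    for tag in soup.find_all(True):
        if tag.name == "a" and tag.get("href"):
            # Keep the anchor, but only its href attribute.
            tag.attrs = {"href": tag["href"]}
        else:
            # Replace the tag with its children, keeping the text.
            tag.unwrap()
    # Collapse runs of whitespace into single spaces.
    return " ".join(str(soup).split())

testhtml = '<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text with this <a href="http://example.com/">link</a> captured.</body>'
print(keep_only_links(testhtml))
```

Working on the parse tree avoids the text matching step entirely, so repeated link text or link text that also appears elsewhere on the page should not cause misplaced anchors.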