How to remove tags from HTML using Python
I have HTML code like this:
<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>
I tried removing the div tags with this function:
import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    print(cleantext)
and with this code:
cleantext = BeautifulSoup(raw_html, "html.parser").text
But the program fails with the error:
expected string or bytes-like object
How can I remove the div tags? I want the text in this format:
A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational
The type of my raw_html is:
<class 'bs4.element.ResultSet'>
raw_html is a ResultSet, which is a list of tags rather than a string, so re.sub rejects it. To get the text from the <div> tags, call .text or .get_text() on each one. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
raw_html = soup.select("div.qtext, div.rightanswer") # <-- raw_html is ResultSet
s = "\n".join(div.get_text(strip=True) for div in raw_html)
print(s)
Prints:
A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational
Selling new products to existing customers is a strategy of
The correct answer is: Product development
In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the
The correct answer is: Industry and competitor analysis
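The regex function from the question can also be made to work, as long as it receives one string at a time instead of the whole ResultSet. The following is only a minimal sketch under that assumption: the helper name strip_tags and the shortened two-line markup are illustrative and not part of the original post.

```python
import re

from bs4 import BeautifulSoup

html_doc = """
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
"""

TAG_RE = re.compile(r"<.*?>")


def strip_tags(raw_html):
    """Remove anything that looks like a tag from a string of HTML."""
    return TAG_RE.sub("", raw_html)


soup = BeautifulSoup(html_doc, "html.parser")
result_set = soup.select("div.qtext, div.rightanswer")

# Passing the ResultSet itself fails, because it is a list of tags, not a string.
try:
    strip_tags(result_set)
except TypeError as exc:
    print("TypeError:", exc)

# Converting each tag to a string first makes the regex usable.
lines = [strip_tags(str(div)).strip() for div in result_set]
print("\n".join(lines))
```

A regex like this is fragile on real-world HTML (for example, a ">" character inside an attribute value), so .get_text() is usually the safer choice.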