How to remove tags from HTML using Python

Question

I have HTML code like this:

<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>

I tried removing the div tags with this function:

import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')  # non-greedy match for any tag
    cleantext = re.sub(cleanr, '', raw_html)
    print(cleantext)

and with this code:

cleantext = BeautifulSoup(raw_html, "html.parser").text

But the program fails with this error:

expected string or bytes-like object

So how can I remove the div tags? I want the text in this format:

A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational

The type of my raw_html is:

<class 'bs4.element.ResultSet'>

Answer 1

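To get the text from <div> tags, use .text or .get_text(). Your raw_html is a ResultSet (a list of tags), not a string, which is why re.sub() rejects it; call .get_text() on each element instead. For example: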

from bs4 import BeautifulSoup

html_doc = """
<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
raw_html = soup.select("div.qtext, div.rightanswer")  # <-- raw_html is ResultSet

s = "\n".join(div.get_text(strip=True) for div in raw_html)
print(s)

Prints:

A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational
Selling new products to existing customers is a strategy of
The correct answer is: Product development
In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the
The correct answer is: Industry and competitor analysis
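
As a complement to the answer above, here is a minimal sketch of how the question's regex-based cleanhtml could be made to work on a ResultSet. It assumes the original error came from passing the ResultSet itself to re.sub(), which expects a string; converting each tag with str() first avoids that. The sample markup is shortened from the question, and the names TAG_RE and tags are only illustrative.

```python
import re

from bs4 import BeautifulSoup

# Same pattern as in the question: a non-greedy match for any tag.
TAG_RE = re.compile(r"<.*?>")


def cleanhtml(tags):
    """Strip markup from each element of an iterable of tags (e.g. a ResultSet)."""
    # str(tag) serialises a Tag back to its HTML markup, which re.sub() accepts.
    return "\n".join(TAG_RE.sub("", str(tag)).strip() for tag in tags)


html_doc = (
    '<div class="qtext">Selling new products to existing customers is a strategy of</div>'
    '<div class="rightanswer">The correct answer is: Product development</div>'
)

soup = BeautifulSoup(html_doc, "html.parser")
result = cleanhtml(soup.find_all("div"))  # find_all() returns a ResultSet
print(result)
```

Note that the regex only deletes the tags and does not decode HTML entities such as &amp;, so .get_text() is usually the more robust choice.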

Solution 1
0 2021-06-10 16:47:34
