
How to remove tags from HTML using Python

I have HTML code like this:

<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>

I tried removing the div tags with this function:

import re

def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    print(cleantext)

and with this code:

cleantext = BeautifulSoup(raw_html, "html.parser").text

But the program fails with this error:

expected string or bytes-like object

How can I remove the div tags? I want the text in this format:

A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational

The type of my raw_html is:

<class 'bs4.element.ResultSet'>

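Your raw_html is a ResultSet, that is, a list of Tag objects rather than a single string, which is why re.sub() rejects it. To get the text from <div> tags, call .text or .get_text() on each element. For example: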

from bs4 import BeautifulSoup

html_doc = """
<div class="qtext">A financial model should be developed when the business planning process has reached the _______ planning stage</div>
<div class="rightanswer">The correct answer is: Operational</div>
<div class="qtext">Selling new products to existing customers is a strategy of</div>
<div class="rightanswer">The correct answer is: Product development</div>
<div class="qtext">In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the</div>
<div class="rightanswer">The correct answer is: Industry and competitor analysis</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
raw_html = soup.select("div.qtext, div.rightanswer")  # <-- raw_html is ResultSet

s = "\n".join(div.get_text(strip=True) for div in raw_html)
print(s)

Prints:

A financial model should be developed when the business planning process has reached the _______ planning stage
The correct answer is: Operational
Selling new products to existing customers is a strategy of
The correct answer is: Product development
In the strategic review and strategic planning process Product/Portfolio Analysis follows immediately after the
The correct answer is: Industry and competitor analysis
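
If you would rather keep the regex approach from the question, it can probably be made to work by applying it to one element at a time, after converting each element to a string. The following is only a sketch: the helper name clean_each is hypothetical, and a plain list of markup strings stands in for the ResultSet so that it runs without bs4 installed. With a real ResultSet, str(tag) should give the tag's markup, so the same helper ought to apply.

```python
import re

TAG_RE = re.compile(r"<.*?>")  # same pattern as in the question


def clean_each(elements):
    """Strip tags from every element of an iterable, one string at a time."""
    # str() turns a bs4 Tag back into its markup; for a plain str it changes nothing.
    return [TAG_RE.sub("", str(element)).strip() for element in elements]


# Stand-in for the ResultSet (an assumption for this sketch): a list of markup strings.
raw_html = [
    '<div class="qtext">Selling new products to existing customers is a strategy of</div>',
    '<div class="rightanswer">The correct answer is: Product development</div>',
]

try:
    # Passing the whole collection at once, as in the question, is what fails.
    TAG_RE.sub("", raw_html)
except TypeError as exc:
    print("re.sub failed:", exc)

# Cleaning element by element works.
print("\n".join(clean_each(raw_html)))
```

Note that a pattern like <.*?> is fragile on real-world HTML (comments, or attribute values that contain ">"), so .get_text() is usually the safer choice when BeautifulSoup is already in use.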
