When I use the get_text() method from the BeautifulSoup module in Python, it returns text whose line breaks match the HTML source.
However, I want the line breaks to mimic what you would see in a browser: ignore line breaks in the HTML source, emit one line break for a <br> tag, and emit two line breaks between <p> tags.
from bs4 import BeautifulSoup
some_html = """<p>Some
sample html<br>
new line
<p>New paragraph"""
plain_text = BeautifulSoup(some_html,"html.parser").get_text()
Expected result:
Some sample html
new line

New paragraph
Actual result:
Some
sample html
new line
New paragraph
I ended up applying a few regular-expression substitutions to the HTML before parsing it. This worked for the HTML I was working with.
import re
from bs4 import BeautifulSoup
sample = """<p>Some
sample html<br>
new line
<p>New paragraph"""
# Remove all line breaks in the source
sample_remove_line_breaks = re.sub(r'\r?\n', ' ', sample)
# Add one line break after each `<br>` tag and two before each `<p>` tag
sample_add_html_line_breaks = re.sub(r'<br>', '<br>\n', sample_remove_line_breaks)
sample_add_html_line_breaks = re.sub(r'<p>', '\n\n<p>', sample_add_html_line_breaks)
plain_text = BeautifulSoup(sample_add_html_line_breaks, "html.parser").get_text()
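The substitutions above match only the literal strings <br> and <p>, so variants such as <br/> or <p class="x"> would slip through, and the result keeps a few stray spaces and leading line breaks. As a rough alternative sketch (not the approach used above, and not part of BeautifulSoup), the same idea can be expressed with the standard library's html.parser module. The names BrowserLikeText and html_to_text are made up for illustration, and the whitespace rules are a simplification of what a browser actually does.

```python
import re
from html.parser import HTMLParser


class BrowserLikeText(HTMLParser):
    """Collect text, treating <br> as one line break and <p> as a paragraph break."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing forms such as <br/>.
        if tag == "br":
            self.parts.append("\n")
        elif tag == "p":
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Whitespace in the source, including line breaks, collapses to one space.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Hypothetical helper: approximate browser-style line breaks for simple HTML."""
    parser = BrowserLikeText()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces
    text = re.sub(r" *\n *", "\n", text)     # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    return text.strip()
```

Calling html_to_text(sample) on the sample string from the snippet above should give the three lines from the expected result, with a blank line before "New paragraph". This sketch does not skip the contents of <script> or <style> elements and knows nothing about other block-level tags such as <div> or <li>, so it is only suitable for simple markup like the example.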