
How to get text with line breaks that match the browser view rather than the HTML source (using Python and BeautifulSoup)

When I use the get_text() method from the BeautifulSoup module in Python, it returns text whose line breaks match the HTML source.

However, I want the line breaks to mimic what you would see in a browser (for example, ignore line breaks in the HTML source, one line break for a <br> tag, two line breaks between <p> tags).

from bs4 import BeautifulSoup

some_html = """<p>Some
sample html<br>
new line
<p>New paragraph"""

plain_text = BeautifulSoup(some_html, "html.parser").get_text()

Expected result:

Some sample html
new line

New paragraph

Actual result:

Some 
sample html
new line
New paragraph

I ended up applying a few regex substitutions to the HTML before parsing it. This worked for the HTML I was working with.

import re

from bs4 import BeautifulSoup

sample = """<p>Some
sample html<br>
new line
<p>New paragraph"""

# Remove all line breaks in the source
sample_remove_line_breaks = re.sub(r'\r?\n', ' ', sample)

# Add one line break after each `<br>` tag
sample_add_br_line_breaks = re.sub(r'<br>', '<br>\n', sample_remove_line_breaks)

# Add two line breaks before each `<p>` tag
sample_add_html_line_breaks = re.sub(r'<p>', '\n\n<p>', sample_add_br_line_breaks)

plain_text = BeautifulSoup(sample_add_html_line_breaks, "html.parser").get_text()
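The substitutions above only match the bare strings `<br>` and `<p>`, so tags with attributes or the `<br/>` form are missed, and the result can keep stray spaces around the inserted breaks. As a rough sketch of the same idea done on the parsed tree instead, something like the following may work. The helper name `browser_like_text` is made up for this example, and it assumes the document has no `<pre>` blocks or other whitespace-sensitive content, because it collapses all whitespace up front.

```python
import re

from bs4 import BeautifulSoup


def browser_like_text(html):
    """Hypothetical helper: approximate browser line breaks for simple HTML."""
    # Collapse every whitespace run in the source (line breaks included)
    # into a single space. Assumption: no <pre> or similar content.
    soup = BeautifulSoup(re.sub(r"\s+", " ", html), "html.parser")

    # Turn each <br> (with or without attributes, or written <br/>) into one line break.
    for br in soup.find_all("br"):
        br.replace_with("\n")

    # Put a blank line in front of every <p>.
    for p in soup.find_all("p"):
        p.insert_before("\n\n")

    text = soup.get_text()
    # Drop spaces next to line breaks and cap consecutive breaks at two.
    text = re.sub(r" *\n *", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = """<p>Some
sample html<br>
new line
<p>New paragraph"""

print(browser_like_text(sample))
# Some sample html
# new line
#
# New paragraph
```

This only covers `<br>` and `<p>`. Other block-level tags such as `<div>` or `<li>` would need their own handling, so treat it as a starting point rather than a full renderer.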
