How to get text of this html element while preserving some inner tags

Question

I'm using BeautifulSoup and have found an element in my document like so:

<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>

and I'd like to extract

Hershey's<sup>®</sup> makes yummy chocolate

I know I can take the element's .contents and re-join the children that aren't <a> tags, but that seems like a super hacky way to do it. How else might I get this text? Methods like get_text() return the text, but without the <sup> tags, which I'd like to preserve.

Answer 1

You can use next_siblings on the last <a> tag to collect everything that follows it:

from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")

print(
    "".join(str(x) for x in soup.find("a", id="_Toc374204469").next_siblings)
)

Output:

Hershey's<sup>®</sup> makes yummy chocolate
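The approach above depends on knowing the id of the last anchor. As a minimal sketch of a variant that does not, the paragraph's direct children can be re-joined while skipping unwanted tags. The helper name inner_html_without and its skip parameter are hypothetical, introduced only for illustration; only direct children are filtered, so an <a> nested inside another tag would be kept.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""


def inner_html_without(tag, skip=("a",)):
    """Join the direct children of `tag`, leaving out child tags named in `skip`.

    Hypothetical helper: the name and the `skip` parameter are assumptions.
    """
    return "".join(
        str(child)
        for child in tag.children
        # Text nodes are always kept; Tag children are kept unless skipped.
        if not (isinstance(child, Tag) and child.name in skip)
    )


soup = BeautifulSoup(html, "html.parser")
result = inner_html_without(soup.find("p"))
print(result)  # Hershey's<sup>®</sup> makes yummy chocolate
```

This leaves the parsed tree untouched, which may matter if the same soup is reused later.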

Answer 2

The best solution I've found thus far is to use the bleach package. With that, I can just do:

import bleach
bleach.clean(my_html, tags=['sup'], strip=True)

This wasn't working for me at first because my HTML was a BeautifulSoup Tag object, and bleach expects an HTML string. So I called str() on the Tag to get its HTML representation and fed that to bleach.
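If adding a dependency is not an option, a similar allow-list effect can be sketched with BeautifulSoup alone. This is not bleach and not a sanitizer: it swaps in Tag.unwrap() to drop every tag that is not on the allow list while keeping its contents, and it leaves the attributes of allowed tags as they are. The function name keep_only_tags and its allowed parameter are hypothetical.

```python
from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""


def keep_only_tags(markup, allowed=("sup",)):
    """Return `markup` with every tag not in `allowed` unwrapped.

    Hypothetical helper: unwrap() removes a tag but keeps its children,
    so text inside a removed tag survives.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # find_all(True) matches every tag; the result list is built up front,
    # so modifying the tree inside the loop is safe.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            tag.unwrap()
    return str(soup)


cleaned = keep_only_tags(html)
print(cleaned)  # Hershey's<sup>®</sup> makes yummy chocolate
```

Because it accepts a string, a Tag object still needs str() applied first, just as with bleach.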

Answer 3

Here is a solution that removes the <a> tags from the paragraph in place:

from bs4 import BeautifulSoup

html_doc="""
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

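s = soup.find('p')
# s is already the <p>, so selecting 'a' finds every anchor inside it;
# decompose() removes each one from the tree
for t in s.select('a'):
    t.decompose()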

print(s)

Output:

<p>Hershey's<sup>®</sup> makes yummy chocolate</p>
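The output above still carries the enclosing <p>, while the question asks for the inner markup only. A minimal sketch of one way to get that, assuming the anchors have been removed as above, is to call decode_contents() on the paragraph, which renders a tag's children without the tag itself.

```python
from bs4 import BeautifulSoup

html_doc = """
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
p = soup.find("p")

# Remove the empty anchors from the tree.
for a in p.find_all("a"):
    a.decompose()

# decode_contents() returns the children as a string, without the <p> wrapper.
inner = p.decode_contents()
print(inner)  # Hershey's<sup>®</sup> makes yummy chocolate
```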
