I'm using BeautifulSoup and have found an element in my document like so:
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
and I'd like to extract
Hershey's<sup>®</sup> makes yummy chocolate
I know I can take this item and grab its .contents, and then re-join the pieces that aren't an <a>, but that seems like a super hacky way to do it. How else might I get this text? Methods like get_text() return the text, but without the <sup> tags, which I'd like to preserve.
You can use next_siblings:
from bs4 import BeautifulSoup
html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
print(
"".join(str(x) for x in soup.find("a", id="_Toc374204469").next_siblings)
)
Output:
Hershey's<sup>®</sup> makes yummy chocolate
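If you don't want to hard-code the anchor's id, a variant of the same idea is to start from the last <a> directly under the <p>. This is only a sketch: the helper name html_after_last_anchor is made up for illustration, and it assumes all the anchors sit at the start of the paragraph.

```python
from bs4 import BeautifulSoup


def html_after_last_anchor(html):
    """Return the markup that follows the last direct-child <a> of the first <p>."""
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p")
    # Only look at anchors that are direct children of the paragraph.
    anchors = p.find_all("a", recursive=False)
    if not anchors:
        # No anchors to skip: return the paragraph's inner HTML unchanged.
        return p.decode_contents()
    # Serialise every node (text or tag) that comes after the last anchor.
    return "".join(str(node) for node in anchors[-1].next_siblings)


html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
print(html_after_last_anchor(html))
```

Anything that appears before the last anchor is dropped, so this only fits documents laid out like the one in the question.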
The best solution I've found thus far is to use the bleach package. With that, I can just do
import bleach
bleach.clean(my_html, tags=['sup'], strip=True)
This wasn't working for me at first because my html was a BeautifulSoup Tag object, and bleach wants an HTML string. So I just did str(Tag) to get the HTML representation and fed that to bleach.
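If adding bleach as a dependency isn't an option, the same allow-list idea can be roughly approximated with BeautifulSoup alone by unwrapping every tag that isn't on the list. This is a sketch, not a sanitizer: the helper name strip_tags_except is hypothetical, and unlike bleach it does nothing about attributes or unsafe content.

```python
from bs4 import BeautifulSoup


def strip_tags_except(html, allowed=("sup",)):
    """Remove every tag not in `allowed`, keeping the text inside it."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of all tags, so modifying the tree
    # while looping over it is safe.
    for tag in soup.find_all(True):
        if tag.name not in allowed:
            # unwrap() replaces the tag with its children (nothing, if empty).
            tag.unwrap()
    return str(soup)


html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
print(strip_tags_except(html))
```

To use it on a Tag you already have, pass str(tag), just as with bleach.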
Another option is to remove the <a> tags with decompose() and print what is left of the element:
from bs4 import BeautifulSoup
html_doc="""
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find('p')
# Remove every <a> inside the paragraph, in place
for t in s.select('p a'):
    t.decompose()
print(s)
Output:
<p>Hershey's<sup>®</sup> makes yummy chocolate</p>
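Note that the output above still includes the surrounding <p>. If you only want what's inside it, as in the question, Tag.decode_contents() returns the inner HTML as a string. A small sketch built on the same decompose() approach (the helper name inner_html_without is made up for illustration):

```python
from bs4 import BeautifulSoup


def inner_html_without(html, selector="a"):
    """Drop elements matching `selector` from the first <p>, return its inner HTML."""
    soup = BeautifulSoup(html, "html.parser")
    p = soup.find("p")
    # Remove the unwanted elements from the tree in place.
    for tag in p.select(selector):
        tag.decompose()
    # decode_contents() serialises the children only, without the <p> wrapper.
    return p.decode_contents()


html_doc = """
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""
print(inner_html_without(html_doc))
```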