
How to get text of this html element while preserving some inner tags

I'm using BeautifulSoup and have found an element in my document like so:

<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>

and I'd like to extract

Hershey's<sup>®</sup> makes yummy chocolate

I know I can take this item and grab its .contents, and then re-join the text if it doesn't include an <a>, but that seems like a super hacky way to do it. How else might I get this text? Methods like get_text() return the text but without the <sup> tags, which I'd like to preserve.
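For reference, the .contents approach described above might look roughly like the following. This is only a sketch of the "hacky" baseline, assuming the unwanted nodes are exactly the direct <a> children of the <p>; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Walk the direct children of <p>, skip <a> tags, and re-join the rest
# (text nodes and other tags such as <sup>) as markup.
kept = "".join(
    str(node)
    for node in p.contents
    if not (isinstance(node, Tag) and node.name == "a")
)
print(kept)
```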

You can use next_siblings:

from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")

print(
    "".join(str(x) for x in soup.find("a", id="_Toc374204469").next_siblings)
)

Output:

Hershey's<sup>®</sup> makes yummy chocolate

The best solution I've found thus far is to use the bleach package. With that, I can just do

import bleach
bleach.clean(my_html, tags=['sup'], strip=True)

This wasn't working for me at first because my HTML was a BeautifulSoup Tag object, and bleach wants an HTML string. So I just called str() on the Tag to get its HTML representation and fed that to bleach.
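Put together, the Tag-to-string step and the bleach call could look like this. It is a minimal sketch that assumes the bleach package is installed and that the element of interest is the first <p>; the names tag and cleaned are illustrative.

```python
import bleach
from bs4 import BeautifulSoup

html = """<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("p")  # a BeautifulSoup Tag, not a string

# bleach.clean works on text, so serialize the Tag with str() first.
# tags=["sup"] allows only <sup>; strip=True drops every other tag
# (here <p> and <a>) while keeping the text inside it.
cleaned = bleach.clean(str(tag), tags=["sup"], strip=True)
print(cleaned)
```

Because this is an allow-list, any tag not named in tags is stripped, including the enclosing <p>.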

Here is a solution that removes the <a> tags and keeps the rest of the element:

from bs4 import BeautifulSoup

html_doc="""
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

s = soup.find('p')
for t in s.select('p a'):
    t.decompose()

print(s)

Output:

<p>Hershey's<sup>®</sup> makes yummy chocolate</p>
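Note that this output still includes the enclosing <p> tag, whereas the question asks for only the inner content. One way to drop the wrapper, sketched here on the assumption that only the inner markup is wanted, is Tag.decode_contents(), which serializes the children of a tag without the tag itself.

```python
from bs4 import BeautifulSoup

html_doc = """
<p><a id="_Toc374204393"></a><a id="_Toc374204469"></a>Hershey's<sup>®</sup> makes yummy chocolate</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")
s = soup.find("p")

# Remove every <a> inside the paragraph from the tree.
for t in s.find_all("a"):
    t.decompose()

# Serialize only the children of <p>, not the <p> tag itself.
inner = s.decode_contents()
print(inner)
```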
