Remove HTML tag (Python)

I have the following HTML code:

<span><s>Something</s>Anything</span>

I would like to remove the span tag (and the text directly inside it), so that the resulting HTML is:

<s>Something</s>

I am using the BeautifulSoup library:

from bs4 import BeautifulSoup
HTML = '<span><s>Something</s>Anything</span>'
soup = BeautifulSoup(HTML, 'lxml')
soup.span.unwrap()

But that returns: <s>Something</s>Anything

If all you want is the <s> part, why not just select it instead of removing the rest?

For example:

from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
"""

soup = BeautifulSoup(sample, "lxml")
print(soup.find("s"))

This gets you:

<s>Something</s>

Should you have more of those <span> tags with <s> inside, you could go for something like this:

sample = """
<span><s>Something</s>Anything</span>
<span><s>More of Something</s>Less of Anything</span>
"""
soup = BeautifulSoup(sample, "lxml")  # re-parse, since the sample changed
print([t.find("s") for t in soup.find_all("span")])

To get this:

[<s>Something</s>, <s>More of Something</s>]
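
The same selection can probably also be written with a CSS selector. This is only a sketch, assuming a Beautiful Soup version that ships select() (it relies on the soupsieve package) and using the built-in html.parser instead of lxml so that nothing extra has to be installed:

```python
from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
<span><s>More of Something</s>Less of Anything</span>
"""

soup = BeautifulSoup(sample, "html.parser")

# "span > s" matches only <s> elements that are direct children of a <span>
wanted = soup.select("span > s")
print([str(tag) for tag in wanted])
```

Unlike t.find("s"), the selector simply skips any <span> that has no <s> child instead of putting None in the list.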

However, if you remove the <span> tags themselves, you'll end up with empty HTML (at least in this simple case), because extract() removes a tag together with everything inside it.

See this:

from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
"""

soup = BeautifulSoup(sample, "lxml")

for tag in soup.find_all(True):
    if tag.name == "span":
        tag.extract()
print(soup)

Produces this:

<html><head></head><body>
</body></html>

Or, shorter, with a list comprehension:

print([t.extract() for t in soup.find_all("span")])

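Gives: [] when run after the loop above, because the <span> has already been extracted and there is nothing left to find.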
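
For what it's worth, extract() returns the tag it removed, so on a freshly parsed document the comprehension should list the detached <span> tags rather than nothing. A small sketch to check that, assuming the built-in html.parser (chosen here so that no <html>/<body> wrapper is added around the fragment):

```python
from bs4 import BeautifulSoup

sample = "<span><s>Something</s>Anything</span>"
soup = BeautifulSoup(sample, "html.parser")

# extract() detaches each <span> (children included) and returns it
removed = [t.extract() for t in soup.find_all("span")]

print(removed)          # the detached <span> tags
print(repr(str(soup)))  # what is left of the document
```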

So, I guess, your best bet is to pick out the tags you want rather than remove the ones you don't.
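
If the document really has to be modified in place, one possible approach is to drop the text nodes that sit directly inside each <span> and only then unwrap it. This is a sketch under assumptions, not the only way to do it: it uses the built-in html.parser, and it treats every bare string directly under a <span> as unwanted.

```python
from bs4 import BeautifulSoup, NavigableString

sample = "<span><s>Something</s>Anything</span>"
soup = BeautifulSoup(sample, "html.parser")

for span in soup.find_all("span"):
    # Copy the children into a list first, since we modify the tree while looping
    for child in list(span.children):
        if isinstance(child, NavigableString):
            child.extract()  # remove bare text such as "Anything"
    span.unwrap()            # drop the <span> itself, keep the remaining children

result = str(soup)
print(result)
```

Text inside the nested <s> is untouched, because only the direct children of the <span> are inspected.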

I tried the following code:

s1 = soup.span.s
soup.span.replace_with(s1)  # replace_with is the current name; replaceWith is a deprecated alias
print(soup)

Output:

<html><body><s>Something</s></body></html>
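
The snippet above only touches the first <span>. A sketch that applies the same idea to every <span> could look like this (assuming the built-in html.parser and a two-span sample made up for illustration):

```python
from bs4 import BeautifulSoup

sample = (
    "<span><s>Something</s>Anything</span>\n"
    "<span><s>More of Something</s>Less of Anything</span>"
)
soup = BeautifulSoup(sample, "html.parser")

for span in soup.find_all("span"):
    inner = span.find("s")
    if inner is not None:
        # Swap the whole <span> for its <s> child; the span's own text is discarded
        span.replace_with(inner)

result = str(soup)
print(result)
```

A <span> without an <s> inside is left as it is, which may or may not be what you want.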

Do you want the <s> tag, or the innerHTML of the <span>?

The first answer gives you code to get the <s> tag, i.e. <s>Something</s>.

To get the innerHTML of the <span>, i.e. <s>Something</s>Anything, use:

spanTag.decode_contents()
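
To make the difference concrete, here is a small sketch comparing decode_contents() with two related calls. The variable name span_tag and the choice of html.parser are assumptions for the example:

```python
from bs4 import BeautifulSoup

sample = "<span><s>Something</s>Anything</span>"
soup = BeautifulSoup(sample, "html.parser")
span_tag = soup.span

inner = span_tag.decode_contents()  # markup inside the <span> only
outer = str(span_tag)               # markup including the <span> itself
text = span_tag.get_text()          # text only, all tags stripped

print(inner)
print(outer)
print(text)
```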

from bs4 import BeautifulSoup

# Read the HTML from a file, then parse it
with open('home.html', 'r') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # prettify() indents the markup so the output is easier to read
