Remove HTML tag (Python)
I have the following HTML code:
<span><s>Something</s>Anything</span>
I would like to remove the span tag, returning the HTML code:
<s>Something</s>
I am using the BeautifulSoup library:
soup = BeautifulSoup(HTML, 'lxml')
soup.span.unwrap()
But that returns:
<s>Something</s>Anything
If all you want is the <s> part, why not just filter that out instead of removing?
For example:
from bs4 import BeautifulSoup
sample = """
<span><s>Something</s>Anything</span>
"""
soup = BeautifulSoup(sample, "lxml")
print(soup.find("s"))
This gets you:
<s>Something</s>
Should you have more of those <span> tags with <s> inside, you could go for something like this:
sample = """
<span><s>Something</s>Anything</span>
<span><s>More of Something</s>Less of Anything</span>
"""
# Re-parse so that soup reflects the new sample
soup = BeautifulSoup(sample, "lxml")
print([t.find("s") for t in soup.find_all("span")])
To get this:
[<s>Something</s>, <s>More of Something</s>]
However, if you want to remove the tags, then you'll end up with an empty HTML (at least in this simple case).
See this:
from bs4 import BeautifulSoup
sample = """
<span><s>Something</s>Anything</span>
"""
soup = BeautifulSoup(sample, "lxml")
for tag in soup.find_all(True):
    if tag.name == "span":
        tag.extract()
print(soup)
Produces this:
<html><head></head><body>
</body></html>
Or, shorter, with a list comprehension:
print([t.extract() for t in soup.find_all("span")])
Gives (when run after the loop above, where the span has already been extracted):
[]
So, I guess, your best bet is to filter the unwanted tags out.
I tried the following code:
s1 = soup.span.s
soup.span.replace_with(s1)  # replaceWith is the older, deprecated spelling
print(soup)
Output:
<html><body><s>Something</s></body></html>
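If there are several <span> tags, the same replacement can be applied in a loop. The following is only a minimal sketch: the function name `replace_spans_with_s` is hypothetical, and it uses the built-in "html.parser" instead of "lxml" (an assumption made here so that no <html>/<body> wrapper is added to the output).

```python
from bs4 import BeautifulSoup


def replace_spans_with_s(html: str) -> str:
    """Replace every <span> that contains an <s> with that <s> tag."""
    # "html.parser" is an assumption; the thread itself uses "lxml".
    soup = BeautifulSoup(html, "html.parser")
    for span in soup.find_all("span"):
        s_tag = span.find("s")
        if s_tag is not None:
            s_tag.extract()           # detach <s> from the span first
            span.replace_with(s_tag)  # then put it where the span was
    return str(soup)


sample = (
    "<span><s>Something</s>Anything</span>"
    "<span><s>More of Something</s>Less of Anything</span>"
)
result = replace_spans_with_s(sample)
print(result)
```

Spans without an <s> child are left untouched in this sketch; whether that is the desired behaviour depends on the actual markup.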
Do you want the <s> tags or the innerHTML of <span>?
The first answer gives you code to get the <s> tag, i.e. <s>Something</s>.
To get the innerHTML of <span>, i.e. the value <s>Something</s>Anything, use:
spanTag.decode_contents()
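As a small self-contained sketch of that call: the variable names `span_tag` and `inner_html` are illustrative, and "html.parser" is an assumption used here in place of "lxml".

```python
from bs4 import BeautifulSoup

html = "<span><s>Something</s>Anything</span>"
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

span_tag = soup.find("span")
# decode_contents() renders the tag's children as a string,
# without the enclosing <span> itself.
inner_html = span_tag.decode_contents()
print(inner_html)  # <s>Something</s>Anything
```

This is the same string that unwrap() leaves behind in the question, so it only helps if the text after <s> is meant to stay.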
from bs4 import BeautifulSoup
with open('home.html', 'r') as html_file:
    content = html_file.read()
soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # this part makes the output look better