Removing HTML tags (Python)

Question

I have the following HTML code:

<span><s>Something</s>Anything</span>

I want to remove the span tag and get back this HTML:

<s>Something</s>

I am using the beautifulsoup library:

soup = BeautifulSoup(HTML, 'lxml')
soup.span.unwrap()

But this returns -> <s>Something</s>Anything
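To make the behaviour above easy to reproduce, here is a minimal sketch. It assumes the built-in "html.parser" instead of 'lxml' (so the output is not wrapped in html/body tags); the variable names are only illustrative. It shows that unwrap() removes the span tag itself but keeps every child, including the loose text.

```python
from bs4 import BeautifulSoup

html = "<span><s>Something</s>Anything</span>"

# Assumption: "html.parser" is used here so no <html>/<body> wrapper is added.
soup = BeautifulSoup(html, "html.parser")

# unwrap() replaces the <span> with its own children: the <s> tag AND the text node.
soup.span.unwrap()

unwrapped = str(soup)
print(unwrapped)  # <s>Something</s>Anything
```

So the text "Anything" survives because it is a child of the span, not part of the tag that unwrap() strips.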

Answer 1

If all you want is the <s> part, why not simply pick it out instead of deleting things?

For example:

from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
"""

soup = BeautifulSoup(sample, "lxml")
print(soup.find("s"))

This gives you:

<s>Something</s>

If you have more <span> tags that contain an <s>, you can go about it like this:

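sample = """
<span><s>Something</s>Anything</span>
<span><s>More of Something</s>Less of Anything</span>
"""
# Re-parse the new sample; otherwise soup still holds the previous document
soup = BeautifulSoup(sample, "lxml")
print([t.find("s") for t in soup.find_all("span")])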

To get this:

[<s>Something</s>, <s>More of Something</s>]
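As an alternative sketch of the same filtering idea, a CSS selector can pick out the <s> tags directly. This assumes the built-in "html.parser" and Beautiful Soup's select() method (which relies on the soupsieve package that normally ships with bs4); the names s_tags and html_out are only illustrative. Unlike the list comprehension above, which puts None in the list for a span with no <s> inside, select() simply skips such spans.

```python
from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
<span><s>More of Something</s>Less of Anything</span>
<span>No strike-through here</span>
"""

soup = BeautifulSoup(sample, "html.parser")

# "span > s" matches every <s> that is a direct child of a <span>
s_tags = soup.select("span > s")

# Join the matched tags back into one HTML string
html_out = "".join(str(tag) for tag in s_tags)
print(html_out)  # <s>Something</s><s>More of Something</s>
```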

However, if you remove the tag instead, you end up with empty HTML (at least in this simple case).

See this:

from bs4 import BeautifulSoup

sample = """
<span><s>Something</s>Anything</span>
"""

soup = BeautifulSoup(sample, "lxml")

for tag in soup.find_all(True):
    if tag.name == "span":
        tag.extract()
print(soup)

This produces:

<html><head></head><body>
</body></html>

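Or, more briefly, with a list comprehension:

print([t.extract() for t in soup.find_all("span")])

This gives (when run on the same soup after the loop above, since no <span> is left): []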

So I think your best option is to filter for the tags you want rather than delete the ones you don't.
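If the tree does need to be edited in place, one possible sketch (not part of the original answer) is to drop the bare text nodes inside the span first and only then unwrap it. It assumes the built-in "html.parser" and that every direct text child of the span is unwanted.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<span><s>Something</s>Anything</span>", "html.parser")
span = soup.span

# Iterate over a copy of the children, because extract() mutates span.contents
for child in list(span.contents):
    # Direct text children (here "Anything") are NavigableString objects
    if isinstance(child, NavigableString):
        child.extract()

# Now the span only holds the <s> tag, so unwrapping leaves just that tag
span.unwrap()

result = str(soup)
print(result)  # <s>Something</s>
```

The text inside <s> is untouched because it is a child of the <s> tag, not a direct child of the span.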

Answer 2

I tried the following code:

s1 = soup.span.s
# replace_with() is the current name; replaceWith() is its deprecated alias
soup.span.replace_with(s1)
print(soup)

Output:

<html><body><s>Something</s></body></html>
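The same replacement can be sketched for a document with several spans. This assumes the built-in "html.parser" (so no html/body wrapper appears) and a made-up two-span sample; spans without an <s> child are left alone.

```python
from bs4 import BeautifulSoup

sample = (
    "<span><s>Something</s>Anything</span>"
    "<span><s>More of Something</s>Less of Anything</span>"
)
soup = BeautifulSoup(sample, "html.parser")

# find_all() returns a plain list-like result, so modifying the tree in the loop is safe
for span in soup.find_all("span"):
    s_tag = span.find("s")
    if s_tag is not None:
        # Swap the whole <span> (including its loose text) for its <s> child
        span.replace_with(s_tag)

replaced = str(soup)
print(replaced)  # <s>Something</s><s>More of Something</s>
```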

Answer 3

Do you want the <s> tag, or the innerHTML of the <span>?

The first answer gives you code to get the <s> tag, i.e. <s>Something</s>

To get the innerHTML of the <span>, i.e. the value <s>Something</s>Anything, use

spanTag.decode_contents()
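For comparison, here is a small sketch of the three common ways to read a tag, assuming the built-in "html.parser"; span_tag plays the role of spanTag above, and the other names are illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<span><s>Something</s>Anything</span>", "html.parser")
span_tag = soup.span

outer = str(span_tag)               # the tag itself plus its contents (outerHTML)
inner = span_tag.decode_contents()  # only the contents as markup (innerHTML)
text = span_tag.get_text()          # only the text, with all markup stripped

print(outer)  # <span><s>Something</s>Anything</span>
print(inner)  # <s>Something</s>Anything
print(text)   # SomethingAnything
```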

Answer 4

from bs4 import BeautifulSoup

# Read the whole HTML file into a string
with open('home.html', 'r') as html_file:
    content = html_file.read()

soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())  # prettify() re-indents the markup so the output is easier to read

4 solutions

Solution 1
1 2020-12-20 09:50:37

Solution 2
0 Accepted 2020-12-20 20:18:51

Solution 3
0 2020-12-21 05:39:36

Solution 4
-1 2020-12-21 06:38:29
