How to remove the pilcrow sign (¶) between tags in HTML using Python

Question

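I am trying to scrape an HTML page by removing the required attributes. I can remove tags whose content is empty, but I am stuck on removing the pilcrow sign.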

Input:

 <h2>Tutorial material<a>¶</a></h2>

Expected output:

 <h2>Tutorial material<a></a></h2>

Code:

elements = soup.find_all(True)
for el in elements:
    if len(el.text) == 0:
        el.extract()
print(soup)

This code removes tags whose content is empty, but I cannot remove the pilcrow sign.

Answer 1

Try adding

#!/usr/bin/env python
# -*- coding: utf-8 -*-

to the top of your Python file, and refer to the pilcrow sign as u'¶' wherever you need it.
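If you would rather not depend on the source file's encoding at all, one option is to write the character as an escape sequence. Here is a minimal sketch of that idea; the names `PILCROW` and `is_pilcrow_only` are made up for illustration and do not come from the question. On Python 3, source files are UTF-8 by default and every `str` is Unicode, so the coding line and the `u` prefix are optional there.

```python
import unicodedata

# U+00B6 written as an escape, so the file's own encoding does not matter.
# PILCROW and is_pilcrow_only are hypothetical names used only in this sketch.
PILCROW = u"\u00b6"


def is_pilcrow_only(text):
    """Return True when text is just the pilcrow, ignoring surrounding whitespace."""
    return text.strip() == PILCROW


# Confirm that the escape really is the pilcrow character.
print(unicodedata.name(PILCROW))
```

A helper like this can then replace a literal `== u'¶'` comparison in the loop from the question.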

Answer 2

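The code you posted already removes empty nodes; you only need to adapt it to take @Robin's comment into account.

One solution is to check whether the node's text is empty or equal to ¶, and remove the node in either case: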

elements = soup.find_all(True)
for el in elements:
    # Drop tags that are empty or contain nothing but the pilcrow
    if len(el.text) == 0 or el.text == u'¶':
        el.extract()
print(soup)
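For anyone who wants to run this end to end, here is a self-contained sketch of the same check. It assumes Beautiful Soup 4 is installed and uses the built-in `html.parser`; the function name `drop_empty_or_pilcrow_tags` is hypothetical. It also compares against `get_text(strip=True)`, which is my own tweak so that surrounding whitespace does not defeat the comparison.

```python
from bs4 import BeautifulSoup

PILCROW = u"\u00b6"  # the pilcrow sign, written as an escape


def drop_empty_or_pilcrow_tags(html):
    """Remove every tag whose text is empty or only a pilcrow (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns a list of all tags, so extracting while looping is safe.
    for el in soup.find_all(True):
        text = el.get_text(strip=True)
        if text == "" or text == PILCROW:
            el.extract()
    return str(soup)


result = drop_empty_or_pilcrow_tags(u"<h2>Tutorial material<a>\u00b6</a></h2>")
print(result)
```

Note that this removes the whole `<a>` tag and not just its text, so the result is `<h2>Tutorial material</h2>`, which is not quite the expected output shown in the question.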

Answer 3

Check whether the text is (only) a pilcrow:

elements = soup.find_all(True)
for el in elements:
    if len(el.text) == 0 or el.text == u'¶':
        el.extract()
print(soup)
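The check above removes the whole tag, while the expected output in the question keeps an empty `<a></a>`. If that is what you need, a possible variation is to remove only the text node. This is a sketch under a few assumptions: Beautiful Soup 4.4 or later (for the `string` argument of `find_all`), the built-in `html.parser`, and a hypothetical function name `strip_pilcrow_text`.

```python
from bs4 import BeautifulSoup

PILCROW = u"\u00b6"  # the pilcrow sign, written as an escape


def strip_pilcrow_text(html):
    """Delete text nodes that are only a pilcrow and keep the enclosing tags."""
    soup = BeautifulSoup(html, "html.parser")
    # With string=..., find_all returns the matching text nodes, not the tags.
    matches = soup.find_all(string=lambda s: s is not None and s.strip() == PILCROW)
    for node in matches:
        node.extract()
    return str(soup)


cleaned = strip_pilcrow_text(u"<h2>Tutorial material<a>\u00b6</a></h2>")
print(cleaned)
```

You can still run the empty-tag cleanup afterwards if you also want the now-empty `<a>` gone.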