How do I remove all the different script tags in BeautifulSoup?

Question

I am scraping a table from a web page and want to rebuild the table with all the markup tags removed. Here is the source code.

import requests
from bs4 import BeautifulSoup

# url is the address of the page being crawled
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all different script tags
        # col.replace_with('')
        # col.decompose()
        # col.extract()
        col = col.contents

How can I remove all the different tags? Take the following cell as an example; it contains the tags a, br, and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

My expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Answer 1

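What you are asking about is get_text():

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string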

td = soup.find("td")
td.get_text()

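Note that .string would return None in this case, because td has multiple children:

If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None

Demo: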

>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup(u"""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>> 
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
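
Applied to the loop from the question, the same idea could look roughly like the sketch below. The sample HTML, the second "IRIT" cell, and the helper name table_to_text are assumptions made for illustration; separator and strip are optional arguments of get_text(), used here so that the whitespace between tags does not end up in the result.

```python
from bs4 import BeautifulSoup

# Hypothetical sample table standing in for the page fetched in the question.
html = """
<table>
<tr><td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td><td>IRIT</td></tr>
</table>
"""


def table_to_text(html):
    """Return the first table as a list of rows, each a list of cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for row in table.find_all("tr"):
        # get_text() drops the markup; strip=True trims every text fragment
        # and skips empty ones, and separator joins what is left.
        rows.append([col.get_text(separator="\n", strip=True)
                     for col in row.find_all("td")])
    return rows


rows = table_to_text(html)
print(rows[0][0])
```

Each cell comes back as a plain string, so the rebuilt table is a list of lists that can be written out in whatever format is needed.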

Answer 2

Try calling col.string. That will give you just the text.
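
A small sketch of when .string yields text, assuming the html.parser backend and two made-up cells (the id attributes are only there to tell them apart): a cell holding a single link returns that link's text, while the multi-child cell from the question returns None, as the first answer notes. The stripped_strings generator is one possible fallback for that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one cell with a single child, one with several.
soup = BeautifulSoup(
    '<table><tr>'
    '<td id="single"><a href="http://www.irit.fr/SC">Signal et Communication</a></td>'
    '<td id="multi"><a href="http://www.irit.fr/SC">Signal et Communication</a>'
    '<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a></td>'
    '</tr></table>',
    "html.parser",
)

single = soup.find("td", id="single")
multi = soup.find("td", id="multi")

# With exactly one child, .string follows it down to the text.
single_text = single.string
# With several children, .string is defined to be None.
multi_text = multi.string
# stripped_strings yields every non-empty text fragment, whitespace trimmed.
fallback = list(multi.stripped_strings)

print(single_text)
print(multi_text)
print(fallback)
```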
