Python BeautifulSoup: removing self-closing tags

Question

I am trying to remove the <br> tags from some HTML using BeautifulSoup.

Example HTML:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>

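My Python code:

# Remove every <br> tag from the tree.
for link2 in soup.find_all('br'):
    link2.extract()
# Print the text of each qualification span.
for link2 in soup.find_all('span', {'class': 'qualification'}):
    print(link2.string)

The problem is that this code only prints the first qualification.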

Answer 1

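None of these <br> tags has a closing counterpart, so Beautiful Soup adds the closing tags automatically. Each later <br> ends up nested inside the first one, which produces the following HTML: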

In [23]: soup = BeautifulSoup(html)

In [24]: soup.br
Out[24]: 
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>

Calling Tag.extract on the first <br> tag removes all of its descendants, along with the strings those descendants contain:

In [27]: soup
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>
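
The way extract takes nested content with it can be reproduced with ordinary paired tags. This is a minimal sketch, assuming bs4 is installed and using the standard-library html.parser backend; the <p>, <b> and <i> markup is made up for illustration and is not the HTML from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <b> holds a string and a nested <i> tag.
html = "<p>keep <b>gone <i>also gone</i></b></p>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# taking every descendant (tags and strings) along with it.
removed = soup.b.extract()

remaining_text = soup.p.get_text()
removed_text = removed.get_text()
print(repr(remaining_text))
print(repr(removed_text))
```

Only the string outside <b> is left in the tree; everything that was nested inside <b> now lives in the detached tag.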

It looks like you only need to extract all the text from the span element. If that is the case, don't remove anything:

In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'

The Tag.text attribute returns all the strings inside the given tag, concatenated.
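
If one qualification per line is wanted instead of the raw text with its blank lines, get_text accepts a separator and a strip flag. This is a sketch rather than part of the original answer; it assumes bs4 is installed, uses html.parser, and shortens the question's HTML to two entries.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two entries only).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# strip=True trims each string and drops whitespace-only ones;
# separator is placed between the strings that remain.
text = span.get_text(separator="\n", strip=True)
lines = text.split("\n")
print(lines)
```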

Answer 2

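Using unwrap should work. It replaces each tag with its contents, so the text nested inside a <br> is kept:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)  # html is the markup from the question
for match in soup.find_all('br'):
    match.unwrap()  # replace each <br> with its contents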
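
One thing to be aware of after unwrapping: the span still holds several separate strings, so Tag.string returns None and the text has to be read through .strings or get_text(). The sketch below illustrates this on simplified, made-up markup, assuming bs4 is installed and html.parser is the backend.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the question's span.
html = "<span>first<br>second<br>third</span>"
soup = BeautifulSoup(html, "html.parser")

# Replace every <br> with its (possibly empty) contents.
for br in soup.find_all("br"):
    br.unwrap()

span = soup.span
no_br_left = soup.find("br") is None
pieces = list(span.strings)   # the strings stay separate children
single = span.string          # None, because there is more than one child
joined = span.get_text()
print(no_br_left, pieces, single, joined)
```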

Answer 3

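Here is one way to do it:

for link2 in soup.find_all('span', {'class': 'qualification'}):
    for s in link2.stripped_strings:
        print(s)

There is no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string in the tag with leading and trailing whitespace stripped. The print loop can be written more concisely as:

for link2 in soup.find_all('span', {'class': 'qualification'}):
    print(*link2.stripped_strings, sep='\n')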
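
When the qualifications are needed as data rather than printed output, the same generator can feed a list. This is a minimal sketch under the same assumptions as above (bs4 installed, html.parser backend, the question's HTML shortened to two entries); the name qualifications is chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two entries only).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# Flatten the stripped strings of every matching span into one list.
qualifications = [
    s
    for span in soup.find_all("span", class_="qualification")
    for s in span.stripped_strings
]
print(qualifications)
```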

3 answers

Answer 1
1 2016-07-27 10:21:30

Answer 2
0 2016-07-27 10:22:04

Answer 3
0 2016-07-27 10:37:38
