
Python BeautifulSoup remove self-closing tag

I am trying to remove the br tags from some HTML using BeautifulSoup.

Example HTML:

<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>

My Python code:

for link2 in soup.find_all('br'):
    link2.extract()
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(link2.string)

The problem is that this code only gets the first qualification.

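Because none of these <br> tags has a closing counterpart, Beautiful Soup (with the parser used here) adds the closing tags automatically, which nests each <br> inside the previous one and produces the following HTML: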

In [23]: soup = BeautifulSoup(html)

In [24]: soup.br
Out[24]: 
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>

Calling Tag.extract on the first <br> tag therefore removes all of its descendants, together with the strings they contain:

In [27]: soup
Out[27]: 
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>

It looks like you only need to extract all the text from the span element. If that is the case, don't remove anything:

In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'

The Tag.text attribute extracts all of the strings from the given tag.
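
If one qualification per line is wanted instead of the raw text with its blank lines, get_text also accepts a separator and a strip flag. The following is a minimal sketch, assuming the html.parser backend and a shortened copy of the markup from the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (two entries instead of four).
html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# strip=True trims each string and skips the empty ones;
# separator is placed between the strings that remain.
text = span.get_text(separator="\n", strip=True)
lines = text.split("\n")
print(text)
```

Nothing is removed from the tree here, so the <br> tags stay available for any later processing.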

Using unwrap should work:

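from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly
for match in soup.find_all('br'):
    match.unwrap()  # replace each <br> with whatever it contains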
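
One caveat, shown in the sketch below: after the <br> tags are unwrapped, the span still holds several separate text nodes, so .string (as used in the question) is still None and the text has to be read with get_text or stripped_strings. This assumes the html.parser backend; the short HTML string is a made-up stand-in for the question's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup standing in for the question's HTML.
html = ('<span class="qualification">Doctor of Philosophy<br>'
        'Master of Science<br>Bachelor of Science<br></span>')

soup = BeautifulSoup(html, "html.parser")
for br in soup.find_all("br"):
    br.unwrap()  # removes the tag itself, keeps anything nested inside it

span = soup.find("span", class_="qualification")
remaining_br = span.find("br")        # no <br> tags are left
single_string = span.string           # None: the span has several text nodes
pieces = list(span.stripped_strings)  # one entry per text node
print(pieces)
```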

Here is one way to do it:

for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)

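There is no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string inside the tag with leading and trailing whitespace stripped. The print loop can be written more concisely as: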

for link2 in soup.findAll('span',{'class':'qualification'}):
    print(*link2.stripped_strings, sep='\n')
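
If the qualifications are needed as data rather than printed, the same generator can be collected into a list. This is a minimal sketch, assuming the html.parser backend and the markup from the question; the name qualifications is only illustrative.

```python
from bs4 import BeautifulSoup

html = """<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per matching span, one string per qualification.
qualifications = [
    list(span.stripped_strings)
    for span in soup.find_all("span", class_="qualification")
]

for entry in qualifications[0]:
    print(entry)
```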

