Python beautifulsoup remove self closing tag
I am trying to remove the <br> tags from some HTML using BeautifulSoup.
Example HTML:
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>
My Python code:
for link2 in soup.find_all('br'):
    link2.extract()
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(link2.string)
The problem is that this code only returns the first qualification.
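Because none of these <br> tags has a closing counterpart, Beautiful Soup adds the closing tags automatically, which in this case produces the following HTML: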
In [23]: soup = BeautifulSoup(html)
In [24]: soup.br
Out[24]:
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br/></br></br></br>
Calling Tag.extract on the first <br> tag removes that tag together with all of its descendants and the strings they contain:
In [27]: soup
Out[27]:
<span class="qualification" style="font-size:14px; font-family: Helvetica, sans-serif;">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
</span>
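The effect of Tag.extract on nested content can be reproduced in isolation. The sketch below is only an illustration: the <p>/<b>/<i> markup is hypothetical (not from the question) and is chosen because its nesting does not depend on how a given parser treats <br>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate extract() on a tag with children.
html = "<p>keep <b>drop <i>nested</i></b> tail</p>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches the tag from the tree and returns it,
# together with everything nested inside it.
removed = soup.b.extract()

print(repr(removed.get_text()))  # text that left the tree with <b>
print(repr(soup.p.get_text()))   # text that is still inside <p>
```

Everything inside the extracted tag leaves the tree with it, which is why the later qualifications disappear when they end up nested inside the first <br>.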
It looks like you just need to extract all of the text from the span element. If that is the case, don't remove anything:
In [28]: soup.span.text
Out[28]: '\nDoctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas\n\nMaster of Science (Computer Science), Government College University Lahore\n\nMaster of Science ( Computer Science ), University of Agriculture Faisalabad\n\nBachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad\n'
The Tag.text attribute extracts all of the strings from the given tag.
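If the extra newlines in that output are unwanted, one possible variation is get_text() with its separator and strip arguments. This is a minimal sketch, assuming the html.parser backend and a shortened, hypothetical version of the span from the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the markup in the question.
html = """<span class="qualification">
First degree, University A
<br>
Second degree, University B
<br></span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="qualification")

# strip=True trims each string and drops empty ones;
# separator is placed between the remaining pieces.
text = span.get_text(separator="\n", strip=True)
print(text)
```

Nothing is removed from the tree here, so the <br> tags are still available if later processing needs them.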
Using unwrap should work:
soup = BeautifulSoup(html)
for match in soup.findAll('br'):
    # unwrap() replaces the tag with its contents, so nested text is kept
    match.unwrap()
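A self-contained version of the unwrap approach might look like the sketch below. The short markup is hypothetical and the html.parser backend is an assumption. Whether or not the parser nests later content inside a <br>, unwrap() keeps that content in place and removes only the tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup with unclosed <br> tags.
html = "<span class='qualification'>First degree<br>Second degree<br></span>"
soup = BeautifulSoup(html, "html.parser")

for br in soup.find_all("br"):
    br.unwrap()  # drop the tag, keep whatever it contained

span = soup.find("span", class_="qualification")
remaining = soup.find_all("br")  # no <br> tags should be left
text = span.get_text()           # the pieces are joined without a separator
print(remaining, repr(text))
```

After unwrapping, the span may hold several adjacent strings, so span.string can still be None. get_text() is the safer way to read the result.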
Here is one way to do it:
for link2 in soup.findAll('span',{'class':'qualification'}):
    for s in link2.stripped_strings:
        print(s)
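There is no need to remove the <br> tags unless you need them gone for later processing. Here link2.stripped_strings is a generator that yields each string in the tag with leading and trailing whitespace stripped. The print loop can be written more concisely as: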
for link2 in soup.findAll('span',{'class':'qualification'}):
    print(*link2.stripped_strings, sep='\n')
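If the qualifications are needed as data rather than printed output, the same generator can be collected into a list. The following is a sketch under the assumption that the html.parser backend is used; the variable name qualifications is illustrative, and the markup is the span from the question with the style attribute omitted.

```python
from bs4 import BeautifulSoup

html = """<span class="qualification">
Doctor of Philosophy ( Software Engineering ), Universiti Teknologi Petronas
<br>
Master of Science (Computer Science), Government College University Lahore
<br>
Master of Science ( Computer Science ), University of Agriculture Faisalabad
<br>
Bachelor of Science (Hons) ( Agriculture ),University of Agriculture Faisalabad
<br></span>"""

soup = BeautifulSoup(html, "html.parser")

qualifications = []
for span in soup.find_all("span", class_="qualification"):
    # stripped_strings is a generator; extend() consumes it into the list
    qualifications.extend(span.stripped_strings)

for q in qualifications:
    print(q)
```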