Scraping HTML id attributes with BeautifulSoup
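I'm having trouble scraping the HTML ids from the HTML document below, because two lines under 14 Jun 2020 have no id. That means there are no more appointment slots after 8:15 AM on 14 June, and availability resumes on 15 June.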
<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
<thead>
<tr>
<th width="14%" class="text-left nowrap fixed-side">Session Date</th>
<th width="14%" class="text-center">
<b>1</b>
</th>
<th width="14%" class="text-center">
<b>2</b>
</tr>
</thead>
<tbody class="tr-border-bottom">
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
<br> Saturday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
<br> Sunday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
<br> Monday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
</tbody>
</table>
I came up with the code below, but it only prints the ids of the slots up to and including the 8:15 AM slot on 14 June 2020. After printing that id it raises a TypeError ('NoneType' object is not iterable), and the id of the 15 June slot is never printed.
for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tags = slots.find("a")
    for IDS in tags:
        IDS = tags.attrs["id"]
        print (IDS)
I also tried exception handling here, but I got a syntax error (and I'm not quite sure what I'm doing wrong).
for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tags = slots.find("a")
    for IDS in tags:
        try:
            IDS = tags.attrs["id"]
        except TypeError:
        else:
            print (IDS)
Just check whether the cell contains an a tag with an id attribute, and print the id only if it does.
data='''<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
<thead>
<tr>
<th width="14%" class="text-left nowrap fixed-side">Session Date</th>
<th width="14%" class="text-center">
<b>1</b>
</th>
<th width="14%" class="text-center">
<b>2</b>
</tr>
</thead>
<tbody class="tr-border-bottom">
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
<br> Saturday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
<br> Sunday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
<br> Monday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
</tbody>
</table>'''
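from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')
for slots in soup.findAll(attrs={"class": "pb-15 text-center"}):
    # id=True matches only <a> tags that carry an id attribute
    tag = slots.find("a", id=True)
    # find() returns None for the n/a cells, so skip those
    if tag:
        print(tag.attrs["id"])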
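For context, the loop in the question most likely fails because find("a") returns None for the n/a cells, which contain only a span, and iterating over None raises the TypeError. The sketch below shows this on a trimmed, hypothetical two-cell row (the variable names row, cells, found and ids are illustrative and not from the original post):

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical row: one bookable cell and one "n/a" cell.
row = '''<table><tr>
<td class="pb-15 text-center"><a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">8:15 AM</a></td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr></table>'''

soup = BeautifulSoup(row, "html.parser")
cells = soup.find_all(attrs={"class": "pb-15 text-center"})

# find() returns None when the cell has no <a> tag; looping over that
# None is what raises "'NoneType' object is not iterable".
found = [cell.find("a") for cell in cells]

# Guarding against None keeps the loop going past the n/a cells.
ids = [tag["id"] for tag in found if tag is not None]

print(found[1])
print(ids)
```

The second cell yields None, so any code that touches the result has to test for it first, which is what the `if tag:` check does.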
You can achieve the same result with a single CSS selector.
# The selector only matches <a> tags that have an id, so no extra check is needed
for slots in soup.select('.pb-15.text-center>a[id]'):
    print(slots.attrs["id"])
Output :
1217464_1_13/6/2020 12:00:00 AM
1217482_1_14/6/2020 12:00:00 AM
1217506_1_15/6/2020 12:00:00 AM
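If it helps to see which dates have free slots, the same selector can be applied row by row. This is only a sketch under assumptions: it uses a shortened, hypothetical copy of the table, takes the date from the link text in each row's th cell, and the name slots_by_date is made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical copy of the table: two rows, one with an n/a cell.
html = '''<table><tbody>
<tr>
<th class="pb-15 text-left fixed-side"><a href="javascript:changeDate('14 Jun 2020');">14 Jun 2020</a></th>
<td class="pb-15 text-center"><a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">8:15 AM</a></td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side"><a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a></th>
<td class="pb-15 text-center"><a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">8:15 AM</a></td>
</tr>
</tbody></table>'''

soup = BeautifulSoup(html, "html.parser")

slots_by_date = {}
for row in soup.select("tbody tr"):
    # The date is the text of the link inside the row's header cell.
    date = row.select_one("th a").get_text(strip=True)
    # Only <a> tags with an id are bookable slots; n/a cells match nothing.
    slots_by_date[date] = [a["id"] for a in row.select("td.pb-15.text-center > a[id]")]

for date, ids in slots_by_date.items():
    print(date, ids)
```

Because the n/a cells simply produce no matches, a date with no free slots would end up with an empty list instead of causing an error.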
Update
for slots in soup.findAll(attrs={"class": "pb-15 text-center"}):
    # attrs must be a dict: {"id": True} matches any <a> that has an id
    tag = slots.find("a", attrs={"id": True})
    if tag:
        print(tag.attrs["id"])
html = '''
<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
<thead>
<tr>
<th width="14%" class="text-left nowrap fixed-side">Session Date</th>
<th width="14%" class="text-center">
<b>1</b>
</th>
<th width="14%" class="text-center">
<b>2</b>
</tr>
</thead>
<tbody class="tr-border-bottom">
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
<br> Saturday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
<br> Sunday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
<td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr>
<tr>
<th class="pb-15 text-left fixed-side">
<a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
<br> Monday
</th>
<td class="pb-15 text-center">
<a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
8:15 AM ✔
</a>
</td>
</tr>
</tbody>
</table>'''
from bs4 import BeautifulSoup as bs

soup = bs(html, 'html.parser')
# Select every <a> inside the slot cells; n/a cells have no <a>, so they are skipped
slots = soup.select("td[class='pb-15 text-center'] a")
for slot in slots:
    # slot.attrs is a dictionary, so .get avoids an exception when the attribute is missing:
    # it returns '' if the tag has no id attribute
    slot_id = slot.attrs.get("id", '')
    print(slot_id)
Output:
1217464_1_13/6/2020 12:00:00 AM
1217482_1_14/6/2020 12:00:00 AM
1217506_1_15/6/2020 12:00:00 AM
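To make the .get behaviour concrete, here is a small sketch on a hypothetical snippet with one link that has an id and one that does not (the markup and the id value slot-1 are invented for illustration). Note that .get only protects against a missing attribute; it does not help when find() has returned None, because then there is no tag to call it on.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the first link has an id, the second does not.
snippet = '<td><a href="#" id="slot-1">8:15 AM</a><a href="#">n/a</a></td>'
soup = BeautifulSoup(snippet, "html.parser")
with_id, without_id = soup.find_all("a")

# Dictionary-style .get with a default never raises for a missing attribute.
print(with_id.attrs.get("id", ""))            # slot-1
print(repr(without_id.attrs.get("id", "")))   # ''

# Tag.get is a shortcut for the same lookup and defaults to None.
print(without_id.get("id"))                   # None

# has_attr filters out tags without an id instead of printing blanks.
ids = [a["id"] for a in soup.find_all("a") if a.has_attr("id")]
print(ids)
```

Filtering with has_attr (or a selector such as a[id]) avoids printing empty strings for links that carry no id.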