
Scraping html id with beautifulsoup

I'm having trouble scraping the HTML ids from the HTML below, because two cells in the 14 Jun 2020 row have no id. That means there are no more booking slots after 8:15 AM on 14 June; slots resume on 15 June.

<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
    <thead>
        <tr>
            <th width="14%" class="text-left nowrap fixed-side">Session Date</th>
            <th width="14%" class="text-center">
                <b>1</b>
            </th>
            <th width="14%" class="text-center">
                <b>2</b>
        </tr>
    </thead>
    <tbody class="tr-border-bottom">
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
                <br> Saturday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>

        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
                <br> Sunday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
                <br> Monday
            </th>

            <td class="pb-15 text-center">
                <a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>
        </tr>
    </tbody>
</table>

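I came up with the code below, but it only prints the HTML ids of the slots up to and including 8:15 AM on 14 June 2020. After printing the id for that slot, it hits a TypeError ('NoneType' object is not iterable), and the id of the 15 June slot is never printed.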

for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tags = slots.find("a")
    for IDS in tags:
        IDS = tags.attrs["id"]
    print (IDS)

I also tried exception handling here, but I get a syntax error (and I'm not sure exactly what I did wrong).

for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tags = slots.find("a")
    for IDS in tags:
        try:
            IDS = tags.attrs["id"]
        except TypeError:
            else:
            print (IDS)

Just check whether the cell contains an <a> tag with an id attribute, and print the id if it does.

data='''<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
    <thead>
        <tr>
            <th width="14%" class="text-left nowrap fixed-side">Session Date</th>
            <th width="14%" class="text-center">
                <b>1</b>
            </th>
            <th width="14%" class="text-center">
                <b>2</b>
        </tr>
    </thead>
    <tbody class="tr-border-bottom">
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
                <br> Saturday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>

        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
                <br> Sunday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">


                                             8:15 AM ✔
                                                                    </a>
            </td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
                <br> Monday
            </th>

            <td class="pb-15 text-center">
                <a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>
        </tr>
    </tbody>
</table>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tag= slots.find("a",id=True)
    if tag:
        print(tag.attrs["id"])
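The original loop failed because find("a") returns None for cells that only hold the n/a span, and looping over None raises the TypeError. The sketch below reproduces that on a trimmed-down, hypothetical copy of the table (the sample string is illustrative, not the asker's full page) and shows one way the try/except attempt could be written so that it parses and skips the empty cells.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened table: one bookable cell and one "n/a" cell.
sample = '''
<table><tr>
  <td class="pb-15 text-center"><a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">8:15 AM</a></td>
  <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr></table>
'''

soup = BeautifulSoup(sample, "html.parser")
cells = soup.find_all("td", class_="pb-15 text-center")

# find() returns None when a cell has no <a>; iterating over that None
# is what raised "'NoneType' object is not iterable" in the question.
results = [cell.find("a") for cell in cells]

# A try/except variant: every except clause needs a body of its own.
ids = []
for cell in cells:
    try:
        ids.append(cell.find("a")["id"])
    except TypeError:  # find() gave None, and None["id"] is not allowed
        continue

print(ids)
```

The explicit "if tag:" check in the answer is usually easier to read than catching the exception, but both skip the n/a cells.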

You can achieve the same thing with a single CSS selector.

for slots in soup.select('.pb-15.text-center>a[id]'):
    if slots:
        print(slots.attrs["id"])

Output

1217464_1_13/6/2020 12:00:00 AM
1217482_1_14/6/2020 12:00:00 AM
1217506_1_15/6/2020 12:00:00 AM
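To unpack the selector: .pb-15.text-center matches elements carrying both classes, > restricts the match to direct children, and a[id] keeps only <a> tags that have an id attribute. The following is a minimal sketch on a hypothetical one-row sample (not the full page) showing that the date link in the <th> and the n/a cell are both skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical single row: a date link without an id, one slot, one "n/a" cell.
sample = '''
<table><tr>
  <th class="pb-15 text-left fixed-side"><a href="#">14 Jun 2020</a></th>
  <td class="pb-15 text-center"><a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">8:15 AM</a></td>
  <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
</tr></table>
'''

soup = BeautifulSoup(sample, "html.parser")

# select() always returns a list of matching tags (possibly empty),
# so no None check is needed on the individual results.
slot_ids = [a["id"] for a in soup.select(".pb-15.text-center > a[id]")]

print(slot_ids)
```

Because select() never yields None, the "if slots:" guard in the loop above is harmless but not required.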

Update

for slots in soup.findAll(attrs={"class" : "pb-15 text-center"}):
    tag = slots.find("a", attrs={"id": True})
    if tag:
        print(tag.attrs["id"])

html = '''
<table class="table table-borderless table-striped no-background clear-padding-first-child available-slots-mobile main-table clone">
    <thead>
        <tr>
            <th width="14%" class="text-left nowrap fixed-side">Session Date</th>
            <th width="14%" class="text-center">
                <b>1</b>
            </th>
            <th width="14%" class="text-center">
                <b>2</b>
        </tr>
    </thead>
    <tbody class="tr-border-bottom">
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('13 Jun 2020');">13 Jun 2020</a>
                <br> Saturday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217464_1_13/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>

        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('14 Jun 2020');">13 Jun 2020</a>
                <br> Sunday
            </th>
            <td class="pb-15 text-center">
                <a href="#" id="1217482_1_14/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
            <td class="pb-15 text-center"><span class="c-gray">n/a</span></td>
        </tr>
        <tr>
            <th class="pb-15 text-left fixed-side">
                <a href="javascript:changeDate('15 Jun 2020');">15 Jun 2020</a>
                <br> Monday
            </th>

            <td class="pb-15 text-center">
                <a href="#" id="1217506_1_15/6/2020 12:00:00 AM" class="slotBooking">
                                                                        8:15 AM ✔
                                                                    </a>
            </td>
        </tr>
    </tbody>
</table>'''

from bs4 import BeautifulSoup as bs

soup = bs(html, 'html.parser')
# The selector only returns <a> tags that exist, so there is no None to trip over.
slots = soup.select("td[class='pb-15 text-center'] a")
for slot in slots:
    # slot.attrs is a dictionary, so .get() returns '' instead of raising
    # when a tag has no id attribute.
    slot_id = slot.attrs.get("id", '')
    print(slot_id)

Output:

1217464_1_13/6/2020 12:00:00 AM
1217482_1_14/6/2020 12:00:00 AM
1217506_1_15/6/2020 12:00:00 AM
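The difference between indexing and .get() can be seen on a tag that has no id at all. This is a small sketch on a hypothetical one-line fragment, not the asker's page:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an <a> tag without an id attribute.
soup = BeautifulSoup('<td><a href="#">no id here</a></td>', "html.parser")
link = soup.find("a")

missing = link.get("id")        # None: attribute absent, no default given
fallback = link.get("id", "")   # '': the supplied default is returned

# Indexing the tag directly raises KeyError for a missing attribute.
try:
    link["id"]
    raised = False
except KeyError:
    raised = True

print(repr(missing), repr(fallback), raised)
```

Note that with a '' default, a tag without an id would print as an empty line; filtering with a[id] in the selector avoids that.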


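Notice: technical posts on this site follow the CC BY-SA 4.0 license. If you repost this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.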

 