Scraping complicated tables with BeautifulSoup

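I'm working on a sports betting scraper, but I've run into a complicated table. The markup below shows what most of the elements look like. My main goal is to extract all the text from it (participant names, date and time, odds, etc.).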

<tr data-qa="pre-event" class="events-list__grid__event"><th scope="row" class="events-list__grid__info"><div class="events-list__grid__info__datetime"><div class="events-list__grid__info__datetime__time"> 20:05 </div> <div class="events-list__grid__info__datetime__date"> 24/07 </div></div> <a href="/cote/sara-errani-paula-ormaechea/27034463/" class="GTM-event-link events-list__grid__info__main" data-testid="TENN" title="WTA - Varșovia - Calificări (F)"><div class="events-list__grid__info__main__row"><div class="events-list__grid__info__main__participants"><div class="events-list__grid__info__main__participants__participant"><span class="events-list__grid__info__main__participants__participant-name"><!----> Sara Errani <!----></span> <!----></div><div class="events-list__grid__info__main__participants__participant"><span class="events-list__grid__info__main__participants__participant-name"><!----> Paula Ormaechea <!----></span> <!----></div> <!----></div> <div class="events-list__grid__info__main__actions"><span class="event-icons"><!----> <!----> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M18.545 6H5.455C4.655 6 4 6.668 4 7.5v9c0.825.655 1.5 1.455 1.5h13.09c.8 0 1.455-.675 1.455-1.5v-9c0-.832-.655-1.5-1.455-1.5zm0 10.5H5.455v-9h13.09v9zM9.818 9v6l5.091-3-5.09-3z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 kz-icon icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M7.833 19.5H9.5V8.03H7.833V19.5zm3.334 0h1.666v-15h-1.666v15zm-6.667 0h1.667v-7.941H4.5V19.5zm10 0h1.667V8.03H14.5V19.5zm3.333-7.941V19.5H19.5v-7.941h-1.667z"></path></svg> <svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" svg-inline="" role="presentation" focusable="false" tabindex="-1" class="icon--color-cloud-burst-500 icon--clickable kz-icon-xs has-tooltip" data-original-title="null"><path d="M14.2 4.534a.532.532 0 00-.344-.504.503.503 0 00-.572.17l-6.07 7.862a.96.96 0 00-.131.996c.147.33.466.542.817.542h1.928c.142 0.258.12.258.267v5.6c0.226.138.428.344.503a.503.503 0 00.572-.17l6.07-7.862a.96.96 0 00.13-.996.899.899 0 00-.817-.542h-1.928a.262.262 0 01-.257-.267v-5.6z"></path></svg> <!----> <!----></span> <!----></div></div> <!----></a></th> <td class="table__markets__market"><div><section><div class="table__markets__market__title"><div class="table__markets__market__title__text"> Câştigător </div> <div class="table__markets__market__title__markets"><a href="/cote/sara-errani-paula-ormaechea/27034463/" class="table__markets__market__title__markets__link"> +4 </a></div></div> <div class="selections"><button aria-label="Bet on Sara Errani with odds 1.17." data-selnid="2685084631" data-qa="pre-event-selection" class="selections__selection selections__selection--columns-2 GTM-selection-add" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span class="selections__selection__odd"><!--fragment#15ac200c85#head--> 1.17 <!--fragment#15ac200c85#tail--></span></button><button aria-label="Bet on Paula Ormaechea with odds 4.6." data-selnid="2685084632" data-qa="pre-event-selection" class="selections__selection selections__selection--columns-2 GTM-selection-add" mc-data="[object Object]" event-url=""><!----> <!----> <!----> <!----> <span class="selections__selection__odd"><!--fragment#80111e10a3#head--> 4.60 <!--fragment#80111e10a3#tail--></span></button> <!----></div></section></div></td><td class="table__markets__market"></td><td class="table__markets__market"></td> <td class="events-list__grid__total-markets"> +4 </td></tr>

In this case, what I need is: '20:05; 24/07; Sara Errani; Paula Ormaechea; +4; 1.17; 4.6' plus the link that sits above "Sara Errani".

How can I iterate over all the tr elements and extract the relevant data?

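Using html_doc containing the data from the question:

  1. Analyze the soup and create a mapping of the data to extract
    • Find the class/ID/name of the tags to extract (in this case, only classes)
    • Define which tags to extract and how many of each
    • Build your own mapping, which makes it possible to iterate
  2. Iterate over the mapping
    • Use your mapping to do the work
  3. Collect the results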
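
For step 1, one way to find candidate classes and how often each occurs is to count every (tag, class) pair in the soup. This is only a sketch: the html_doc below is a hypothetical, trimmed-down stand-in for the real row, and the counting approach is my own suggestion rather than part of the mapping code.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample (not the full row from the question)
html_doc = """
<tr class="events-list__grid__event">
  <td><div class="events-list__grid__info__datetime__time"> 20:05 </div></td>
  <td><span class="selections__selection__odd"> 1.17 </span>
      <span class="selections__selection__odd"> 4.60 </span></td>
</tr>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Count each (tag name, class attribute) pair; tags without a class are skipped
counts = Counter(
    (tag.name, " ".join(tag.get("class", [])))
    for tag in soup.find_all(True)
    if tag.get("class")
)

for (name, cls), n in counts.most_common():
    print(n, name, cls)
```

The count per row gives a hint for the third element of each mapping entry (how many elements to extract).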

Regards...

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
results = {}

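# Each mapping entry is [tag name, class attribute, number of elements to extract]
for k, lst in mappings.items():
    tag, cls, count = lst
    # Look up the matching elements once per entry instead of once per index
    elems = soup.find_all(tag, attrs={'class': cls})
    for i in range(count):
        if k != 'href':
            results[k + '_' + str(i + 1)] = elems[i].text.strip()
        else:
            # For the event link we want the URL, not the text
            results[k + '_' + str(i + 1)] = elems[i]['href']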

print(results)
#
#   Result:
#
#   { 
#     'time_1': '20:05', 
#     'date_1': '24/07', 
#     'href_1': '/cote/sara-errani-paula-ormaechea/27034463/', 
#     'name_1': 'Sara Errani', 
#     'name_2': 'Paula Ormaechea', 
#     'link_1': '+4', 
#     'odd_1': '1.17', 
#     'odd_2': '4.60'
#   }
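
The question asked for a single semicolon-separated line rather than a dict. A minimal sketch, assuming results has the shape printed above; the helper name to_line and the field order are my own choices, not part of the original code:

```python
# Sample dict copied from the result printed above
results = {
    'time_1': '20:05',
    'date_1': '24/07',
    'href_1': '/cote/sara-errani-paula-ormaechea/27034463/',
    'name_1': 'Sara Errani',
    'name_2': 'Paula Ormaechea',
    'link_1': '+4',
    'odd_1': '1.17',
    'odd_2': '4.60',
}

# Field order requested in the question, with the event link appended last
ORDER = ('time_1', 'date_1', 'name_1', 'name_2', 'link_1', 'odd_1', 'odd_2', 'href_1')

def to_line(row, order=ORDER):
    """Join the selected fields into one '; '-separated string (hypothetical helper)."""
    return '; '.join(row[key] for key in order)

line = to_line(results)
print(line)
# 20:05; 24/07; Sara Errani; Paula Ormaechea; +4; 1.17; 4.60; /cote/sara-errani-paula-ormaechea/27034463/
```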

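Addition:
Using your latest data as html_doc ( https://pastebin.com/nx6x00NX )
Added row iteration and event numbering.
The function pretty() is by STH (user:56338), from "How to pretty print nested dictionaries?"
If you can get the soup of the table definition, it will work with this row iteration; the rest of the code is the same as the original.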

from bs4 import BeautifulSoup

def pretty(dct, indent=0):      # function by ---> STH user:56338
    for key, value in dct.items():
        print('\t' * indent + str(key))
        if isinstance(value, dict):
            pretty(value, indent+1)
        else:
            print('\t' * (indent+1) + str(value))
         
soup = BeautifulSoup(html_doc, 'html.parser')

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "date": ["div", "events-list__grid__info__datetime__date", 1],
    "href": ["a", "GTM-event-link events-list__grid__info__main", 1],
    "name": ["span", "events-list__grid__info__main__participants__participant-name", 2],
    "link": ["a", "table__markets__market__title__markets__link", 1],
    "odd": ["span", "selections__selection__odd", 2]
    }
    
events = {}
# Each <tr> with this class is one event row
rows = soup.find_all("tr", attrs={'class': "events-list__grid__event"})
for nr, row_soup in enumerate(rows, start=1):
    results = {}    # fresh dict for every event
    for k, lst in mappings.items():
        tag, cls, count = lst
        # Search only inside the current row
        elems = row_soup.find_all(tag, attrs={'class': cls})
        for i in range(count):
            if k != 'href':
                results[k + '_' + str(i + 1)] = elems[i].text.strip()
            else:
                results[k + '_' + str(i + 1)] = elems[i]['href']
    events['event_' + str(nr)] = results
    
pretty(events)
#
'''     Result
event_1
        time_1
                22:47
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/sophia-yang-tatum-burger/27018714/
        name_1
                Sophia Yang
        name_2
                Tatum Burger
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_2
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/cleo-hutchinson-seha-yu/27018746/
        name_1
                Cleo Hutchinson
        name_2
                Seha YU
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_3
        time_1
                23:30
        date_1
                24/07
        href_1
                https://ro.betano.com/cote/laura-bente-josie-frazier/27018754/
        name_1
                Laura Bente
        name_2
                Josie Frazier
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
event_4
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/kelly-keller-emma-sun/27018749/
        name_1
                Kelly Keller
        name_2
                Emma Sun
        link_1
                +4
        odd_1
                1.45
        odd_2
                2.60
event_5
        time_1
                00:00
        date_1
                25/07
        href_1
                https://ro.betano.com/cote/nadia-kojonroj-tanvi-narendran/27018750/
        name_1
                Nadia Kojonroj
        name_2
                Tanvi Narendran
        link_1
                +4
        odd_1
                1.87
        odd_2
                1.87
'''
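
One caveat with the row loop: elems[i] raises an IndexError if a row has fewer matching elements than the mapping expects (for example, an event whose odds are not shown). A hedged sketch of a more defensive variant follows. The html_doc here is a hypothetical two-row sample, extract_row is a name I made up, and the href special case is left out to keep it short.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second row has no odds at all
html_doc = """
<table>
<tr class="events-list__grid__event">
  <td><div class="events-list__grid__info__datetime__time"> 20:05 </div></td>
  <td><span class="selections__selection__odd"> 1.17 </span>
      <span class="selections__selection__odd"> 4.60 </span></td>
</tr>
<tr class="events-list__grid__event">
  <td><div class="events-list__grid__info__datetime__time"> 21:00 </div></td>
</tr>
</table>
"""

mappings = {
    "time": ["div", "events-list__grid__info__datetime__time", 1],
    "odd": ["span", "selections__selection__odd", 2],
}

def extract_row(row_soup, mappings):
    """Extract the mapped fields from one row; missing fields become None."""
    results = {}
    for key, (tag, cls, count) in mappings.items():
        elems = row_soup.find_all(tag, class_=cls)
        for i in range(count):
            # Guard the index instead of assuming the element exists
            value = elems[i].get_text(strip=True) if i < len(elems) else None
            results[f"{key}_{i + 1}"] = value
    return results

soup = BeautifulSoup(html_doc, "html.parser")
events = [
    extract_row(row, mappings)
    for row in soup.find_all("tr", class_="events-list__grid__event")
]
print(events)
```

Rows with missing data then show up with None values instead of stopping the whole scrape.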

