
Scraping multiple tables in one dynamic webpage in BeautifulSoup

I want to scrape multiple tables from the dynamic webpage https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18 . I tried the code below but got the error that follows. The output I'm after is shown at the bottom.

df = pd.DataFrame()
driver = webdriver.Chrome('/Users/alau/Downloads/chromedriver')
driver.get('https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18')
res = driver.execute_script('return document.documentElement.outerHTML')
time.sleep(3)
driver.quit()
soup = BeautifulSoup(res, 'lxml')
tables = soup.find_all('table', {'class':'bigborder'})
subheads = soup.find_all('td', {'class':'subheader'}).text.replace('\n','!')
def tableDataText(tables):       
    rows = []
    trs = tables.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row    
    return rows
result_table = tableDataText(bt_table)
df = pd.DataFrame(result_table[1:], columns=result_table[0])

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
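
For reference, the error comes from treating the list returned by find_all() as a single tag: a ResultSet has no .text or .find_all() of its own. A minimal sketch of the fix, reusing the tableDataText helper above and assuming each table parses cleanly, would be to loop over the tables one at a time:

# Sketch: iterate over the ResultSet and parse each table individually,
# reusing the tableDataText helper defined above.
frames = []
for table in tables:                        # 'tables' from the code above
    result_table = tableDataText(table)     # a single tag, not the whole list
    frames.append(pd.DataFrame(result_table[1:], columns=result_table[0]))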

Desired output:

[screenshot of the expected table]

You have to send a POST request with the anti-bot cookie to get the HTML in the response.

Here's how to do it with BeautifulSoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup


cookies = {
    "BotMitigationCookie_9518109003995423458": "381951001600933518cRI6X6LoZp9tUD7Ls04ETZpx41s=",
}
url = "https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18"

response = requests.post(url, cookies=cookies).text
# parse the page; find_all() is deferred to get_data() below, since it
# returns a ResultSet (a list of tags), not a single tag
soup = BeautifulSoup(response, "html.parser")

columns = [
    "Horse", "Jockey", "Trainer", "Draw", "Gear", "LBW",
    "Running Position", "Time", "Result", "Comment",
]


def get_data():
    # each 'bigborder' table is one barrier trial; its data rows carry
    # bgcolor="#eeeeee", which skips the header rows
    for table in soup.find_all("table", {"class": "bigborder"}):
        for tr in table.find_all("tr", {"bgcolor": "#eeeeee"}):
            yield [
                i.find("font").getText().strip().replace(";", "")
                for i in tr.find_all("td")
            ]


df = pd.DataFrame(list(get_data()), columns=columns)  # one list entry per data row
df.to_csv("data.csv", index=False)
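
The cookie value above is hard-coded and expires. One way to pick up a fresh one automatically, sketched here under the assumption that chromedriver is available as in the question, is to let Selenium load the page once and copy its cookies into requests:

from selenium import webdriver

driver = webdriver.Chrome()                 # assumes chromedriver is on PATH
driver.get(url)                             # the real browser passes the anti-bot check
cookies = {c["name"]: c["value"] for c in driver.get_cookies()}
driver.quit()

response = requests.post(url, cookies=cookies).text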

This gets you:

[screenshot of the resulting DataFrame]

Alternatively, pandas.read_html can parse all the matching tables in one call:

import pandas as pd
import requests

cookies = {
    'BotMitigationCookie_9518109003995423458': '343775001600940465b2KTzJpwY5pXpiVNIRRi97Z3ELk='
}


def main(url):
    r = requests.post(url, cookies=cookies)
    # read_html returns a list with one DataFrame per 'bigborder' table
    df = pd.read_html(r.content, header=0, attrs={'class': 'bigborder'})
    new = pd.concat(df, ignore_index=True)
    print(new)
    new.to_csv("data.csv", index=False)


main("https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18")

Output:

                       Horse  ...                                            Comment
0              LARSON (D199)  ...      Being freshened up; led all the way to score.        
1      PRIVATE ROCKET (C367)  ...         Sat behind the leader; ran on comfortably.        
2        WIND N GRASS (D197)  ...  Slightly slow to begin; made progress under a ...        
3      VOYAGE WARRIOR (C247)  ...           In 2nd position; slightly weakened late.        
4         BEAUTY RUSH (C475)  ...              Bounded on jumping; settled midfield.        
..                       ...  ...                                                ...        
59  BUNDLE OF DELIGHT (D236)  ...    Raced along the rail; ran on OK when persuaded.        
60          GOOD DAYS (A333)  ...              Hit the line well when clear at 300m.        
61   YOU HAVE MY WORD (V149)  ...  Well tested in the Straight; moved better than...        
62          PLIKCLONE (D003)  ...       Average to begin; raced under his own steam.        
63    REEVE'S MUNTJAC (C174)  ...  The stayer raced under his own steam to stretc...        

[64 rows x 10 columns]
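
Since read_html returns one DataFrame per trial table, you can also tag each row with the table it came from before concatenating. A small sketch, where the 'Batch' column name is just an illustration, not something on the page:

frames = pd.read_html(r.content, header=0, attrs={'class': 'bigborder'})  # r as in main() above
for i, frame in enumerate(frames, start=1):
    frame['Batch'] = i                      # hypothetical column: which table the row came from
combined = pd.concat(frames, ignore_index=True)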
