Scraping multiple tables in one dynamic webpage in BeautifulSoup

I would like to scrape multiple tables from a dynamic webpage: https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18. I have tried the following code, but it raises the error shown below it. I would like to get the output shown at the bottom.

df = pd.DataFrame()
driver = webdriver.Chrome('/Users/alau/Downloads/chromedriver')
driver.get('https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18')
res = driver.execute_script('return document.documentElement.outerHTML')
time.sleep(3)
driver.quit()
soup = BeautifulSoup(res, 'lxml')
tables = soup.find_all('table', {'class':'bigborder'})
subheads = soup.find_all('td', {'class':'subheader'}).text.replace('\n','!')
def tableDataText(tables):       
    rows = []
    trs = tables.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header row
    if headerow: # if there is a header row include first
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data row    
    return rows
result_table = tableDataText(bt_table)
df = pd.DataFrame(result_table[1:], columns=result_table[0])

AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
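
The error message can be reproduced on a small sample. The sketch below uses a made-up two-table HTML snippet (not the real page) to show that find_all returns a list-like ResultSet, so find_all has to be called on each table inside it rather than on the ResultSet itself:

```python
from bs4 import BeautifulSoup

# Hypothetical two-table sample standing in for the real page
html = """
<table class="bigborder"><tr><td>A</td><td>1</td></tr></table>
<table class="bigborder"><tr><td>B</td><td>2</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"class": "bigborder"})  # list-like ResultSet

# tables.find_all("tr") would raise the AttributeError quoted above.
# Calling find_all on each Tag inside the ResultSet works:
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for table in tables
    for tr in table.find_all("tr")
]
print(rows)  # [['A', '1'], ['B', '2']]
```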

Output

[image of the desired output]

You have to send a POST request with an anti-bot cookie to get the HTML in the response.

Here's how to do it with BeautifulSoup:

import pandas as pd
import requests
from bs4 import BeautifulSoup


cookies = {
    "BotMitigationCookie_9518109003995423458": "381951001600933518cRI6X6LoZp9tUD7Ls04ETZpx41s=",
}
url = "https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18"

response = requests.post(url, cookies=cookies).text
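# Keep the whole parsed document; get_data() below looks up the tables itself.
# (Calling find_all here as well would leave a ResultSet, which has no find_all.)
soup = BeautifulSoup(response, "html.parser")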

columns = [
    "Horse", "Jockey", "Trainer", "Draw", "Gear", "LBW",
    "Running Position", "Time", "Result", "Comment",
]


def get_data():
    for table in soup.find_all("table", {"class": "bigborder"}):
        for tr in table.find_all("tr", {"bgcolor": "#eeeeee"}):
            yield [
                i.find("font").getText().strip().replace(";", "")
                for i in tr.find_all("td")
            ]


df = pd.DataFrame(list(get_data()), columns=columns)
df.to_csv("data.csv", index=False)

This writes the scraped rows to data.csv.
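
The cookie value above is hardcoded. Assuming it can stop being accepted at some point (the answer does not say either way), it may be worth checking that the response actually contains result tables before building the DataFrame. A minimal sketch, where count_result_tables and both HTML samples are hypothetical:

```python
from bs4 import BeautifulSoup


def count_result_tables(html: str) -> int:
    """Return how many <table class="bigborder"> elements the HTML contains."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("table", {"class": "bigborder"}))


# Hypothetical samples: one page with a result table, one page without any
ok = '<html><body><table class="bigborder"><tr><td>x</td></tr></table></body></html>'
blocked = "<html><body><script>/* challenge */</script></body></html>"

print(count_result_tables(ok))       # 1
print(count_result_tables(blocked))  # 0
```

With the code above, count_result_tables(response) returning 0 would suggest the request did not get the real page back.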

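Alternatively, pandas.read_html can parse the tables directly:

from io import BytesIO

import pandas as pd
import requests

cookies = {
    'BotMitigationCookie_9518109003995423458': '343775001600940465b2KTzJpwY5pXpiVNIRRi97Z3ELk='
}


def main(url):
    r = requests.post(url, cookies=cookies)
    # read_html returns one DataFrame per matching table. Recent pandas versions
    # deprecate passing literal HTML, so wrap the bytes in a file-like object.
    tables = pd.read_html(BytesIO(r.content), header=0, attrs={'class': 'bigborder'})
    new = pd.concat(tables, ignore_index=True)
    print(new)
    new.to_csv("data.csv", index=False)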


main("https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18")
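
Concatenating with ignore_index=True discards which table each row came from. If that matters, one option is to label the frames with the keys argument of pd.concat. The sketch below uses two small made-up DataFrames in place of the list returned by pd.read_html; the column name "table" is an arbitrary choice:

```python
import pandas as pd

# Hypothetical stand-ins for the list of DataFrames returned by pd.read_html
batch_1 = pd.DataFrame({"Horse": ["A", "B"], "Draw": [1, 2]})
batch_2 = pd.DataFrame({"Horse": ["C"], "Draw": [3]})
frames = [batch_1, batch_2]

# keys= labels each source frame; reset_index turns that label into a column
combined = (
    pd.concat(frames, keys=list(range(1, len(frames) + 1)), names=["table", "row"])
    .reset_index(level="table")
    .reset_index(drop=True)
)
print(combined["table"].tolist())  # [1, 1, 2]
```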

Output:

                       Horse  ...                                            Comment
0              LARSON (D199)  ...      Being freshened up; led all the way to score.        
1      PRIVATE ROCKET (C367)  ...         Sat behind the leader; ran on comfortably.        
2        WIND N GRASS (D197)  ...  Slightly slow to begin; made progress under a ...        
3      VOYAGE WARRIOR (C247)  ...           In 2nd position; slightly weakened late.        
4         BEAUTY RUSH (C475)  ...              Bounded on jumping; settled midfield.        
..                       ...  ...                                                ...        
59  BUNDLE OF DELIGHT (D236)  ...    Raced along the rail; ran on OK when persuaded.        
60          GOOD DAYS (A333)  ...              Hit the line well when clear at 300m.        
61   YOU HAVE MY WORD (V149)  ...  Well tested in the Straight; moved better than...        
62          PLIKCLONE (D003)  ...       Average to begin; raced under his own steam.        
63    REEVE'S MUNTJAC (C174)  ...  The stayer raced under his own steam to stretc...        

[64 rows x 10 columns]
