[英]Scraping multiple tables in one dynamic webpage in BeautifulSoup
I would like to scrape multiple tables from a dynamic webpage, https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18. I have tried the following code but am receiving the error below. I would like to get the output shown at the bottom.
    df = pd.DataFrame()
    driver = webdriver.Chrome('/Users/alau/Downloads/chromedriver')
    driver.get('https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18')
    res = driver.execute_script('return document.documentElement.outerHTML')
    time.sleep(3)
    driver.quit()

    soup = BeautifulSoup(res, 'lxml')
    tables = soup.find_all('table', {'class': 'bigborder'})
    subheads = soup.find_all('td', {'class': 'subheader'}).text.replace('\n', '!')

    def tableDataText(tables):
        rows = []
        trs = tables.find_all('tr')
        headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
        if headerow:  # if there is a header row include first
            rows.append(headerow)
            trs = trs[1:]
        for tr in trs:  # for every table row
            rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
        return rows

    result_table = tableDataText(bt_table)
    df = pd.DataFrame(result_table[1:], columns=result_table[0])
    AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
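For illustration, the error can be reproduced and fixed on a hard-coded two-table snippet (hypothetical markup, not the live HKJC page): `find_all()` returns a `ResultSet`, which is a list of `Tag` objects, so `find_all` has to be called on each element rather than on the list itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables with the target class
html = """
<table class="bigborder"><tr><th>Horse</th></tr><tr><td>LARSON</td></tr></table>
<table class="bigborder"><tr><th>Horse</th></tr><tr><td>GOOD DAYS</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"class": "bigborder"})  # ResultSet: a list of Tags

# tables.find_all(...) raises the AttributeError above; iterate instead:
all_rows = []
for table in tables:  # each item is a Tag, which does have find_all()
    for tr in table.find_all("tr"):
        all_rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

print(all_rows)  # [['Horse'], ['LARSON'], ['Horse'], ['GOOD DAYS']]
```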
Output
You have to send a POST request with an anti-bot cookie to get the HTML in the response. Here's how to do it with BeautifulSoup:
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    cookies = {
        "BotMitigationCookie_9518109003995423458": "381951001600933518cRI6X6LoZp9tUD7Ls04ETZpx41s=",
    }

    url = "https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18"
    response = requests.post(url, cookies=cookies).text
    soup = BeautifulSoup(response, "html.parser")

    columns = [
        "Horse", "Jockey", "Trainer", "Draw", "Gear", "LBW",
        "Running Position", "Time", "Result", "Comment",
    ]

    def get_data():
        for table in soup.find_all("table", {"class": "bigborder"}):
            for tr in table.find_all("tr", {"bgcolor": "#eeeeee"}):
                yield [
                    i.find("font").getText().strip().replace(";", "")
                    for i in tr.find_all("td")
                ]

    df = pd.DataFrame(get_data(), columns=columns)
    df.to_csv("data.csv", index=False)
This gets you:
    import pandas as pd
    import requests

    cookies = {
        'BotMitigationCookie_9518109003995423458': '343775001600940465b2KTzJpwY5pXpiVNIRRi97Z3ELk='
    }

    def main(url):
        r = requests.post(url, cookies=cookies)
        df = pd.read_html(r.content, header=0, attrs={'class': 'bigborder'})
        new = pd.concat(df, ignore_index=True)
        print(new)
        new.to_csv("data.csv", index=False)

    main("https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18")
Output: view-online
Horse ... Comment
0 LARSON (D199) ... Being freshened up; led all the way to score.
1 PRIVATE ROCKET (C367) ... Sat behind the leader; ran on comfortably.
2 WIND N GRASS (D197) ... Slightly slow to begin; made progress under a ...
3 VOYAGE WARRIOR (C247) ... In 2nd position; slightly weakened late.
4 BEAUTY RUSH (C475) ... Bounded on jumping; settled midfield.
.. ... ... ...
59 BUNDLE OF DELIGHT (D236) ... Raced along the rail; ran on OK when persuaded.
60 GOOD DAYS (A333) ... Hit the line well when clear at 300m.
61 YOU HAVE MY WORD (V149) ... Well tested in the Straight; moved better than...
62 PLIKCLONE (D003) ... Average to begin; raced under his own steam.
63 REEVE'S MUNTJAC (C174) ... The stayer raced under his own steam to stretc...
[64 rows x 10 columns]
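The key step above is that `pd.read_html` returns one DataFrame per matching `<table>`, and `pd.concat(..., ignore_index=True)` stacks them into a single frame with a fresh index. A minimal sketch of that step in isolation, using two small hypothetical frames in place of the list `pd.read_html` returns:

```python
import pandas as pd

# Stand-ins for the list of per-table DataFrames that pd.read_html produces
dfs = [
    pd.DataFrame({"Horse": ["LARSON"], "Comment": ["led all the way to score."]}),
    pd.DataFrame({"Horse": ["GOOD DAYS"], "Comment": ["Hit the line well."]}),
]

# ignore_index=True discards each frame's own index and renumbers 0..n-1
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)           # (2, 2)
print(list(combined["Horse"]))  # ['LARSON', 'GOOD DAYS']
```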