[英]Scraping multiple tables in one dynamic webpage in BeautifulSoup
I would like to scrape multiple tables from a dynamic webpage, https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18. I have tried the following code but am receiving the error below. I would like to get the output shown at the bottom.
    df = pd.DataFrame()
    driver = webdriver.Chrome('/Users/alau/Downloads/chromedriver')
    driver.get('https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18')
    res = driver.execute_script('return document.documentElement.outerHTML')
    time.sleep(3)
    driver.quit()

    soup = BeautifulSoup(res, 'lxml')
    tables = soup.find_all('table', {'class': 'bigborder'})
    subheads = soup.find_all('td', {'class': 'subheader'}).text.replace('\n', '!')

    def tableDataText(tables):
        rows = []
        trs = tables.find_all('tr')
        headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')]  # header row
        if headerow:  # if there is a header row include first
            rows.append(headerow)
            trs = trs[1:]
        for tr in trs:  # for every table row
            rows.append([td.get_text(strip=True) for td in tr.find_all('td')])  # data row
        return rows

    result_table = tableDataText(bt_table)
    df = pd.DataFrame(result_table[1:], columns=result_table[0])
    AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
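For illustration, the error can be reproduced and fixed on a hard-coded two-table snippet (hypothetical markup, not the live HKJC page): `find_all()` returns a `ResultSet`, which is a list of `Tag` objects, so `find_all` has to be called on each element rather than on the list itself.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: two tables with the target class
html = """
<table class="bigborder"><tr><th>Horse</th></tr><tr><td>LARSON</td></tr></table>
<table class="bigborder"><tr><th>Horse</th></tr><tr><td>GOOD DAYS</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table", {"class": "bigborder"})  # ResultSet: a list of Tags

# tables.find_all(...) raises the AttributeError above; iterate instead:
all_rows = []
for table in tables:  # each item is a Tag, which does have find_all()
    for tr in table.find_all("tr"):
        all_rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

print(all_rows)  # [['Horse'], ['LARSON'], ['Horse'], ['GOOD DAYS']]
```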
Output
You have to send a POST request with an anti-bot cookie to get the HTML in the response. Here's how to do it with BeautifulSoup:
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    cookies = {
        "BotMitigationCookie_9518109003995423458": "381951001600933518cRI6X6LoZp9tUD7Ls04ETZpx41s=",
    }

    url = "https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18"
    response = requests.post(url, cookies=cookies).text
    soup = BeautifulSoup(response, "html.parser")

    columns = [
        "Horse", "Jockey", "Trainer", "Draw", "Gear", "LBW",
        "Running Position", "Time", "Result", "Comment",
    ]

    def get_data():
        for table in soup.find_all("table", {"class": "bigborder"}):
            for tr in table.find_all("tr", {"bgcolor": "#eeeeee"}):
                yield [
                    i.find("font").getText().strip().replace(";", "")
                    for i in tr.find_all("td")
                ]

    df = pd.DataFrame(get_data(), columns=columns)
    df.to_csv("data.csv", index=False)
This gets you:
    import pandas as pd
    import requests

    cookies = {
        'BotMitigationCookie_9518109003995423458': '343775001600940465b2KTzJpwY5pXpiVNIRRi97Z3ELk='
    }

    def main(url):
        r = requests.post(url, cookies=cookies)
        df = pd.read_html(r.content, header=0, attrs={'class': 'bigborder'})
        new = pd.concat(df, ignore_index=True)
        print(new)
        new.to_csv("data.csv", index=False)

    main("https://racing.hkjc.com/racing/information/english/Horse/BTResult.aspx?Date=2020/09/18")
Output: view-online
Horse ... Comment
0 LARSON (D199) ... Being freshened up; led all the way to score.
1 PRIVATE ROCKET (C367) ... Sat behind the leader; ran on comfortably.
2 WIND N GRASS (D197) ... Slightly slow to begin; made progress under a ...
3 VOYAGE WARRIOR (C247) ... In 2nd position; slightly weakened late.
4 BEAUTY RUSH (C475) ... Bounded on jumping; settled midfield.
.. ... ... ...
59 BUNDLE OF DELIGHT (D236) ... Raced along the rail; ran on OK when persuaded.
60 GOOD DAYS (A333) ... Hit the line well when clear at 300m.
61 YOU HAVE MY WORD (V149) ... Well tested in the Straight; moved better than...
62 PLIKCLONE (D003) ... Average to begin; raced under his own steam.
63 REEVE'S MUNTJAC (C174) ... The stayer raced under his own steam to stretc...
[64 rows x 10 columns]
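The key step above is that `pd.read_html` returns one DataFrame per matching `<table>`, and `pd.concat(..., ignore_index=True)` stacks them into a single frame with a fresh index. A minimal sketch of that step in isolation, using two small hypothetical frames in place of the list `pd.read_html` returns:

```python
import pandas as pd

# Stand-ins for the list of per-table DataFrames that pd.read_html produces
dfs = [
    pd.DataFrame({"Horse": ["LARSON"], "Comment": ["led all the way to score."]}),
    pd.DataFrame({"Horse": ["GOOD DAYS"], "Comment": ["Hit the line well."]}),
]

# ignore_index=True discards each frame's own index and renumbers 0..n-1
combined = pd.concat(dfs, ignore_index=True)
print(combined.shape)           # (2, 2)
print(list(combined["Horse"]))  # ['LARSON', 'GOOD DAYS']
```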