![](/img/trans.png)
[英]web-scraping and pagination with python, csv, beautifulsoup and Pandas
[英]Web scraping with Python and Pandas - Pagination
使用這個簡短的代碼,我可以從表中獲取數據:
import pandas as pd
df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv('2023_I_M_800.csv')
我正在嘗試從所有頁面或確定數量的頁面獲取數據,但由於該網站不使用 lu 或 li elementsIdon't know exacxtly how to build it.
任何幫助或想法將不勝感激。
嘗試這個:
for page in range(1, 10):
df=pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)
df[0].to_csv(f'2023_I_M_800_page_{page}.csv')
由於concat
包含頁碼,為什么不直接進行循環和連接呢?
`https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic& page=1 &bestResultsOnly=false&oversizedTrack=regular
import pandas as pd
F, L = 1, 4 # first and last pages
dico = {}
for page in range(F, L+1):
url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
sub_df = pd.read_html(url, parse_dates=True)[0]
sub_df.insert(0, "page_number", page)
dico[page] = sub_df
out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv
注意:您可以使用鍵索引符號單獨訪問每個sub_df
: dico[num_page]
。
Output:
print(out)
page_number Rank ... Date Results Score
0 1 1 ... 22 JAN 2023 1230
1 1 2 ... 22 JAN 2023 1204
2 1 3 ... 29 JAN 2023 1204
3 1 4 ... 27 JAN 2023 1192
4 1 5 ... 28 JAN 2023 1189
.. ... ... ... ... ...
395 4 394 ... 21 JAN 2023 977
396 4 394 ... 28 JAN 2023 977
397 4 398 ... 27 JAN 2023 977
398 4 399 ... 28 JAN 2023 977
399 4 399 ... 29 JAN 2023 977
[400 rows x 11 columns]
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.