Web scraping with Python and Pandas - Pagination

With this short piece of code I can get the data from the table:

import pandas as pd

df=pd.read_html('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular',parse_dates=True)

df[0].to_csv('2023_I_M_800.csv')

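I am trying to get the data from all pages, or from a set number of pages, but since the site doesn't use ul or li elements for its pagination, I don't know exactly how to build it.

Any help or ideas would be appreciated.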

Try this:

import pandas as pd

for page in range(1, 10):  # pages 1 through 9
    df = pd.read_html(f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular', parse_dates=True)
    df[0].to_csv(f'2023_I_M_800_page_{page}.csv')  # one CSV file per page
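The loop above assumes there are nine pages. If you don't know the page count in advance, one option is to keep requesting pages until one comes back without a table. Here is a minimal sketch of that idea. The names `read_page`, `scrape_pages` and the `max_pages` cap are made up for illustration, and I am assuming (not verified) that the site returns either no table or an empty table past the last page. `pd.read_html` does raise `ValueError` when it finds no tables.

```python
import pandas as pd

URL = ('https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023'
       '?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular')


def read_page(page):
    """Return the first table found on the given page number."""
    return pd.read_html(URL.format(page=page), parse_dates=True)[0]


def scrape_pages(read_page, max_pages=50):
    """Collect pages 1, 2, ... until a page has no table, is empty, or max_pages is hit."""
    frames = []
    for page in range(1, max_pages + 1):
        try:
            table = read_page(page)
        except ValueError:  # raised by pd.read_html when no tables are found
            break
        if table.empty:  # assumption: an empty table also means we ran past the end
            break
        frames.append(table)
    return frames
```

Calling `scrape_pages(read_page)` returns a list of DataFrames, one per page, which you can write out one by one with `to_csv` or combine with `pd.concat`. The reader function is passed in as an argument so the stopping logic can be tried with a fake reader before hitting the site.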

Since the URL contains the page number, why not simply loop over the pages and concatenate the results?

https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page=1&bestResultsOnly=false&oversizedTrack=regular

import pandas as pd

F, L = 1, 4 # first and last pages

dico = {}
for page in range(F, L+1):
    url = f'https://www.worldathletics.org/records/toplists/middle-long/800-metres/indoor/men/senior/2023?regionType=world&timing=electronic&page={page}&bestResultsOnly=false&oversizedTrack=regular'
    sub_df = pd.read_html(url, parse_dates=True)[0]
    sub_df.insert(0, "page_number", page)
    dico[page] = sub_df

out = pd.concat(dico, ignore_index=True)
# out.to_csv('2023_I_M_800.csv') # <- uncomment this line to make a .csv

Note: you can access each sub_df separately by using key-indexing notation: dico[num_page]

Output:

print(out)

     page_number  Rank  ...         Date Results Score
0              1     1  ...  22 JAN 2023          1230
1              1     2  ...  22 JAN 2023          1204
2              1     3  ...  29 JAN 2023          1204
3              1     4  ...  27 JAN 2023          1192
4              1     5  ...  28 JAN 2023          1189
..           ...   ...  ...          ...           ...
395            4   394  ...  21 JAN 2023           977
396            4   394  ...  28 JAN 2023           977
397            4   398  ...  27 JAN 2023           977
398            4   399  ...  28 JAN 2023           977
399            4   399  ...  29 JAN 2023           977

[400 rows x 11 columns]
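To see what passing a dict to `pd.concat` does without hitting the site, here is a small offline sketch. The two frames and their `Rank`/`Mark` values are made-up placeholders, not real results. Only the `dico` / `sub_df` pattern is taken from the code above.

```python
import pandas as pd

# Made-up stand-ins for the per-page tables (placeholder values, not real results)
dico = {}
for page in (1, 2):
    sub_df = pd.DataFrame({"Rank": [1, 2], "Mark": ["a", "b"]})
    sub_df.insert(0, "page_number", page)  # tag every row with its page
    dico[page] = sub_df

# Passing the dict concatenates its values in key order;
# ignore_index=True discards the keys and renumbers the rows from 0
out = pd.concat(dico, ignore_index=True)

# Each page is still available on its own through its key
page_two = dico[2]
```

Because `ignore_index=True` drops the dict keys from the result, the `page_number` column is what keeps track of which page each row came from.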

