
Double for loop to extract data from several urls

I am trying to get data from a website and write it to an Excel file for further processing. I have a main URL scheme in which I have to change the "year" and the "reference number" accordingly:

http://calcio-seriea.net/presenze/"year"/"reference number"/

I have already written part of the code, but I have one issue. The year should stay the same while the reference number takes every value in an interval of 18; then the year increases by 1, and the reference number takes every value in the next interval of 18. For example:

Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …

Then from year 2004 the interval becomes 20, so

Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];

This continues up to and including year 2019.
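Assuming the numbering starts at 1142 in 1998 and follows the interval widths above (18 numbers per year through 2003, 20 per year from 2004 through 2019), the intended year-to-range mapping can be sketched with a small generator (a hypothetical helper, not part of my code below):

```python
# Sketch: generate (year, first_rn, last_rn) triples, assuming the
# reference numbering starts at 1142 in 1998, with 18 numbers per year
# up to 2003 and 20 per year from 2004 through 2019 (last_rn inclusive).
def year_intervals():
    start = 1142
    for year in range(1998, 2020):
        size = 18 if year < 2004 else 20
        yield year, start, start + size - 1
        start += size
```

For example, `list(year_intervals())[0]` gives `(1998, 1142, 1159)` and the 2004 entry is `(2004, 1250, 1269)`, matching the intervals above.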

This is the code I have so far:

import pandas as pd

year = str(1998)

all_items = []

for i in range(1142, 1159):
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"

    df = pd.read_html(pattern)[6]

    all_items.append(df)

pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index = False, header = False)

print("Done!")

Thanks to all in advance.

All that's missing is a pd.concat; but as you're calling the same method over and over, let's write a function so you can keep your code DRY.

def create_html_df(base_url, year, range_nums=()):
    """
    Returns a dataframe built from the html tables at base_url/year/i.
    base_url   : the url to target
    year       : the target year
    range_nums : the inclusive range of reference numbers, e.g. (1, 50)
    """
    start, stop = range_nums
    # stop + 1 so the last reference number in the interval is included
    url_pat = [f"{base_url}{year}/{i}/" for i in range(start, stop + 1)]
    dfs = []
    for each_url in url_pat:
        df = pd.read_html(each_url)[6]
        dfs.append(df)

    return pd.concat(dfs)

final_df = create_html_df(base_url = "http://calcio-seriea.net/presenze/",
               year = 1998,
               range_nums = (1142, 1159))
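To cover every season, the interval logic from the question can drive `create_html_df` in a loop. A hedged sketch, assuming `range_nums` is treated as inclusive of its endpoint and that the reference numbers run as described (18 per year through 2003, 20 per year from 2004 to 2019, starting at 1142):

```python
import pandas as pd

# Build the (year, start, stop) jobs described in the question:
# 18 reference numbers per year up to 2003, 20 from 2004 through 2019.
jobs = []
start = 1142
for year in range(1998, 2020):
    size = 18 if year < 2004 else 20
    jobs.append((year, start, start + size - 1))
    start += size

def scrape_all(jobs):
    # Hits the network: each call to create_html_df fetches 18-20 pages.
    frames = [create_html_df(base_url="http://calcio-seriea.net/presenze/",
                             year=y, range_nums=(lo, hi))
              for y, lo, hi in jobs]
    return pd.concat(frames)

# scrape_all(jobs).to_csv(r"C:\Users\glcve\Desktop\data.csv",
#                         index=False, header=False)
```

The `to_csv` call is left commented out so the job list can be inspected before starting a long scrape.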
