
Double for loop to extract data from several URLs

I am trying to get data from a website and write it to an Excel file for further work. I have a base URL scheme in which I have to change the "year" and the "reference number" accordingly:

http://calcio-seriea.net/presenze/"year"/"reference number"/

I have already written part of the code, but I have one issue. First, the year should stay fixed while the reference number takes every value in an interval of 18 numbers. Then the year increases by 1, and the reference number again takes every value in the next interval of 18. For example:

Y = 1998 RN = [1142:1159];
Y = 1999 RN = [1160:1177];
Y = 2000 RN = [1178:1195];
Y = … RN = …

Then, from 2004 onward, the interval contains 20 numbers, so

Y = 2004 RN = [1250:1269];
Y = 2005 RN = [1270:1289];

And so on, up to and including 2019.

This is the code I have so far:

import pandas as pd

year = str(1998)

all_items = []

for i in range(1142, 1160):  # range() excludes its stop, so 1160 covers 1142..1159
    pattern = "http://calcio-seriea.net/presenze/" + year + "/" + str(i) + "/"

    df = pd.read_html(pattern)[6]

    all_items.append(df)

pd.DataFrame(all_items).to_csv(r"C:\Users\glcve\Desktop\data.csv", index = False, header = False)

print("Done!")

Thanks to all in advance

All that's missing is a pd.concat in your code. However, since you're calling the same method over and over, let's write a function so you can keep your code DRY.

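import pandas as pd


def create_html_df(base_url, year, range_nums=()):
    """
    Returns a single dataframe built from several url/html tables.
    base_url : the url to target
    year : the target year
    range_nums : (first, last) reference numbers, both inclusive, e.g. (1142, 1159)
    """
    start, stop = range_nums
    # Strip any trailing slash so the joined url has no "//", and use
    # stop + 1 because range() excludes its upper bound.
    urls = [f"{base_url.rstrip('/')}/{year}/{i}/" for i in range(start, stop + 1)]
    dfs = []
    for each_url in urls:
        # Table index 6 is the one the question reads from each page.
        df = pd.read_html(each_url)[6]
        dfs.append(df)

    # Stack the per-page tables into one dataframe.
    return pd.concat(dfs)


final_df = create_html_df(base_url="http://calcio-seriea.net/presenze/",
                          year=1998,
                          range_nums=(1142, 1159))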
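
The function above handles a single year. The double loop from the question, with years on the outside and reference numbers on the inside, could be sketched roughly as follows. This is only a sketch: the names BASE_URL, season_ranges and season_urls are made up for illustration, and it assumes the reference numbers stay contiguous from 1142 onward, with 18 pages per year through 2003 and 20 per year from 2004 to 2019, as described in the question. It only builds the URLs; nothing is downloaded.

```python
from typing import Iterator, Tuple

# Hypothetical constant: the base URL from the question, without a trailing slash.
BASE_URL = "http://calcio-seriea.net/presenze"


def season_ranges(first_year: int = 1998, last_year: int = 2019,
                  first_ref: int = 1142) -> Iterator[Tuple[int, int, int]]:
    """Yield (year, first_ref, last_ref); both reference numbers are inclusive."""
    start = first_ref
    for year in range(first_year, last_year + 1):
        # Assumption taken from the question: 18 pages per year up to 2003,
        # 20 pages per year from 2004 onward.
        size = 18 if year < 2004 else 20
        yield year, start, start + size - 1
        start += size


def season_urls() -> Iterator[Tuple[int, str]]:
    """Outer loop over years, inner loop over that year's reference numbers."""
    for year, first, last in season_ranges():
        for ref in range(first, last + 1):  # +1 because range() excludes its stop
            yield year, f"{BASE_URL}/{year}/{ref}/"


# Usage sketch (needs pandas and network access, so it is left commented out):
# import pandas as pd
# frames = [pd.read_html(url)[6].assign(year=year) for year, url in season_urls()]
# pd.concat(frames, ignore_index=True).to_csv("data.csv", index=False)
```

Each URL can then be read with pd.read_html(url)[6], as in the function above, and the resulting frames combined with a single pd.concat before writing the file.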
