I want to collect all CSV files from the GitHub repository linked below and combine them into a single new CSV file (for data-cleaning purposes), so that the new file contains data from all dates. Using the following command, I can load only 01-01-2021.csv:
import numpy as np
import pandas as pd
import requests
df = pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2021.csv')
df.head()
How can I load all the CSV files at once?
Here is a short solution using pandas, requests, and BeautifulSoup to filter all the CSV links:
import pandas as pd
import requests
from bs4 import BeautifulSoup, SoupStrainer
html = requests.get('https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports')
dfs = []
for link in BeautifulSoup(html.text, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href') and link['href'].endswith('.csv'):
        url = 'https://github.com' + link['href'].replace('/blob/', '/raw/')
        dfs.append(pd.read_csv(url))
df = pd.concat(dfs)
NB: testing this code, it runs in ~12 min and yields a final dataframe of 2,300,506 rows × 21 columns. Ideally one should add multi-threading to download several files in parallel (within reason, so as not to get kicked by the server).
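A minimal sketch of the multi-threading suggested above, using the standard library's `ThreadPoolExecutor`. The function name `fetch_all`, the worker count, and the injectable `loader` parameter are illustrative choices, not part of the original answer:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def fetch_all(urls, loader=pd.read_csv, workers=8):
    # A modest pool size avoids hammering the server;
    # pool.map preserves the order of the input URLs
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return pd.concat(pool.map(loader, urls), ignore_index=True)
```

Passing a custom `loader` also makes the download logic easy to test without network access.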
Check out pd.concat.
Assume that you have all file links:
dfs = []
for l in links:
    df = pd.read_csv(l)
    dfs.append(df)
final_df = pd.concat(dfs)
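To see what `pd.concat` does with the overlapping default indexes of per-file reads, a toy example (the frames here are made up for illustration):

```python
import pandas as pd

# Two frames with overlapping default indexes, like two per-file reads
df1 = pd.DataFrame({"cases": [1, 2]})
df2 = pd.DataFrame({"cases": [3]})

# ignore_index=True renumbers the result 0..n-1 instead of
# keeping the duplicate per-file indexes
combined = pd.concat([df1, df2], ignore_index=True)
```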
The link you have provided has the CSV file names in the format month-day-year.csv, so I have made a loop to build each filename and load the CSV directly from the given URL. This should work unless the website uses a random naming convention for the CSV files.
years = [2020, 2021]
months = list(range(1, 13))
days = list(range(1, 32))  # include day 31

URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports'

all_files = []
for year in years:
    for month in months:
        month = str(month).zfill(2)
        for day in days:
            day = str(day).zfill(2)
            print(f"{month}-{day}-{year}.csv")
            try:
                df = pd.read_csv(URL + f"/{month}-{day}-{year}.csv")
            except Exception:
                continue  # skip dates that do not exist in the repo (e.g. 02-30)
            all_files.append(df)

final_csv_file = pd.concat(all_files, axis=0, ignore_index=True)
This is a snapshot of the output I got from the above code. Here I looped over only the elements 1 and 2 for both day and month, and the year 2021. As long as the website has a non-random naming convention, this should work.
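As a variation on the nested loops above, `pandas.date_range` only yields real calendar dates, so no requests are wasted on filenames like 02-30. A short sketch (the variable names are illustrative):

```python
import pandas as pd

URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports'

# date_range skips impossible dates like Feb 30 automatically
dates = pd.date_range('2021-01-01', '2021-01-03')
# strftime reproduces the repo's zero-padded mm-dd-yyyy naming
filenames = [d.strftime('%m-%d-%Y') + '.csv' for d in dates]
urls = [f"{URL}/{name}" for name in filenames]
```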
Here you go! You can specify the start and end dates to get all the data between them. This also checks whether the URL for a particular date is valid, and only adds it to the final data frame if it is.
import requests
import pandas as pd
def is_leap_year(year):
    # Checks if the given year is a leap year
    """
    params:
        year - int
    returns:
        bool
    """
    return (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0)
def split_date(date_str):
    # Splits the date into month, day and year
    """
    params:
        date_str - str (mm-dd-yyyy)
    returns:
        month - int
        day - int
        year - int
    """
    # US order; for the rest of the world, feel free to swap month and day
    month, day, year = (int(x) for x in date_str.split("-"))
    return month, day, year
def generate_dates(start_date, end_date):
    # This doesn't validate the dates; it assumes both are valid and
    # that end_date >= start_date.
    # Generates all dates between start date and end date, taking leap
    # years into account.
    """
    params:
        start_date - str (mm-dd-yyyy)
        end_date - str (mm-dd-yyyy)
    returns:
        dates - list of strings of dates between start_date and end_date
    """
    days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    dates = []
    start_month, start_day, start_year = split_date(start_date)
    end_month, end_day, end_year = split_date(end_date)
    year = start_year
    while year <= end_year:
        month = start_month if year == start_year else 1
        max_month = end_month if year == end_year else 12
        while month <= max_month:
            # Start from start_day only in the very first month
            day = start_day if (year == start_year and month == start_month) else 1
            if month == 2:
                max_day = 29 if is_leap_year(year) else 28
            else:
                max_day = days_in_month[month - 1]
            if year == end_year and month == end_month:
                max_day = end_day
            while day <= max_day:
                dates.append(f"{month}-{day}-{year}")
                day += 1
            month += 1
        year += 1
    return dates
def check_if_url_is_valid(url):
    # Checks if the URL is valid by making a GET request with the
    # requests library; the URL counts as valid if the status code
    # is in the 200-299 range
    """
    params:
        url - str
    returns:
        bool
    """
    r = requests.get(url)
    return r.status_code in range(200, 300)
def to_df(base_url, start_date, end_date):
    # Takes all the generated dates, builds a URL for each date from the
    # base URL and tries to download it, printing an error message for
    # dates that are unavailable
    """
    params:
        base_url - str of the format "https://github.com/{}.csv", where
                   the {} will be filled in with each date
        start_date - str (mm-dd-yyyy)
        end_date - str (mm-dd-yyyy)
    returns:
        final_df - pd.DataFrame
    """
    files = []
    dates = generate_dates(start_date, end_date)
    for date in dates:
        url = base_url.format(date)
        if check_if_url_is_valid(url):
            files.append(pd.read_csv(url))
        else:
            print(f"Could not download {date} data as it may be unavailable")
    final_df = pd.concat(files)
    print(f"\nDownloaded {len(files)} files!\n")
    return final_df
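As a side note, the hand-rolled calendar logic above can be replaced by the standard library's `datetime`, which already knows month lengths and leap years. A sketch with the same mm-dd-yyyy interface as `generate_dates` (this is an alternative, not the answer's original code):

```python
from datetime import date, timedelta

def generate_dates(start_date, end_date):
    # Same mm-dd-yyyy strings in and out as above, but datetime.date
    # handles month lengths and leap years for us
    sm, sd, sy = (int(x) for x in start_date.split("-"))
    em, ed, ey = (int(x) for x in end_date.split("-"))
    cur, end = date(sy, sm, sd), date(ey, em, ed)
    dates = []
    while cur <= end:
        # Unpadded, matching the original f"{month}-{day}-{year}" format
        dates.append(f"{cur.month}-{cur.day}-{cur.year}")
        cur += timedelta(days=1)
    return dates
```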