I have a link to the website with table like the follow: https://www.timeanddate.com/holidays/kenya/2022
How can I:
So, as a result I need something like below:
Date | Day | Name | Type |
---|---|---|---|
01.01.2022 | sobota | New Year's Day | Public holiday |
20.03.2022 | niedziela | March Equinox | Season |
... | ... | ... | ... |
How can I do that in Python Pandas?
You can do this thanks to beautifulsoup library... If you click right in the web page with google chrome, you can see the structure of the web page, it's well structured, and easy to extract data between html tags. Also, if you want to extract data of all years, just loop on the web url.
https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...
To read the table on the website as a DataFrame in Jupyter Notebook, you can directly use the pandas library. You can try something similar to this:
from datetime import datetime as dt
import pandas as pd
# Year
year = "2022"
# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]
# Drop NaN
df = df.dropna()
# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]
# Display the DataFrame
df
read this table as DataFrame
You can probably just use pandas.read_html
directly.
# import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]
and to clean up a bit by resetting the column headers and getting rid of empty rows:
khdf = khdf.set_axis(
['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
Convert column "Date" so as to have date format like "01.01.2022"
You can parse the date with dateutil.parser
and then format it with .strftime
.
# from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
how to create column "Day" where will be value like: sobota, niedziela and so on
As it is so far, we already have a Day
column with Monday/Tuesday/etc., but if you want them in Polish, you could use a translation dictionary [like daysDict
below].
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
If you want to translate everything [except for Date
], you could use the googletrans
module. (I think the version installed by default has some issues, but 3.1.0a0
works for me.)
# !pip install googletrans==3.1.0a0
# from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]
[because you commented about] "sample of code with loop"
Since the page links have a consistent format, you can loop through various countries and years.
First, import the necessary libraries and define the translation dictionary along with a function that tries to parse and format the date (but returns a null value ( None
) if it fails):
import pandas
from dateutil.parser import parse as duParse
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
def try_dup(dStr, yr):
try: return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
except: return None
then, set the start and end years as well as a list of countries:
startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']
now, we're ready to loop though the countries and years to collect data:
dfList = []
for country in countryList:
for y in range(startYear, endYear+1):
try:
cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
cydf = pandas.read_html(cyUrl)[0]
cydf = cydf.drop(# only the first 4 columns are kept
[c for c in cydf.columns[4:]], axis='columns'
).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
cydf['Country'] = country.capitalize() # add+fill a column with country name
dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
# print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
# except: continue ## skip without printing error
except Exception as e:
print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')
After looping, all the DataFrames can be combined into one before translating the days:
acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
A sample of acydf
[printed with print(acydf.loc[::66].to_markdown(index=False))
]:
| Country | Date | Day | Name | Type |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya | 01.01.2012 | Niedziela | New Year's Day | Public holiday |
| Kenya | 19.07.2015 | Niedziela | Eid al-Fitr | Public holiday |
| Kenya | 10.10.2018 | Środa | Moi Day | Public holiday |
| Kenya | 10.10.2021 | Niedziela | Huduma Day | Public holiday |
| Kenya | 26.12.2023 | Wtorek | Boxing Day | Public holiday |
| Kenya | 01.01.2027 | Piątek | New Year's Day | Public holiday |
| Kenya | 14.04.2030 | Niedziela | Eid al-Adha (Tentative Date) | Optional Holiday |
| Tonga | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday |
| Tonga | 25.04.2016 | Poniedziałek | ANZAC Day | Public Holiday |
| Tonga | 04.12.2019 | Środa | Anniversary of the Coronation of King Tupou I | Public Holiday |
| Tonga | 04.06.2023 | Niedziela | Emancipation Day | Public Holiday |
| Tonga | 01.01.2027 | Piątek | New Year's Day | Public Holiday |
| Tonga | 04.11.2030 | Poniedziałek | Constitution Day | Public Holiday |
| Belgium | 06.12.2011 | Wtorek | St. Nicholas Day | Observance |
| Belgium | 06.12.2013 | Piątek | St. Nicholas Day | Observance |
| Belgium | 06.12.2015 | Niedziela | St. Nicholas Day | Observance |
| Belgium | 15.11.2017 | Środa | Day of the German-speaking Community | Regional government holiday |
| Belgium | 01.11.2019 | Piątek | All Saints' Day | National holiday |
| Belgium | 31.10.2021 | Niedziela | Halloween | Observance |
| Belgium | 23.09.2023 | Sobota | September Equinox | Season |
| Belgium | 15.08.2025 | Piątek | Assumption of Mary | National holiday |
| Belgium | 11.07.2027 | Niedziela | Day of the Flemish Community | Regional government holiday |
| Belgium | 10.06.2029 | Niedziela | Father's Day | Observance |
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.