简体   繁体   中英

How to read table from url as DataFrame and modify format of data in one column in Python Pandas?

I have a link to the website with table like the follow: https://www.timeanddate.com/holidays/kenya/2022

How can I:

  1. read this table as DataFrame in Jupyter Notebook in Python?
  2. Convert column "Date" so as to have date format like "01.01.2022" not as exists on website "1 sty"
  3. how to create column "Day" where will be value like: sobota, niedziela and so on which currently are between columns "Date" and "Name"?

So, as a result I need something like below:

Date Day Name Type
01.01.2022 sobota New Year's Day Public holiday
20.03.2022 niedziela March Equinox Season
... ... ... ...

How can I do that in Python Pandas?

You can do this thanks to beautifulsoup library... If you click right in the web page with google chrome, you can see the structure of the web page, it's well structured, and easy to extract data between html tags. Also, if you want to extract data of all years, just loop on the web url.

https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...

To read the table on the website as a DataFrame in Jupyter Notebook, you can directly use the pandas library. You can try something similar to this:

from datetime import datetime as dt
import pandas as pd

# Year
year = "2022"

# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]

# Drop NaN
df = df.dropna()

# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]

# Display the DataFrame
df

read this table as DataFrame

You can probably just use pandas.read_html directly.

# import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]

and to clean up a bit by resetting the column headers and getting rid of empty rows:

khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')

Convert column "Date" so as to have date format like "01.01.2022"

You can parse the date with dateutil.parser and then format it with .strftime .

# from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]

how to create column "Day" where will be value like: sobota, niedziela and so on

As it is so far, we already have a Day column with Monday/Tuesday/etc., but if you want them in Polish, you could use a translation dictionary [like daysDict below].

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]

If you want to translate everything [except for Date ], you could use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)

# !pip install googletrans==3.1.0a0
# from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]


[because you commented about] "sample of code with loop"

Since the page links have a consistent format, you can loop through various countries and years.

First, import the necessary libraries and define the translation dictionary along with a function that tries to parse and format the date (but returns a null value ( None ) if it fails):

import pandas
from dateutil.parser import parse as duParse

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}

def try_dup(dStr, yr):
    try: return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y') 
    except: return None

then, set the start and end years as well as a list of countries:

startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']

now, we're ready to loop though the countries and years to collect data:

dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try: 
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]

            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name

            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e: 
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')

After looping, all the DataFrames can be combined into one before translating the days:

acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns

A sample of acydf [printed with print(acydf.loc[::66].to_markdown(index=False)) ]:

| Country   | Date       | Day          | Name                                          | Type                        |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya     | 01.01.2012 | Niedziela    | New Year's Day                                | Public holiday              |
| Kenya     | 19.07.2015 | Niedziela    | Eid al-Fitr                                   | Public holiday              |
| Kenya     | 10.10.2018 | Środa        | Moi Day                                       | Public holiday              |
| Kenya     | 10.10.2021 | Niedziela    | Huduma Day                                    | Public holiday              |
| Kenya     | 26.12.2023 | Wtorek       | Boxing Day                                    | Public holiday              |
| Kenya     | 01.01.2027 | Piątek       | New Year's Day                                | Public holiday              |
| Kenya     | 14.04.2030 | Niedziela    | Eid al-Adha (Tentative Date)                  | Optional Holiday            |
| Tonga     | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday              |
| Tonga     | 25.04.2016 | Poniedziałek | ANZAC Day                                     | Public Holiday              |
| Tonga     | 04.12.2019 | Środa        | Anniversary of the Coronation of King Tupou I | Public Holiday              |
| Tonga     | 04.06.2023 | Niedziela    | Emancipation Day                              | Public Holiday              |
| Tonga     | 01.01.2027 | Piątek       | New Year's Day                                | Public Holiday              |
| Tonga     | 04.11.2030 | Poniedziałek | Constitution Day                              | Public Holiday              |
| Belgium   | 06.12.2011 | Wtorek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2013 | Piątek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2015 | Niedziela    | St. Nicholas Day                              | Observance                  |
| Belgium   | 15.11.2017 | Środa        | Day of the German-speaking Community          | Regional government holiday |
| Belgium   | 01.11.2019 | Piątek       | All Saints' Day                               | National holiday            |
| Belgium   | 31.10.2021 | Niedziela    | Halloween                                     | Observance                  |
| Belgium   | 23.09.2023 | Sobota       | September Equinox                             | Season                      |
| Belgium   | 15.08.2025 | Piątek       | Assumption of Mary                            | National holiday            |
| Belgium   | 11.07.2027 | Niedziela    | Day of the Flemish Community                  | Regional government holiday |
| Belgium   | 10.06.2029 | Niedziela    | Father's Day                                  | Observance                  |

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM