
How to read table from url as DataFrame and modify format of data in one column in Python Pandas?

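I have a link to a website with a table like this: https://www.timeanddate.com/holidays/kenya/2022

How can I:

  1. Read this table as a DataFrame in Python, in a Jupyter Notebook?
  2. Convert the "Date" column so that its dates look like "01.01.2022" instead of the format used on the website, "1 sty"?
  3. Create a "Day" column holding the values such as sobota, niedziela that currently sit between the "Date" and "Name" columns?

As a result, I need something like this:

| Date       | Day       | Name           | Type           |
|:-----------|:----------|:---------------|:---------------|
| 01.01.2022 | sobota    | New Year's Day | Public holiday |
| 20.03.2022 | niedziela | March Equinox  | Season         |
| ...        | ...       | ...            | ...            |

How can I do that in Python Pandas?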

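You can do this with the beautifulsoup library. If you right-click the web page in Google Chrome and inspect it, you can see the page's structure; it is well structured, so it is easy to extract the data between the HTML tags. Also, if you want to extract the data for all years, just loop over the web URL.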

https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...
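Looping over the URL could look roughly like the sketch below. The helper name `holiday_urls` and the constant `BASE_URL` are made up for illustration; the only thing taken from the page is the `/holidays/<country>/<year>` pattern visible in the links above.

```python
# Hypothetical helper (not part of the original answer): one URL per year.
BASE_URL = "https://www.timeanddate.com/holidays"

def holiday_urls(country, start_year, end_year):
    """Return the holiday-page URLs for every year from start_year to end_year inclusive."""
    return [f"{BASE_URL}/{country}/{year}" for year in range(start_year, end_year + 1)]

urls = holiday_urls("kenya", 2021, 2022)
for url in urls:
    print(url)
```

Each URL can then be passed to whichever scraping approach you prefer.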

To read the table on the website as a DataFrame in a Jupyter Notebook, you can use the pandas library directly. You can try something similar to this:

from datetime import datetime as dt
import pandas as pd

# Year
year = "2022"

# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]

# Drop NaN
df = df.dropna()

# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]

# Display the DataFrame
df
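The snippet above stops at datetime objects. To get strings like "01.01.2022", one option is `pd.to_datetime` followed by `.dt.strftime`. The sketch below runs on toy data rather than the live page, and it assumes the dates arrive in "day month" order with English month abbreviations (for example "1 Jan"); if your copy of the table shows "Jan 1" instead, the format string would need to be `"%b %d %Y"`.

```python
import pandas as pd

year = "2022"

# Toy values standing in for the scraped "Date" column (assumed "day month" order)
dates = pd.Series(["1 Jan", "20 Mar", "25 Dec"])

# Append the year, parse with an explicit format, then render as dd.mm.yyyy
parsed = pd.to_datetime(dates + " " + year, format="%d %b %Y")
formatted = parsed.dt.strftime("%d.%m.%Y")

print(formatted.tolist())
```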

Read this table as a DataFrame

You can probably just use pandas.read_html

# import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]

and clean it up a bit by resetting the column headers and dropping empty rows:

khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
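To see what that cleanup does without hitting the website, here is a small sketch on a hand-made frame. The messy header `'Unnamed: 1'` and the blank separator row are assumptions about what `read_html` might return, not a copy of the real output.

```python
import pandas as pd

# Toy frame imitating a scraped table: one unnamed header and one all-empty row
raw = pd.DataFrame(
    [
        ['1 Jan', 'Saturday', "New Year's Day", 'Public holiday'],
        [None, None, None, None],
        ['20 Mar', 'Sunday', 'March Equinox', 'Season'],
    ],
    columns=['Date', 'Unnamed: 1', 'Name', 'Type'],
)

# Rename all four columns, then drop only the rows where every cell is missing
clean = raw.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')

print(clean)
```

Because `how='all'` is used, a row with only some missing cells would be kept.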

Convert the "Date" column so it has a date format like "01.01.2022"

You can parse the dates with dateutil.parser and then format them with .strftime.

# from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]

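How to create a "Day" column with values such as sobota, niedziela, etc.

At this point we already have a Day column containing Monday/Tuesday/etc., but if you want them in Polish, you can use a translation dictionary [like daysDict below].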

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
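If your table had no Day column at all, the day name could instead be derived from the date and placed between "Date" and "Name" with `DataFrame.insert`. This is a sketch on toy data; the list `POLISH_DAYS` and the frame `df` are illustrative names, not part of the answer's code.

```python
import pandas as pd

# Polish day names indexed by pandas' dayofweek (Monday=0 ... Sunday=6)
POLISH_DAYS = ['Poniedziałek', 'Wtorek', 'Środa', 'Czwartek',
               'Piątek', 'Sobota', 'Niedziela']

# Toy frame without a Day column (hypothetical data for illustration)
df = pd.DataFrame({
    'Date': ['01.01.2022', '20.03.2022'],
    'Name': ["New Year's Day", 'March Equinox'],
    'Type': ['Public holiday', 'Season'],
})

# Parse the dd.mm.yyyy strings and look up the weekday name for each row
weekday_index = pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.dayofweek
day_names = [POLISH_DAYS[i] for i in weekday_index]

# Position 1 places the new column between Date (0) and Name
df.insert(1, 'Day', day_names)
print(df)
```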

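If you want to translate everything [except Date], you can use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)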

# !pip install googletrans==3.1.0a0
# from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]
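Since columns like Type repeat the same few values many times, it may be worth translating each distinct value only once. The sketch below shows the idea with a stand-in translator so it runs offline; `translate_unique` and `fake_translate` are hypothetical names. With googletrans you would presumably pass something like `lambda s: translator.translate(s, src='en', dest='pl').text` instead.

```python
def translate_unique(values, translate):
    """Translate each distinct value once and map the results back in order."""
    cache = {}
    for value in values:
        if value not in cache:
            cache[value] = translate(value)
    return [cache[value] for value in values]

# Stand-in translator (upper-cases the text) that records every call it receives
calls = []

def fake_translate(text):
    calls.append(text)
    return text.upper()

types = ['Public holiday', 'Season', 'Public holiday', 'Season']
translated = translate_unique(types, fake_translate)

print(translated)
print(len(calls), 'translation calls for', len(types), 'values')
```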


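[Since you commented] "a code example with a loop"

Because the page links have a consistent format, you can loop through different countries and years.

First, import the necessary libraries and define the translation dictionary, along with a function that tries to parse and format a date (but returns a null value ( None ) if that fails):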

import pandas
from dateutil.parser import parse as duParse

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}

def try_dup(dStr, yr):
    # Parse the date string together with the year; return None if parsing fails
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception:
        return None

Then, set the start and end years and the list of countries:

startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']

Now we are ready to loop through the countries and years to collect the data:

dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try: 
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]

            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name

            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e: 
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')

After the loop, all the DataFrames can be combined into one before translating the day names:

acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns

Sample of acydf [using print(acydf.loc[::66].to_markdown(index=False)) ]:

| Country   | Date       | Day          | Name                                          | Type                        |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya     | 01.01.2012 | Niedziela    | New Year's Day                                | Public holiday              |
| Kenya     | 19.07.2015 | Niedziela    | Eid al-Fitr                                   | Public holiday              |
| Kenya     | 10.10.2018 | Środa        | Moi Day                                       | Public holiday              |
| Kenya     | 10.10.2021 | Niedziela    | Huduma Day                                    | Public holiday              |
| Kenya     | 26.12.2023 | Wtorek       | Boxing Day                                    | Public holiday              |
| Kenya     | 01.01.2027 | Piątek       | New Year's Day                                | Public holiday              |
| Kenya     | 14.04.2030 | Niedziela    | Eid al-Adha (Tentative Date)                  | Optional Holiday            |
| Tonga     | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday              |
| Tonga     | 25.04.2016 | Poniedziałek | ANZAC Day                                     | Public Holiday              |
| Tonga     | 04.12.2019 | Środa        | Anniversary of the Coronation of King Tupou I | Public Holiday              |
| Tonga     | 04.06.2023 | Niedziela    | Emancipation Day                              | Public Holiday              |
| Tonga     | 01.01.2027 | Piątek       | New Year's Day                                | Public Holiday              |
| Tonga     | 04.11.2030 | Poniedziałek | Constitution Day                              | Public Holiday              |
| Belgium   | 06.12.2011 | Wtorek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2013 | Piątek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2015 | Niedziela    | St. Nicholas Day                              | Observance                  |
| Belgium   | 15.11.2017 | Środa        | Day of the German-speaking Community          | Regional government holiday |
| Belgium   | 01.11.2019 | Piątek       | All Saints' Day                               | National holiday            |
| Belgium   | 31.10.2021 | Niedziela    | Halloween                                     | Observance                  |
| Belgium   | 23.09.2023 | Sobota       | September Equinox                             | Season                      |
| Belgium   | 15.08.2025 | Piątek       | Assumption of Mary                            | National holiday            |
| Belgium   | 11.07.2027 | Niedziela    | Day of the Flemish Community                  | Regional government holiday |
| Belgium   | 10.06.2029 | Niedziela    | Father's Day                                  | Observance                  |
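The combine-and-reorder step can be tried offline on two tiny frames. This is only a sketch: the frames `kenya` and `tonga` are hand-made stand-ins for the per-country results collected in dfList, with rows borrowed from the dates shown in this thread.

```python
import pandas as pd

# Two hand-made frames standing in for per-country scrape results
kenya = pd.DataFrame({
    'Date': ['01.01.2022'], 'Day': ['Saturday'],
    'Name': ["New Year's Day"], 'Type': ['Public holiday'],
    'Country': ['Kenya'],
})
tonga = pd.DataFrame({
    'Date': ['25.04.2016'], 'Day': ['Monday'],
    'Name': ['ANZAC Day'], 'Type': ['Public Holiday'],
    'Country': ['Tonga'],
})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's index
combined = pd.concat([kenya, tonga], ignore_index=True)

# Selecting with a list of names both filters and reorders the columns
combined = combined[['Country', 'Date', 'Day', 'Name', 'Type']]
print(combined)
```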
