How to read table from url as DataFrame and modify format of data in one column in Python Pandas?
I have a link to a website with a table like the one below: https://www.timeanddate.com/holidays/kenya/2022
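How can I read this table as a DataFrame, convert the "Date" column so it has a date format like "01.01.2022", and create a "Day" column with values like sobota, niedziela, etc.?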
So, as a result I need something like below:
Date | Day | Name | Type |
---|---|---|---|
01.01.2022 | sobota | New Year's Day | Public holiday |
20.03.2022 | niedziela | March Equinox | Season |
... | ... | ... | ... |
How can I do that in Python Pandas?
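You can do this with the beautifulsoup library. If you right-click the web page in Google Chrome, you can inspect its structure; the page is well structured, so it is easy to extract the data between the HTML tags. Also, if you want to extract the data for all years, just loop over the web URLs.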
https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...
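As a rough illustration of the BeautifulSoup approach, here is a minimal sketch that pulls the cell text out of a table. The HTML string is a made-up, simplified stand-in (the real markup of the page may well differ), and `SAMPLE_HTML` and `table_rows` are hypothetical names used only for this example.

```python
from bs4 import BeautifulSoup

# Made-up, simplified HTML standing in for the real page; the actual
# markup on timeanddate.com may differ (assumption).
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Day</th><th>Name</th><th>Type</th></tr>
  <tr><th>Jan 1</th><td>Saturday</td><td>New Year's Day</td><td>Public holiday</td></tr>
  <tr><th>Mar 20</th><td>Sunday</td><td>March Equinox</td><td>Season</td></tr>
</table>
"""

def table_rows(html):
    """Return the text of every cell, one list per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows that contain no cells at all
            rows.append(cells)
    return rows

rows = table_rows(SAMPLE_HTML)
header, data = rows[0], rows[1:]
print(header)
print(data)
```

For the live page you would first download the HTML (for example with `requests`) and would probably need to select the right table, so treat this only as a starting point.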
To read a table from a website as a DataFrame in Jupyter Notebook, you can use the pandas library directly. You could try something similar to this:
from datetime import datetime as dt
import pandas as pd
# Year
year = "2022"
# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]
# Drop NaN
df = df.dropna()
# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]
# Display the DataFrame
df
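If the column headers are flat (a single level), the same conversion can probably be done without a Python-level loop. The sketch below runs on hand-made sample rows rather than the live page; the "Jan 1" style of the raw values is an assumption taken from the `"%b %d %Y"` format string above.

```python
import pandas as pd

year = "2022"
# Hand-made sample rows; the "Jan 1" style is an assumption taken from
# the "%b %d %Y" format string used in the answer.
df = pd.DataFrame({
    "Date": ["Jan 1", "Mar 20"],
    "Name": ["New Year's Day", "March Equinox"],
})

# Parse "Jan 1 2022" into datetimes, then render them as DD.MM.YYYY strings
parsed = pd.to_datetime(df["Date"] + " " + year, format="%b %d %Y")
df["Date"] = parsed.dt.strftime("%d.%m.%Y")
print(df)
```

Note that after `strftime` the column holds plain strings; keep `parsed` around if you still need real datetime values.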
Read this table as a DataFrame
You can probably just use `pandas.read_html` directly.
import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]
and clean it up a bit by resetting the column headers and removing empty rows:
khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
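To see what this clean-up step does without hitting the website, here is a small sketch on a hand-built frame. The two-level headers and the all-NaN spacer row are assumptions about what `read_html` might return for the page, and `raw` and `clean` are names made up for this example.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for what read_html might return: two-level headers
# and an all-NaN spacer row (both are assumptions about the real page).
raw = pd.DataFrame(
    [
        ["Jan 1", "Saturday", "New Year's Day", "Public holiday"],
        [np.nan, np.nan, np.nan, np.nan],
        ["Mar 20", "Sunday", "March Equinox", "Season"],
    ],
    columns=pd.MultiIndex.from_tuples([
        ("Date", "Date"),
        ("Unnamed: 1", "Unnamed: 1"),
        ("Name", "Name"),
        ("Type", "Type"),
    ]),
)

# Replace the headers with four flat names, then drop rows that are entirely empty
clean = raw.set_axis(
    ["Date", "Day", "Name", "Type"], axis="columns"
).dropna(axis="rows", how="all")
print(clean)
```

Because `how='all'` is used, a row is only dropped when every cell in it is missing; rows with a few gaps are kept.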
Convert the "Date" column so it has a date format like "01.01.2022"
You can parse the dates with `dateutil.parser` and then format them with `.strftime`.
from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
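One reason to prefer `dateutil` here is that it does not need an explicit format string, so it should cope with either "Jan 1" or "1 Jan" style values. A minimal sketch, where `format_date` is a hypothetical helper and the input strings are made up:

```python
from dateutil.parser import parse as duParse

def format_date(day_month, year):
    """Prefix the year, let dateutil guess the layout, return DD.MM.YYYY."""
    return duParse(f"{year} {day_month}").strftime("%d.%m.%Y")

# Made-up inputs in a few different layouts
for text in ["Jan 1", "1 Jan", "Mar 20", "20 March"]:
    print(text, "->", format_date(text, 2022))
```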
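How to create a "Day" column with values like: sobota, niedziela, etc.
So far we already have a `Day` column containing Monday/Tuesday etc., but if you want them in Polish, you can use a translation dictionary [like `daysDict` below].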
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
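As an alternative to translating the scraped `Day` text, the weekday could also be derived from the formatted date itself. This is only a sketch: `POLISH_DAYS` and `polish_weekday` are hypothetical names, and it assumes the dates are already "DD.MM.YYYY" strings.

```python
from datetime import datetime

# Index 0 is Monday, matching datetime.weekday()
POLISH_DAYS = ["Poniedziałek", "Wtorek", "Środa", "Czwartek",
               "Piątek", "Sobota", "Niedziela"]

def polish_weekday(date_str):
    """Return the Polish weekday name for a 'DD.MM.YYYY' string."""
    return POLISH_DAYS[datetime.strptime(date_str, "%d.%m.%Y").weekday()]

print(polish_weekday("01.01.2022"))  # Sobota
print(polish_weekday("20.03.2022"))  # Niedziela
```

This avoids depending on how the site spells the weekday, at the cost of requiring the date to parse cleanly.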
If you want to translate everything [except `Date`], you can use the `googletrans` module. (I think the version installed by default has some issues, but `3.1.0a0` works for me.)
# !pip install googletrans==3.1.0a0
from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]
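Since values like "Public holiday" repeat many times, it may be worth translating each distinct value only once. The sketch below shows that pattern with a stand-in translation function so it runs offline; `translate_column` and `fake_translate` are hypothetical, and in practice you would pass something like `lambda t: translator.translate(t, src='en', dest='pl').text` instead.

```python
import pandas as pd

def translate_column(series, translate_fn):
    """Translate each distinct value once, then map the results back."""
    lookup = {value: translate_fn(value) for value in series.dropna().unique()}
    return series.map(lookup)

calls = []  # records how often the (fake) translator is invoked

def fake_translate(text):
    """Stand-in for a real translator call; just upper-cases the text."""
    calls.append(text)
    return text.upper()

types = pd.Series(["Public holiday", "Season", "Public holiday", "Public holiday"])
translated = translate_column(types, fake_translate)
print(translated.tolist())
print(len(calls), "translator calls for", len(types), "rows")
```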
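[Since you commented] "a code example with a loop"
Since the page links have a consistent format, you can loop over different countries and years.
First, import the necessary libraries and define the translation dictionary as well as a function that tries to parse and format a date (but returns a null value (`None`) if it fails):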
import pandas
from dateutil.parser import parse as duParse
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
def try_dup(dStr, yr):
    # parse "<year> <date text>" and format it as DD.MM.YYYY; None if parsing fails
    try: return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception: return None
Then, set the start and end years and the list of countries:
startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']
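To preview which pages the loop is going to request, the URLs can be generated up front. A small sketch with a shortened range and country list (the values here are only for illustration):

```python
from itertools import product

# Shortened values, for illustration only
startYear, endYear = 2010, 2012
countryList = ['kenya', 'tonga']

# One URL per (country, year) pair, countries in the outer position
urls = [
    f'https://www.timeanddate.com/holidays/{country}/{y}'
    for country, y in product(countryList, range(startYear, endYear + 1))
]
for u in urls:
    print(u)
```

With the full settings (3 countries, 2010 to 2030) this comes to 63 requests, so it may be considerate to pause briefly between them.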
Now we're ready to loop through the countries and years to collect the data:
dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try:
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]
            cydf = cydf.drop(  # only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name
            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e:
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')
After the loop, all the DataFrames can be combined into one before translating the days:
acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
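The `ignore_index=True` argument matters because each scraped frame carries its own row labels. A tiny sketch with two hand-made one-row frames (the names `a`, `b`, `kept` and `renumbered` are made up for this example) shows the difference:

```python
import pandas as pd

# Two hand-made one-row frames standing in for per-country results
a = pd.DataFrame({"Date": ["01.01.2022"], "Name": ["New Year's Day"]})
b = pd.DataFrame({"Date": ["04.11.2022"], "Name": ["Constitution Day"]})

kept = pd.concat([a, b])                           # original labels are kept
renumbered = pd.concat([a, b], ignore_index=True)  # labels are rebuilt from 0

print(kept.index.tolist())        # [0, 0]
print(renumbered.index.tolist())  # [0, 1]
```

With duplicate labels, label-based selections such as `.loc[0]` would return several rows, so rebuilding the index keeps later slicing predictable.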
A sample of `acydf` [printed with `print(acydf.loc[::66].to_markdown(index=False))`]:
| Country | Date | Day | Name | Type |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya | 01.01.2012 | Niedziela | New Year's Day | Public holiday |
| Kenya | 19.07.2015 | Niedziela | Eid al-Fitr | Public holiday |
| Kenya | 10.10.2018 | Środa | Moi Day | Public holiday |
| Kenya | 10.10.2021 | Niedziela | Huduma Day | Public holiday |
| Kenya | 26.12.2023 | Wtorek | Boxing Day | Public holiday |
| Kenya | 01.01.2027 | Piątek | New Year's Day | Public holiday |
| Kenya | 14.04.2030 | Niedziela | Eid al-Adha (Tentative Date) | Optional Holiday |
| Tonga | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday |
| Tonga | 25.04.2016 | Poniedziałek | ANZAC Day | Public Holiday |
| Tonga | 04.12.2019 | Środa | Anniversary of the Coronation of King Tupou I | Public Holiday |
| Tonga | 04.06.2023 | Niedziela | Emancipation Day | Public Holiday |
| Tonga | 01.01.2027 | Piątek | New Year's Day | Public Holiday |
| Tonga | 04.11.2030 | Poniedziałek | Constitution Day | Public Holiday |
| Belgium | 06.12.2011 | Wtorek | St. Nicholas Day | Observance |
| Belgium | 06.12.2013 | Piątek | St. Nicholas Day | Observance |
| Belgium | 06.12.2015 | Niedziela | St. Nicholas Day | Observance |
| Belgium | 15.11.2017 | Środa | Day of the German-speaking Community | Regional government holiday |
| Belgium | 01.11.2019 | Piątek | All Saints' Day | National holiday |
| Belgium | 31.10.2021 | Niedziela | Halloween | Observance |
| Belgium | 23.09.2023 | Sobota | September Equinox | Season |
| Belgium | 15.08.2025 | Piątek | Assumption of Mary | National holiday |
| Belgium | 11.07.2027 | Niedziela | Day of the Flemish Community | Regional government holiday |
| Belgium | 10.06.2029 | Niedziela | Father's Day | Observance |