How to read table from url as DataFrame and modify format of data in one column in Python Pandas?
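I have a link to a website with a table like this: https://www.timeanddate.com/holidays/kenya/2022
How can I:
- read this table as a DataFrame in a Jupyter Notebook in Python,
- convert the "Date" column to a date format like "01.01.2022",
- create a "Day" column with values like sobota, niedziela, and so on?
So, as a result, I need something like this: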
Date | Day | Name | Type |
---|---|---|---|
01.01.2022 | sobota | New Year's Day | Public holiday |
20.03.2022 | niedziela | March Equinox | Season |
... | ... | ... | ... |
How can I do that in Python Pandas?
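Thanks to the beautifulsoup library, you can do this. If you right-click on the web page in Google Chrome, you can view the page's structure; it is well structured, so it is easy to extract the data between the HTML tags. Also, if you want to extract the data for all years, just loop over the URLs:
https://www.timeanddate.com/holidays/kenya/2022
https://www.timeanddate.com/holidays/kenya/2021
...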
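As a rough sketch of that URL loop, the per-year links can be generated rather than typed out. The helper name `holiday_urls` is made up for illustration, and fetching or parsing the pages is deliberately left out:

```python
def holiday_urls(country, start_year, end_year):
    """Return one holiday-page URL per year, both ends inclusive."""
    base = "https://www.timeanddate.com/holidays"
    return [f"{base}/{country}/{year}" for year in range(start_year, end_year + 1)]


urls = holiday_urls("kenya", 2021, 2022)
for url in urls:
    print(url)
```

Each URL could then be passed to whatever scraping code you prefer.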
To read the table on the website as a DataFrame in a Jupyter Notebook, you can use the pandas library directly. You could try something similar to this:
from datetime import datetime as dt
import pandas as pd
# Year
year = "2022"
# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]
# Drop NaN
df = df.dropna()
# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]
# Display the DataFrame
df
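The snippet above leaves datetime objects in the "Date" column. To get the "01.01.2022" style asked for in the question, one more formatting step is needed. Here is a minimal standard-library sketch, assuming the site shows dates as labels like "Jan 1" (the helper name `to_dotted_date` is hypothetical, and `%b` expects English month abbreviations under the default locale):

```python
from datetime import datetime


def to_dotted_date(label, year):
    """Turn a label such as 'Jan 1' plus a year into 'dd.mm.yyyy'."""
    parsed = datetime.strptime(f"{label} {year}", "%b %d %Y")
    return parsed.strftime("%d.%m.%Y")


print(to_dotted_date("Jan 1", 2022))
print(to_dotted_date("Mar 20", 2022))
```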
Read this table as a DataFrame
You can probably just use pandas.read_html directly.
# import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]
and clean it up a bit by resetting the column headers and dropping the empty rows:
khdf = khdf.set_axis(
['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
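To see what that clean-up does without hitting the website, here is a small sketch on a hand-made frame. The `raw` data is only a stand-in for what `read_html` might return (an unnamed header cell and a fully empty row are assumptions, not the real page):

```python
import pandas as pd

# Hypothetical stand-in for the raw read_html result.
raw = pd.DataFrame(
    [
        ["Jan 1", "Saturday", "New Year's Day", "Public holiday"],
        [None, None, None, None],  # empty separator row
        ["Mar 20", "Sunday", "March Equinox", "Season"],
    ],
    columns=["Date", "Unnamed: 1", "Name", "Type"],
)

# Rename all four columns, then drop rows where every cell is missing.
clean = raw.set_axis(["Date", "Day", "Name", "Type"], axis="columns").dropna(
    axis="index", how="all"
)
print(clean)
```

Using `how="all"` keeps rows that are only partly empty, which is usually what you want for a scraped table.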
Convert the "Date" column to a date format like "01.01.2022"
You can parse the dates with dateutil.parser and then format them with .strftime.
# from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
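If you would rather stay inside pandas, a vectorized alternative (not what the code above does) is `pandas.to_datetime` with an explicit format. This is only a sketch on sample values, assuming the labels look like "Jan 1":

```python
import pandas as pd

year = 2022
dates = pd.Series(["Jan 1", "Mar 20"])  # sample labels, not scraped data

# Append the year, parse with a fixed format, then render as dd.mm.yyyy.
formatted = pd.to_datetime(dates + f" {year}", format="%b %d %Y").dt.strftime("%d.%m.%Y")
print(formatted.tolist())
```

Unlike dateutil, a fixed format raises an error on anything that does not match, so stray note rows would have to be dropped first.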
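How to create a "Day" column with values like: sobota, niedziela, etc.
So far, we already have a Day column containing Monday/Tuesday/etc., but if you want them in Polish, you can use a translation dictionary [like daysDict below].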
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
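If the Day column were missing or unreliable, the Polish name could also be derived from the formatted date itself. A minimal standard-library sketch, with the made-up helper name `polish_weekday`:

```python
from datetime import datetime

# Index matches datetime.weekday(): Monday is 0, Sunday is 6.
POLISH_DAYS = ["Poniedziałek", "Wtorek", "Środa", "Czwartek", "Piątek", "Sobota", "Niedziela"]


def polish_weekday(dotted_date):
    """Return the Polish weekday name for a 'dd.mm.yyyy' string."""
    return POLISH_DAYS[datetime.strptime(dotted_date, "%d.%m.%Y").weekday()]


print(polish_weekday("01.01.2022"))
print(polish_weekday("20.03.2022"))
```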
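If you want to translate everything [except Date], you can use the googletrans module. (I think the version that gets installed by default has some issues, but 3.1.0a0 works for me.)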
# !pip install googletrans==3.1.0a0
# from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]
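Since columns like Type repeat the same few values, one possible refinement is to translate each distinct value only once. The sketch below is independent of googletrans: `translate_unique` is a hypothetical helper, and `fake_translate` merely stands in for something like `lambda s: translator.translate(s, src='en', dest='pl').text`:

```python
def translate_unique(values, translate_fn):
    """Translate each distinct value once and reuse the cached result."""
    cache = {}
    result = []
    for value in values:
        if value not in cache:
            cache[value] = translate_fn(value)
        result.append(cache[value])
    return result


calls = []  # records which strings actually reached the translator


def fake_translate(text):
    # Stand-in translator for the demo; it just upper-cases the text.
    calls.append(text)
    return text.upper()


types = ["Public holiday", "Season", "Public holiday"]
translated = translate_unique(types, fake_translate)
print(translated)
print(len(calls), "translation calls for", len(types), "values")
```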
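[Since you commented] "code example with loop"
Since the page links have a consistent format, you can loop over different countries and years.
First, import the necessary libraries and define the translation dictionary, along with a function that tries to parse and format a date (but returns a null value (None) if it fails):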
import pandas
from dateutil.parser import parse as duParse
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
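def try_dup(dStr, yr):
    # Parse e.g. '2022 Jan 1' and format it as '01.01.2022'; return None if it cannot be parsed
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except (ValueError, OverflowError):
        return None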
Then, set the start and end years and the list of countries:
startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']
Now we are ready to loop over the countries and years to collect the data:
dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try:
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]
            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name
            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e:
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')
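The control flow of that loop (try every country/year pair, keep going when one page fails) can be checked offline with a stand-in fetcher. Everything here is illustrative: `collect` and `fake_fetch` are hypothetical names, and `fake_fetch` takes the place of `pandas.read_html` plus the clean-up:

```python
def collect(countries, years, fetch):
    """Call fetch(country, year) for every pair; keep successes, record failures."""
    results, failures = [], []
    for country in countries:
        for year in years:
            try:
                results.append(fetch(country, year))
            except Exception as exc:  # one bad page should not stop the run
                failures.append((country, year, repr(exc)))
    return results, failures


def fake_fetch(country, year):
    # Stand-in for the real download; fails for one pair on purpose.
    if (country, year) == ("tonga", 2011):
        raise ValueError("No tables found")
    return {"country": country, "year": year}


results, failures = collect(["kenya", "tonga"], range(2010, 2012), fake_fetch)
print(len(results), "pages collected,", len(failures), "failed")
```

Collecting the failures in a list, rather than only printing them, makes it easy to retry just those pages later.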
After the loop, all the DataFrames can be combined into one before translating the day names:
acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
A sample of acydf [using print(acydf.loc[::66].to_markdown(index=False))]:
| Country | Date | Day | Name | Type |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya | 01.01.2012 | Niedziela | New Year's Day | Public holiday |
| Kenya | 19.07.2015 | Niedziela | Eid al-Fitr | Public holiday |
| Kenya | 10.10.2018 | Środa | Moi Day | Public holiday |
| Kenya | 10.10.2021 | Niedziela | Huduma Day | Public holiday |
| Kenya | 26.12.2023 | Wtorek | Boxing Day | Public holiday |
| Kenya | 01.01.2027 | Piątek | New Year's Day | Public holiday |
| Kenya | 14.04.2030 | Niedziela | Eid al-Adha (Tentative Date) | Optional Holiday |
| Tonga | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday |
| Tonga | 25.04.2016 | Poniedziałek | ANZAC Day | Public Holiday |
| Tonga | 04.12.2019 | Środa | Anniversary of the Coronation of King Tupou I | Public Holiday |
| Tonga | 04.06.2023 | Niedziela | Emancipation Day | Public Holiday |
| Tonga | 01.01.2027 | Piątek | New Year's Day | Public Holiday |
| Tonga | 04.11.2030 | Poniedziałek | Constitution Day | Public Holiday |
| Belgium | 06.12.2011 | Wtorek | St. Nicholas Day | Observance |
| Belgium | 06.12.2013 | Piątek | St. Nicholas Day | Observance |
| Belgium | 06.12.2015 | Niedziela | St. Nicholas Day | Observance |
| Belgium | 15.11.2017 | Środa | Day of the German-speaking Community | Regional government holiday |
| Belgium | 01.11.2019 | Piątek | All Saints' Day | National holiday |
| Belgium | 31.10.2021 | Niedziela | Halloween | Observance |
| Belgium | 23.09.2023 | Sobota | September Equinox | Season |
| Belgium | 15.08.2025 | Piątek | Assumption of Mary | National holiday |
| Belgium | 11.07.2027 | Niedziela | Day of the Flemish Community | Regional government holiday |
| Belgium | 10.06.2029 | Niedziela | Father's Day | Observance |
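The combining step can also be tried on two tiny hand-made frames. This sketch uses `Series.map` with a fallback instead of the list comprehension above; the rows are illustrative samples, and `days_pl` is just a local name for the same dictionary:

```python
import pandas as pd

days_pl = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa',
           'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota',
           'Sunday': 'Niedziela'}

# Two one-row frames standing in for the per-country, per-year results.
kenya = pd.DataFrame({'Date': ['01.01.2022'], 'Day': ['Saturday'],
                      'Name': ["New Year's Day"], 'Type': ['Public holiday'],
                      'Country': ['Kenya']})
tonga = pd.DataFrame({'Date': ['01.01.2027'], 'Day': ['Friday'],
                      'Name': ["New Year's Day"], 'Type': ['Public Holiday'],
                      'Country': ['Tonga']})

combined = pd.concat([kenya, tonga], ignore_index=True)
# Translate known day names; anything not in the dictionary is left unchanged.
combined['Day'] = combined['Day'].map(days_pl).fillna(combined['Day'])
combined = combined[['Country', 'Date', 'Day', 'Name', 'Type']]
print(combined)
```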
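Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.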