How to read a table from a URL as a DataFrame and modify the format of data in one column in Python Pandas?
I have a link to a website with a table like the following: https://www.timeanddate.com/holidays/kenya/2022
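How can I:
- read this table as a DataFrame
- convert the "Date" column so that it has a date format like "01.01.2022"
- create a "Day" column with values like: sobota, niedziela and so on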
So, as a result, I need something like the table below:
| Date | Day | Name | Type |
|:-----------|:----------|:---------------|:---------------|
| 01.01.2022 | sobota | New Year's Day | Public holiday |
| 20.03.2022 | niedziela | March Equinox | Season |
| ... | ... | ... | ... |
How can I do that in Python Pandas?
You can do this with the BeautifulSoup library. If you right-click the page in Google Chrome and inspect it, you can see that the page is well structured, so it is easy to extract the data between the HTML tags. Also, if you want to extract data for all years, just loop over the URLs:
https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...
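As a rough sketch of that year loop, the snippet below only builds the URLs and does not download anything. The names BASE_URL and year_urls are illustrative and are not part of the original answer.

```python
# Build one URL per year; the links above follow the pattern
# .../holidays/<country>/<year>.
BASE_URL = "https://www.timeanddate.com/holidays"  # taken from the links above


def year_urls(country, first_year, last_year):
    """Return the holiday-page URLs for every year in the inclusive range."""
    return [f"{BASE_URL}/{country}/{year}" for year in range(first_year, last_year + 1)]


urls = year_urls("kenya", 2021, 2022)
for url in urls:
    print(url)
```

Each URL can then be passed to whichever scraping routine you prefer.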
To read the table on the website as a DataFrame in Jupyter Notebook, you can use the pandas library directly. You can try something like this:
from datetime import datetime as dt
import pandas as pd
# Year
year = "2022"
# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]
# Drop NaN
df = df.dropna()
# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y") for i in range(0, len(df))]
# Display the DataFrame
df
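The code above leaves "Date" as datetime values. If the text format "01.01.2022" from the question is needed, one possible follow-up is to format the parsed dates with .dt.strftime. This is a minimal sketch on toy data: it assumes the scraped strings look like "Jan 1" and that English month abbreviations are recognised in your locale.

```python
import pandas as pd

year = "2022"
# Toy data standing in for the scraped table (assumed "Mon D" date strings).
df = pd.DataFrame(
    {"Date": ["Jan 1", "Mar 20"], "Name": ["New Year's Day", "March Equinox"]}
)

# Parse "Jan 1 2022" into a Timestamp, then render it as day.month.year text.
parsed = pd.to_datetime(df["Date"] + " " + year, format="%b %d %Y")
df["Date"] = parsed.dt.strftime("%d.%m.%Y")

print(df)
```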
read this table as DataFrame
You can probably just use pandas.read_html directly.
import pandas  # read_html also needs an HTML parser such as lxml installed
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]
Then clean up a bit by resetting the column headers and getting rid of empty rows:
khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
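To see what this clean-up does without hitting the website, here is a small sketch on a hand-made frame. The two-level headers and the empty separator row are assumptions about what the scraped table might look like, not a copy of the real page.

```python
import numpy as np
import pandas as pd

# Toy frame imitating a scraped table: two-level headers and one empty row.
raw = pd.DataFrame(
    [
        ["Jan 1", "Saturday", "New Year's Day", "Public holiday"],
        [np.nan, np.nan, np.nan, np.nan],
        ["Mar 20", "Sunday", "March Equinox", "Season"],
    ],
    columns=pd.MultiIndex.from_tuples(
        [("Date", "Date"), ("Unnamed: 1", "Day"), ("Name", "Name"), ("Type", "Type")]
    ),
)

# Replace the headers with flat names, then drop rows where every cell is NaN.
clean = raw.set_axis(["Date", "Day", "Name", "Type"], axis="columns").dropna(
    axis="rows", how="all"
)
print(clean)
```

Note that how='all' only removes rows that are entirely empty; a row with a missing date but a filled name would be kept.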
Convert column "Date" so as to have date format like "01.01.2022"
You can parse the date with dateutil.parser and then format it with .strftime.
from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
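One reason dateutil is convenient here is that it does not need a fixed format string. The sketch below uses made-up sample strings (not scraped data) in both "month day" and "day month" order:

```python
from dateutil.parser import parse as duParse

y = 2022
# Sample strings in two different orders; both are assumptions for illustration.
samples = ["Jan 1", "20 Mar"]

# Prefix the year, let dateutil work out the rest, then format as dd.mm.yyyy.
formatted = [duParse(f"{y} {d}").strftime("%d.%m.%Y") for d in samples]
print(formatted)
```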
how to create column "Day" where will be value like: sobota, niedziela and so on
So far, we already have a Day column with Monday/Tuesday/etc., but if you want them in Polish, you could use a translation dictionary [like daysDict below].
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
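If the table had no Day column at all, another option (not what the answer above does) would be to derive the weekday from the formatted date itself. A minimal sketch using only the standard library; POLISH_DAYS and polish_day are hypothetical names:

```python
from datetime import datetime

# Polish weekday names indexed by datetime.weekday() (Monday is 0).
POLISH_DAYS = [
    "Poniedziałek", "Wtorek", "Środa", "Czwartek", "Piątek", "Sobota", "Niedziela",
]


def polish_day(date_str):
    """Return the Polish weekday name for a date formatted like '01.01.2022'."""
    return POLISH_DAYS[datetime.strptime(date_str, "%d.%m.%Y").weekday()]


days = [polish_day(d) for d in ["01.01.2022", "20.03.2022"]]
print(days)
```

For the two sample dates this agrees with the table in the question (sobota and niedziela).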
If you want to translate everything [except for Date], you could use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)
# !pip install googletrans==3.1.0a0
from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]
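The loop above sends one request per cell, even though values such as "Public holiday" repeat many times. A possible refinement is to translate each distinct value only once. The sketch below uses a stand-in function instead of the real translator so that it runs offline; translate_column and fake_translate are hypothetical names.

```python
def translate_column(values, translate_fn):
    """Translate each distinct value once and reuse the result for repeats."""
    cache = {v: translate_fn(v) for v in dict.fromkeys(values)}
    return [cache[v] for v in values]


calls = []


def fake_translate(text):
    # Stand-in for: translator.translate(text, src='en', dest='pl').text
    calls.append(text)
    return text.upper()


types = ["Public holiday", "Season", "Public holiday"]
translated = translate_column(types, fake_translate)
print(translated, len(calls))
```

With the real module, the second argument would be something like lambda d: translator.translate(d, src='en', dest='pl').text.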
[because you commented about] "sample of code with loop"
Since the page links have a consistent format, you can loop through various countries and years.
First, import the necessary libraries and define the translation dictionary along with a function that tries to parse and format the date (but returns a null value (None) if it fails):
import pandas
from dateutil.parser import parse as duParse
daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
def try_dup(dStr, yr):
    # parse e.g. "2022 Jan 1" and format it as "01.01.2022"; None if parsing fails
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception:
        return None
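A quick check of how such a helper behaves on bad input may be useful. The inputs below are made up: one valid date, one impossible date, and a NaN like the ones pandas produces for empty cells.

```python
from dateutil.parser import parse as duParse


def try_dup(dStr, yr):
    # Same idea as the helper above: return None instead of raising.
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception:
        return None


results = [try_dup(d, 2022) for d in ["Jan 1", "Feb 30", float("nan")]]
print(results)
```

Rows where the helper returns None can later be removed with dropna on the Date column.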
Then, set the start and end years as well as a list of countries:
startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']
Now, we're ready to loop through the countries and years to collect data:
dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try:
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]
            cydf = cydf.drop(  # only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']]  # parse+format date
            cydf['Country'] = country.capitalize()  # add+fill a column with country name
            dfList.append(cydf.dropna(axis='rows', subset=['Date']))  # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e:
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')
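The same loop structure can be tried offline by passing the page reader in as a parameter. This is only a sketch: scrape_all and fake_read_table are hypothetical names, and the fake reader returns toy data and fails for one year on purpose.

```python
import pandas


def scrape_all(countries, years, read_table):
    """Collect one DataFrame per (country, year); record failures instead of stopping."""
    frames, failures = [], []
    for country in countries:
        for year in years:
            url = f'https://www.timeanddate.com/holidays/{country}/{year}'
            try:
                frame = read_table(url)
            except Exception as e:
                failures.append((url, repr(e)))  # keep going after a bad page
                continue
            frames.append(frame.assign(Country=country.capitalize()))
    return frames, failures


def fake_read_table(url):
    # Hypothetical stand-in for: pandas.read_html(url)[0]
    if url.endswith('/2011'):
        raise ValueError('No tables found')
    return pandas.DataFrame({'Date': ['Jan 1'], 'Name': ["New Year's Day"]})


frames, failures = scrape_all(['kenya', 'tonga'], range(2010, 2012), fake_read_table)
print(len(frames), 'tables,', len(failures), 'failures')
```

For real use, the third argument could be lambda url: pandas.read_html(url)[0].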
After looping, all the DataFrames can be combined into one before translating the days:
acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
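The combine step can be checked on toy data as well. The two frames below are made up for illustration and stand in for entries of dfList; the dictionary is shortened to the two days that occur.

```python
import pandas

daysDict = {'Saturday': 'Sobota', 'Sunday': 'Niedziela'}  # shortened for the example

# Two toy per-page frames standing in for the entries of dfList.
dfList = [
    pandas.DataFrame({'Date': ['01.01.2022'], 'Day': ['Saturday'],
                      'Name': ["New Year's Day"], 'Type': ['Public holiday'],
                      'Country': ['Kenya']}),
    pandas.DataFrame({'Date': ['20.03.2022'], 'Day': ['Sunday'],
                      'Name': ['March Equinox'], 'Type': ['Season'],
                      'Country': ['Belgium']}),
]

acydf = pandas.concat(dfList, ignore_index=True)  # one frame with a fresh 0..n-1 index
acydf['Day'] = acydf['Day'].map(lambda d: daysDict.get(d, d))  # unknown values pass through
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']]  # rearrange columns
print(acydf)
```

Here .map with dict.get is used as an equivalent of the list comprehension above; both leave values that are missing from the dictionary unchanged.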
A sample of acydf [printed with print(acydf.loc[::66].to_markdown(index=False))]:
| Country | Date | Day | Name | Type |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya | 01.01.2012 | Niedziela | New Year's Day | Public holiday |
| Kenya | 19.07.2015 | Niedziela | Eid al-Fitr | Public holiday |
| Kenya | 10.10.2018 | Środa | Moi Day | Public holiday |
| Kenya | 10.10.2021 | Niedziela | Huduma Day | Public holiday |
| Kenya | 26.12.2023 | Wtorek | Boxing Day | Public holiday |
| Kenya | 01.01.2027 | Piątek | New Year's Day | Public holiday |
| Kenya | 14.04.2030 | Niedziela | Eid al-Adha (Tentative Date) | Optional Holiday |
| Tonga | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday |
| Tonga | 25.04.2016 | Poniedziałek | ANZAC Day | Public Holiday |
| Tonga | 04.12.2019 | Środa | Anniversary of the Coronation of King Tupou I | Public Holiday |
| Tonga | 04.06.2023 | Niedziela | Emancipation Day | Public Holiday |
| Tonga | 01.01.2027 | Piątek | New Year's Day | Public Holiday |
| Tonga | 04.11.2030 | Poniedziałek | Constitution Day | Public Holiday |
| Belgium | 06.12.2011 | Wtorek | St. Nicholas Day | Observance |
| Belgium | 06.12.2013 | Piątek | St. Nicholas Day | Observance |
| Belgium | 06.12.2015 | Niedziela | St. Nicholas Day | Observance |
| Belgium | 15.11.2017 | Środa | Day of the German-speaking Community | Regional government holiday |
| Belgium | 01.11.2019 | Piątek | All Saints' Day | National holiday |
| Belgium | 31.10.2021 | Niedziela | Halloween | Observance |
| Belgium | 23.09.2023 | Sobota | September Equinox | Season |
| Belgium | 15.08.2025 | Piątek | Assumption of Mary | National holiday |
| Belgium | 11.07.2027 | Niedziela | Day of the Flemish Community | Regional government holiday |
| Belgium | 10.06.2029 | Niedziela | Father's Day | Observance |