
How to read table from url as DataFrame and modify format of data in one column in Python Pandas?

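I have a link to a website with a table like the one below: https://www.timeanddate.com/holidays/kenya/2022

How can I:

  1. Read this table as a DataFrame in Python, in a Jupyter Notebook?
  2. Convert the "Date" column so that it has a date format like "01.01.2022" rather than the "1 sty" format shown on the website?
  3. Create a "Day" column from the values such as sobota, niedziela, etc. that currently sit between the "Date" and "Name" columns?

So, as a result, I need something like this:

| Date       | Day       | Name           | Type           |
|:-----------|:----------|:---------------|:---------------|
| 01.01.2022 | sobota    | New Year's Day | Public holiday |
| 20.03.2022 | niedziela | March Equinox  | Season         |
| ...        | ...       | ...            | ...            |

How can I do that in Python Pandas?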

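You can do this with the beautifulsoup library. If you right-click the web page in Google Chrome and inspect it, you can see the page structure; it is well structured, so extracting the data between the HTML tags is easy. Also, if you want to extract the data for all years, just loop over the page URL: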

https://www.timeanddate.com/holidays/kenya/2022 https://www.timeanddate.com/holidays/kenya/2021 ...
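
The "loop over the URL" advice can be sketched as follows. This is only an illustration: the helper name `holiday_urls` and the year range are assumptions made for the example, and only the URL pattern comes from the links above.

```python
def holiday_urls(country, start_year, end_year):
    """Build one timeanddate.com holiday URL per year (inclusive range)."""
    base = "https://www.timeanddate.com/holidays"
    return [f"{base}/{country}/{year}" for year in range(start_year, end_year + 1)]


urls = holiday_urls("kenya", 2021, 2022)
for url in urls:
    print(url)
```

Each URL in the list could then be fetched and parsed in turn.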

To read the table on the website as a DataFrame in a Jupyter Notebook, you can use the pandas library directly. You can try something similar to this:

from datetime import datetime as dt
import pandas as pd

# Year
year = "2022"

# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]

# Drop NaN
df = df.dropna()

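# Convert the "Date" column to the desired date format, e.g. "01.01.2022".
# Note: the [0] below assumes read_html returned a two-level header, so that
# df["Date"] is a one-column DataFrame rather than a Series.
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y").strftime("%d.%m.%Y") for i in range(0, len(df))]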

# Display the DataFrame
df
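
One caveat: `%b` in `strptime` matches the month abbreviations of the active locale, so a Polish value such as "1 sty" (as in the question) would not parse under an English locale. A minimal sketch of a locale-independent alternative is shown here. The names `PL_MONTHS` and `format_pl_date` are hypothetical, and the mapping assumes the page uses the standard Polish abbreviations in a "day month" layout.

```python
from datetime import datetime

# Standard Polish month abbreviations; assumed to match what the
# Polish-language page shows (the question quotes "1 sty").
PL_MONTHS = {
    "sty": 1, "lut": 2, "mar": 3, "kwi": 4, "maj": 5, "cze": 6,
    "lip": 7, "sie": 8, "wrz": 9, "paź": 10, "lis": 11, "gru": 12,
}


def format_pl_date(text, year):
    """Turn a string such as '1 sty' into 'DD.MM.YYYY' for the given year."""
    day, month_abbr = text.split()
    month = PL_MONTHS[month_abbr.lower().rstrip(".")]
    return datetime(int(year), month, int(day)).strftime("%d.%m.%Y")


print(format_pl_date("1 sty", 2022))   # 01.01.2022
print(format_pl_date("20 mar", 2022))  # 20.03.2022
```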

Read this table as a DataFrame

You can probably use pandas.read_html directly:

import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]

and clean it up a bit by resetting the column headers and dropping empty rows:

khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
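
To see what this cleanup does without hitting the website, here is a small offline sketch. The `raw` frame is a hand-made stand-in for the scraped table (its shape is an assumption, not real page output).

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the scraped table: generic column labels and one
# completely empty separator row (assumed shape, not real page output).
raw = pd.DataFrame(
    [
        ["Jan 1", "Saturday", "New Year's Day", "Public holiday"],
        [np.nan, np.nan, np.nan, np.nan],
        ["Mar 20", "Sunday", "March Equinox", "Season"],
    ],
    columns=[0, 1, 2, 3],
)

# Rename the columns, then drop only the rows in which every cell is NaN.
clean = raw.set_axis(
    ["Date", "Day", "Name", "Type"], axis="columns"
).dropna(axis="rows", how="all")

print(clean)
```

With `how="all"`, a row that has at least one value is kept, so partially filled rows survive the cleanup.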

Convert the "Date" column so that it has a date format like "01.01.2022"

You can parse the dates with dateutil.parser and then format them with .strftime:

from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
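
As an alternative to the list comprehension, the same conversion can be sketched with pandas' own vectorized parser. This is not the approach used above: it assumes English "Mon D" values such as "Jan 1" and an English or C locale, and the sample values are made up for illustration.

```python
import pandas as pd

year = 2022
dates = pd.Series(["Jan 1", "Mar 20", "Dec 26"])  # sample values, English locale assumed

# Append the year, parse with an explicit format, then format as DD.MM.YYYY.
parsed = pd.to_datetime(dates + f" {year}", format="%b %d %Y")
formatted = parsed.dt.strftime("%d.%m.%Y")

print(formatted.tolist())  # ['01.01.2022', '20.03.2022', '26.12.2022']
```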

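How to create a "Day" column with values like sobota, niedziela, etc.

At this point we already have a Day column containing Monday/Tuesday/etc., but if you want the names in Polish, you can use a translation dictionary [like daysDict below].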

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
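
If the Day column is missing or not trusted, the weekday can also be derived from the formatted date itself. The following is a sketch of that alternative, not part of the approach above; `PL_DAYS` and the sample rows are assumptions for the example.

```python
import pandas as pd

# Polish weekday names indexed by pandas' dayofweek (Monday=0 ... Sunday=6).
PL_DAYS = ["Poniedziałek", "Wtorek", "Środa", "Czwartek", "Piątek", "Sobota", "Niedziela"]

df = pd.DataFrame({"Date": ["01.01.2022", "20.03.2022"]})  # sample rows

# Parse the DD.MM.YYYY strings and look up the Polish name for each weekday.
weekday_index = pd.to_datetime(df["Date"], format="%d.%m.%Y").dt.dayofweek
df["Day"] = weekday_index.map(lambda i: PL_DAYS[i])

print(df["Day"].tolist())  # ['Sobota', 'Niedziela']
```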

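If you want to translate everything [except Date], you can use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)

# !pip install googletrans==3.1.0a0
from googletrans import Translator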
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]


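[Since you commented] "Code sample with loop"

Because the page links have a consistent format, you can loop over different countries and years.

First, import the necessary libraries and define the translation dictionary, along with a function that tries to parse and format a date (returning a null value (None) if that fails):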

import pandas
from dateutil.parser import parse as duParse

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}

def try_dup(dStr, yr):
    try: return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception: return None  # unparseable value (e.g. NaN from an empty row)
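
A quick check of how this helper behaves is sketched here. The function is repeated so that the snippet runs on its own, and the inputs are sample values rather than scraped data.

```python
from dateutil.parser import parse as duParse


def try_dup(dStr, yr):
    # Same helper as above, repeated so this snippet is self-contained.
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except Exception:
        return None


print(try_dup('Jan 1', 2022))       # 01.01.2022
print(try_dup(float('nan'), 2022))  # None
```

Returning None for unparseable values is what later allows rows without a usable date to be dropped with dropna.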

Then, set the start and end years and the list of countries:

startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']

Now we are ready to loop over the countries and years to collect the data:

dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try: 
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]

            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name

            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e: 
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')

After the loop, all the DataFrames can be combined into one before the day names are translated:

acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
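
The combining step can be tried offline with two toy frames. The names `kenya`, `tonga`, and `combined` are made up for this sketch, and each frame holds a single hand-written sample row rather than scraped data.

```python
import pandas as pd

# Two one-row frames standing in for per-country, per-year results (toy data).
kenya = pd.DataFrame({"Date": ["01.01.2022"], "Day": ["Saturday"],
                      "Name": ["New Year's Day"], "Type": ["Public holiday"],
                      "Country": ["Kenya"]})
tonga = pd.DataFrame({"Date": ["01.01.2027"], "Day": ["Friday"],
                      "Name": ["New Year's Day"], "Type": ["Public Holiday"],
                      "Country": ["Tonga"]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 0.
combined = pd.concat([kenya, tonga], ignore_index=True)
combined = combined[["Country", "Date", "Day", "Name", "Type"]]  # reorder columns

print(combined)
```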

Sample of acydf [printed with print(acydf.loc[::66].to_markdown(index=False))]:

| Country   | Date       | Day          | Name                                          | Type                        |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya     | 01.01.2012 | Niedziela    | New Year's Day                                | Public holiday              |
| Kenya     | 19.07.2015 | Niedziela    | Eid al-Fitr                                   | Public holiday              |
| Kenya     | 10.10.2018 | Środa        | Moi Day                                       | Public holiday              |
| Kenya     | 10.10.2021 | Niedziela    | Huduma Day                                    | Public holiday              |
| Kenya     | 26.12.2023 | Wtorek       | Boxing Day                                    | Public holiday              |
| Kenya     | 01.01.2027 | Piątek       | New Year's Day                                | Public holiday              |
| Kenya     | 14.04.2030 | Niedziela    | Eid al-Adha (Tentative Date)                  | Optional Holiday            |
| Tonga     | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday              |
| Tonga     | 25.04.2016 | Poniedziałek | ANZAC Day                                     | Public Holiday              |
| Tonga     | 04.12.2019 | Środa        | Anniversary of the Coronation of King Tupou I | Public Holiday              |
| Tonga     | 04.06.2023 | Niedziela    | Emancipation Day                              | Public Holiday              |
| Tonga     | 01.01.2027 | Piątek       | New Year's Day                                | Public Holiday              |
| Tonga     | 04.11.2030 | Poniedziałek | Constitution Day                              | Public Holiday              |
| Belgium   | 06.12.2011 | Wtorek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2013 | Piątek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2015 | Niedziela    | St. Nicholas Day                              | Observance                  |
| Belgium   | 15.11.2017 | Środa        | Day of the German-speaking Community          | Regional government holiday |
| Belgium   | 01.11.2019 | Piątek       | All Saints' Day                               | National holiday            |
| Belgium   | 31.10.2021 | Niedziela    | Halloween                                     | Observance                  |
| Belgium   | 23.09.2023 | Sobota       | September Equinox                             | Season                      |
| Belgium   | 15.08.2025 | Piątek       | Assumption of Mary                            | National holiday            |
| Belgium   | 11.07.2027 | Niedziela    | Day of the Flemish Community                  | Regional government holiday |
| Belgium   | 10.06.2029 | Niedziela    | Father's Day                                  | Observance                  |
