How to read a table from a URL as a DataFrame and modify the format of one column in Python Pandas?

I have a link to a website with a table like the following: https://www.timeanddate.com/holidays/kenya/2022

How can I:

  1. Read this table as a DataFrame in a Jupyter Notebook in Python?
  2. Convert the "Date" column so that dates look like "01.01.2022" instead of the form shown on the website ("1 sty")?
  3. Create a "Day" column holding values such as sobota, niedziela and so on, which currently sit between the "Date" and "Name" columns?

So, as a result, I need something like the following:

| Date       | Day       | Name           | Type           |
|:-----------|:----------|:---------------|:---------------|
| 01.01.2022 | sobota    | New Year's Day | Public holiday |
| 20.03.2022 | niedziela | March Equinox  | Season         |
| ...        | ...       | ...            | ...            |

How can I do that in Python Pandas?

You can do this with the beautifulsoup library. If you right-click the web page in Google Chrome and inspect it, you can see the structure of the page; it is well structured, so it is easy to extract the data between the HTML tags. Also, if you want to extract data for all years, just loop over the URLs:

https://www.timeanddate.com/holidays/kenya/2022
https://www.timeanddate.com/holidays/kenya/2021
...
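A minimal sketch of that idea, run against an inline HTML snippet instead of the live site. The markup in SAMPLE_HTML and the helper names table_rows and year_urls are assumptions made for illustration; the real page's tags may differ, so inspect it before relying on any selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the holiday table on the page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Day</th><th>Name</th><th>Type</th></tr>
  <tr><td>Jan 1</td><td>Saturday</td><td>New Year's Day</td><td>Public holiday</td></tr>
  <tr><td>Mar 20</td><td>Sunday</td><td>March Equinox</td><td>Season</td></tr>
</table>
"""


def table_rows(html):
    """Return the text of every table row as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:  # skip rows without any cells
            rows.append(cells)
    return rows


def year_urls(country, first_year, last_year):
    """Build one URL per year, following the pattern of the links above."""
    base = "https://www.timeanddate.com/holidays"
    return [f"{base}/{country}/{year}" for year in range(first_year, last_year + 1)]


header, *holidays = table_rows(SAMPLE_HTML)
urls = year_urls("kenya", 2021, 2022)
print(header)
print(holidays)
print(urls)
```

To work on the live pages you would download each URL first (for example with requests) and pass the response text to table_rows.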

To read the table on the website as a DataFrame in Jupyter Notebook, you can use the pandas library directly. You can try something similar to this:

from datetime import datetime as dt
import pandas as pd

# Year
year = "2022"

# Read the table on the website into a DataFrame
df = pd.read_html("https://www.timeanddate.com/holidays/kenya/"+year)[0]

# Drop NaN
df = df.dropna()

# Convert the "Date" column to the desired date format
df["Date"] = df["Date"].apply(lambda date: date + " " + year)
df["Date"] = [dt.strptime(df["Date"].iloc[i][0], "%b %d %Y").strftime("%d.%m.%Y") for i in range(0, len(df))]

# Display the DataFrame
df

read this table as DataFrame

You can probably just use pandas.read_html directly.

import pandas
khdf = pandas.read_html('https://www.timeanddate.com/holidays/kenya/2022')[0]

Then clean up a bit by resetting the column headers and dropping the empty rows:

khdf = khdf.set_axis(
    ['Date', 'Day', 'Name', 'Type'], axis='columns'
).dropna(axis='rows', how='all')
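To see what this cleanup does without hitting the website, here is a small sketch on a hand-built frame. The raw frame is an assumption about what read_html might return (an unnamed weekday column and an all-empty separator row); the two holiday rows are taken from the question.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pandas.read_html:
# an unnamed weekday column plus an all-empty separator row.
raw = pd.DataFrame(
    [
        ["Jan 1", "Saturday", "New Year's Day", "Public holiday"],
        [None, None, None, None],
        ["Mar 20", "Sunday", "March Equinox", "Season"],
    ],
    columns=["Date", "Unnamed: 1", "Name", "Type"],
)

clean = (
    raw.set_axis(["Date", "Day", "Name", "Type"], axis="columns")  # rename headers
    .dropna(axis="rows", how="all")  # drop rows where every cell is empty
    .reset_index(drop=True)  # renumber the remaining rows
)
print(clean)
```

Using how="all" keeps rows that are only partly empty, which is safer than a plain dropna() when some holidays lack a value in one column.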

Convert column "Date" so as to have date format like "01.01.2022"

You can parse the date with dateutil.parser and then format it with .strftime.

from dateutil.parser import parse as duParse
y = 2022
khdf['Date'] = [duParse(f'{y} {d}').strftime('%d.%m.%Y') for d in khdf['Date']]
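If you would rather avoid the extra dependency, the same conversion can be sketched with the standard library alone. This assumes the page shows dates as an English month abbreviation followed by the day (such as "Jan 1"); format_holiday_date is a hypothetical helper name.

```python
from datetime import datetime


def format_holiday_date(label, year):
    """Turn a label such as 'Jan 1' plus a year into 'DD.MM.YYYY'."""
    # %b matches the abbreviated month name in the current locale
    # (English in the default C locale).
    parsed = datetime.strptime(f"{label} {year}", "%b %d %Y")
    return parsed.strftime("%d.%m.%Y")


new_year = format_holiday_date("Jan 1", 2022)
equinox = format_holiday_date("Mar 20", 2022)
print(new_year)  # 01.01.2022
print(equinox)   # 20.03.2022
```

Unlike dateutil, strptime is strict about the format, so it raises ValueError on anything that does not match "%b %d %Y".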

how to create column "Day" where will be value like: sobota, niedziela and so on

As it stands, we already have a Day column with Monday/Tuesday/etc., but if you want the names in Polish, you could use a translation dictionary [like daysDict below].

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}
khdf['Day'] = [daysDict[d] if d in daysDict else d for d in khdf['Day']]
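As an alternative sketch, the Polish day name can also be derived from the formatted date itself, which does not depend on the Day column scraped from the page. The names POLISH_DAYS and polish_weekday are hypothetical; the list reuses the spellings from daysDict.

```python
from datetime import datetime

# Index 0 is Monday, matching datetime.weekday().
POLISH_DAYS = ['Poniedziałek', 'Wtorek', 'Środa', 'Czwartek',
               'Piątek', 'Sobota', 'Niedziela']


def polish_weekday(date_str):
    """Return the Polish weekday name for a 'DD.MM.YYYY' date string."""
    return POLISH_DAYS[datetime.strptime(date_str, "%d.%m.%Y").weekday()]


print(polish_weekday("01.01.2022"))  # Sobota
print(polish_weekday("20.03.2022"))  # Niedziela
```

On the DataFrame this could be applied as khdf['Day'] = [polish_weekday(d) for d in khdf['Date']], once the Date column has been converted.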

If you want to translate everything [except for Date], you could use the googletrans module. (I think the version installed by default has some issues, but 3.1.0a0 works for me.)

# !pip install googletrans==3.1.0a0
from googletrans import Translator
translator = Translator()
for c in ['Day', 'Name', 'Type']:
    khdf[c] = [translator.translate(d, src='en', dest='pl').text for d in khdf[c]]


[because you commented about] "sample of code with loop"

Since the page links have a consistent format, you can loop through various countries and years.

First, import the necessary libraries and define the translation dictionary along with a function that tries to parse and format the date (but returns None if it fails):

import pandas
from dateutil.parser import parse as duParse

daysDict = {'Monday': 'Poniedziałek', 'Tuesday': 'Wtorek', 'Wednesday': 'Środa', 'Thursday': 'Czwartek', 'Friday': 'Piątek', 'Saturday': 'Sobota', 'Sunday': 'Niedziela'}

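def try_dup(dStr, yr):
    # parse e.g. "Jan 1" together with the year; return None if parsing fails
    try:
        return duParse(f'{yr} {dStr}').strftime('%d.%m.%Y')
    except (ValueError, OverflowError):
        return None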

Then, set the start and end years as well as a list of countries:

startYear, endYear = 2010, 2030
countryList = ['kenya', 'tonga', 'belgium']

Now we're ready to loop through the countries and years to collect data:

dfList = []
for country in countryList:
    for y in range(startYear, endYear+1):
        try: 
            cyUrl = f'https://www.timeanddate.com/holidays/{country}/{y}'
            cydf = pandas.read_html(cyUrl)[0]

            cydf = cydf.drop(# only the first 4 columns are kept
                [c for c in cydf.columns[4:]], axis='columns'
            ).set_axis(['Date', 'Day', 'Name', 'Type'], axis='columns')
            cydf['Date'] = [try_dup(d, y) for d in cydf['Date']] # parse+format date
            cydf['Country'] = country.capitalize() # add+fill a column with country name

            dfList.append(cydf.dropna(axis='rows', subset=['Date'])) # only add rows with Date
            # print('', end=f'\r{len(dfList[-1])} holidays scraped from {cyUrl}')
        # except: continue ## skip without printing error
        except Exception as e: 
            print('\n', type(e), e, '- failed to scrape from', cyUrl)
# print('\n\n', len(dfList), 'dataframes with', sum([len(d) for d in dfList]),'holidays scraped overall')

After looping, all the DataFrames can be combined into one before translating the days:

acydf = pandas.concat(dfList, ignore_index=True)
acydf['Day'] = [daysDict[d] if d in daysDict else d for d in acydf['Day']] # translate days
acydf = acydf[['Country', 'Date', 'Day', 'Name', 'Type']] # rearrange columns
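The combine-then-translate step can be tried offline on two tiny frames. This is only a sketch: the frames are built by hand (one row from the question, one from the sample output), the names days_pl, kenya, belgium and combined are hypothetical, and Series.map with fillna is used here in place of the list comprehension.

```python
import pandas as pd

days_pl = {"Saturday": "Sobota", "Sunday": "Niedziela"}

# Hand-built stand-ins for two per-country frames collected in the loop.
kenya = pd.DataFrame({"Date": ["01.01.2022"], "Day": ["Saturday"],
                      "Name": ["New Year's Day"], "Type": ["Public holiday"]})
kenya["Country"] = "Kenya"
belgium = pd.DataFrame({"Date": ["31.10.2021"], "Day": ["Sunday"],
                        "Name": ["Halloween"], "Type": ["Observance"]})
belgium["Country"] = "Belgium"

# Stack the frames and renumber the rows from zero.
combined = pd.concat([kenya, belgium], ignore_index=True)
# Translate known day names; keep the original value when there is no match.
combined["Day"] = combined["Day"].map(days_pl).fillna(combined["Day"])
# Put the country first, as in the final table.
combined = combined[["Country", "Date", "Day", "Name", "Type"]]
print(combined)
```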

A sample of acydf [printed with print(acydf.loc[::66].to_markdown(index=False))]:

| Country   | Date       | Day          | Name                                          | Type                        |
|:----------|:-----------|:-------------|:----------------------------------------------|:----------------------------|
| Kenya     | 01.01.2012 | Niedziela    | New Year's Day                                | Public holiday              |
| Kenya     | 19.07.2015 | Niedziela    | Eid al-Fitr                                   | Public holiday              |
| Kenya     | 10.10.2018 | Środa        | Moi Day                                       | Public holiday              |
| Kenya     | 10.10.2021 | Niedziela    | Huduma Day                                    | Public holiday              |
| Kenya     | 26.12.2023 | Wtorek       | Boxing Day                                    | Public holiday              |
| Kenya     | 01.01.2027 | Piątek       | New Year's Day                                | Public holiday              |
| Kenya     | 14.04.2030 | Niedziela    | Eid al-Adha (Tentative Date)                  | Optional Holiday            |
| Tonga     | 17.09.2012 | Poniedziałek | Birthday of Crown Prince Tupouto'a-'Ulukalala | Public Holiday              |
| Tonga     | 25.04.2016 | Poniedziałek | ANZAC Day                                     | Public Holiday              |
| Tonga     | 04.12.2019 | Środa        | Anniversary of the Coronation of King Tupou I | Public Holiday              |
| Tonga     | 04.06.2023 | Niedziela    | Emancipation Day                              | Public Holiday              |
| Tonga     | 01.01.2027 | Piątek       | New Year's Day                                | Public Holiday              |
| Tonga     | 04.11.2030 | Poniedziałek | Constitution Day                              | Public Holiday              |
| Belgium   | 06.12.2011 | Wtorek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2013 | Piątek       | St. Nicholas Day                              | Observance                  |
| Belgium   | 06.12.2015 | Niedziela    | St. Nicholas Day                              | Observance                  |
| Belgium   | 15.11.2017 | Środa        | Day of the German-speaking Community          | Regional government holiday |
| Belgium   | 01.11.2019 | Piątek       | All Saints' Day                               | National holiday            |
| Belgium   | 31.10.2021 | Niedziela    | Halloween                                     | Observance                  |
| Belgium   | 23.09.2023 | Sobota       | September Equinox                             | Season                      |
| Belgium   | 15.08.2025 | Piątek       | Assumption of Mary                            | National holiday            |
| Belgium   | 11.07.2027 | Niedziela    | Day of the Flemish Community                  | Regional government holiday |
| Belgium   | 10.06.2029 | Niedziela    | Father's Day                                  | Observance                  |
