Update a pandas dataframe column based on a date column using a list of datetimes

Question

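See the title for the gist. For every holiday in the second list that is not in the first list, I need to add 0.5 business days to the business_days column. Here is a sample input df named predicted_df: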

PredictionTargetDateEOM business_days
0       2022-06-30      22
1       2022-06-30      22
2       2022-06-30      22
3       2022-06-30      22
4       2022-06-30      22
        ... ... ...
172422  2022-11-30      21
172423  2022-11-30      21
172424  2022-11-30      21
172425  2022-11-30      21
172426  2022-11-30      21

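PredictionTargetDateEOM is simply the last day of the month. business_days is the number of business days in that month and should be the same for every row within the month. There are two holiday lists. For each holiday that appears in the second list but not in the first, business_days should be increased by 0.5 in every row of the dataframe that falls in that holiday's month.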

rocket_holiday = ["New Year's Day", "Martin Luther King Jr. Day", "Memorial Day", "Independence Day",
                 "Labor Day", "Thanksgiving", "Christmas Day"]
rocket_holiday_including_observed = rocket_holiday + [item + ' (Observed)' for item in rocket_holiday]
print(rocket_holiday_including_observed)
["New Year's Day",
 'Martin Luther King Jr. Day',
 'Memorial Day',
 'Independence Day',
 'Labor Day',
 'Thanksgiving',
 'Christmas Day',
 "New Year's Day (Observed)",
 'Martin Luther King Jr. Day (Observed)',
 'Memorial Day (Observed)',
 'Independence Day (Observed)',
 'Labor Day (Observed)',
 'Thanksgiving (Observed)',
 'Christmas Day (Observed)']

banker_hols = [i for i in holidays.US(years = 2022).values()]
print(banker_hols)
2022-01-01 New Year's Day
2022-01-17 Martin Luther King Jr. Day
2022-02-21 Washington's Birthday
2022-05-30 Memorial Day
2022-06-19 Juneteenth National Independence Day
2022-06-20 Juneteenth National Independence Day (Observed)
2022-07-04 Independence Day
2022-09-05 Labor Day
2022-10-10 Columbus Day
2022-11-11 Veterans Day
2022-11-24 Thanksgiving
2022-12-25 Christmas Day
2022-12-26 Christmas Day (Observed)

The second list is actually derived from a dictionary, as follows:

import holidays
# The dictionary maps each date to its holiday name
for date, name in holidays.US(years=2022).items():
    print(date, name)

The raw dictionary looks like this:

{datetime.date(2022, 1, 1): "New Year's Day", datetime.date(2022, 1, 17): 'Martin Luther King Jr. Day', datetime.date(2022, 2, 21): "Washington's Birthday", datetime.date(2022, 5, 30): 'Memorial Day', datetime.date(2022, 6, 19): 'Juneteenth National Independence Day', datetime.date(2022, 6, 20): 'Juneteenth National Independence Day (Observed)', datetime.date(2022, 7, 4): 'Independence Day', datetime.date(2022, 9, 5): 'Labor Day', datetime.date(2022, 10, 10): 'Columbus Day', datetime.date(2022, 11, 11): 'Veterans Day', datetime.date(2022, 11, 24): 'Thanksgiving', datetime.date(2022, 12, 25): 'Christmas Day', datetime.date(2022, 12, 26): 'Christmas Day (Observed)'}

Here is a sample output showing the desired result:

PredictionTargetDateEOM business_days
0       2022-06-30      22.5
1       2022-06-30      22.5
2       2022-06-30      22.5
3       2022-06-30      22.5
4       2022-06-30      22.5
        ... ... ...
172422  2022-11-30      21.5
172423  2022-11-30      21.5
172424  2022-11-30      21.5
172425  2022-11-30      21.5
172426  2022-11-30      21.5

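As you can see, because Juneteenth and Veterans Day are in the second list but not in the first, I add 0.5 days to the business_days column of every row whose month is June or November. For other months such as July or January, where both lists share the holiday, business_days should stay unchanged. Finally, the approach should also be robust enough to backfill historical data from previous years. I tried the following, but it does not behave as needed: it either drops entire months from the dataframe or, for the months it keeps, leaves business_days unchanged in the months where I need it updated.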

main_list = list(set(banker_hols) - set(rocket_holiday_including_observed))
print(main_list)

['Columbus Day',
 'Juneteenth National Independence Day',
 "Washington's Birthday",
 'Juneteenth National Independence Day (Observed)',
 'Veterans Day']

result = []
for key, value in holidays.US(years = 2022).items():
    if value in main_list:
        result.append(key)
print(result)

[datetime.date(2022, 2, 21),
 datetime.date(2022, 6, 19),
 datetime.date(2022, 6, 20),
 datetime.date(2022, 10, 10),
 datetime.date(2022, 11, 11)]

So I have the months that need 0.5 business days added, but I am not sure how to update the business_days column for all rows of the dataframe that belong to those months.

Edit: the problem is solved here: "Add amount to a pandas column if a row condition is met"

My answer uses the key .loc indexer shown in the linked question:

#Identify holidays in banker list not in rocket list
banker_hols = [i for i in holidays.US(years = 2022).values()]
hol_diffs = list(set(banker_hols) - set(rocket_holiday_including_observed))

#Extract dates of those holidays
dates_of_hols = []
for key, value in holidays.US(years = 2022).items():
    if value in hol_diffs:
        dates_of_hols.append(key)

#Extract just the months of those holidays
months = []
for item in dates_of_hols:
    months.append(item.month)
months = list(set(months))

#Add 0.5 to business_days for those months
predicted_df.loc[predicted_df['PredictionTargetDateEOM'].dt.month.isin(months), 'business_days'] += 0.5
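One caveat about the snippet above: `.dt.month.isin(months)` matches the month number regardless of year, which matters once the frame holds several years of backfilled data. A minimal sketch of a year-aware variant follows. The toy frame, the hard-coded holiday dates, and the names `holiday_keys` and `row_keys` are illustrative assumptions, not part of the original code.

```python
import datetime

import pandas as pd

# Toy stand-in for predicted_df (illustrative values only).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2021-06-30", "2022-06-30", "2022-07-31", "2022-11-30"]
    ),
    "business_days": [22.0, 22.0, 20.0, 21.0],
})

# Dates of holidays found only in the bankers' list. Hard-coded here for 2022;
# in practice they would be collected from holidays.US(years=...).
dates_of_hols = [
    datetime.date(2022, 6, 19),
    datetime.date(2022, 6, 20),
    datetime.date(2022, 11, 11),
]

# Encode year and month together (e.g. 202206) so June 2021 != June 2022.
holiday_keys = {d.year * 100 + d.month for d in dates_of_hols}
eom = predicted_df["PredictionTargetDateEOM"]
row_keys = eom.dt.year * 100 + eom.dt.month

# Add 0.5 once per affected year-month, as in the month-only version.
predicted_df.loc[row_keys.isin(holiday_keys), "business_days"] += 0.5
print(predicted_df)
```

Because only 2022 dates are listed, the June 2021 row is left untouched here, whereas a bare month check would have bumped it as well.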

Answer 1

We only need the dates of the relevant holidays:

relevant_holidays = {
    x: y for x, y in holidays.US(years=2022).items() 
    if y not in rocket_holiday_including_observed
}

We use a bit of pandas magic to get the corresponding month-end dates:

holiday_month_end = pd.to_datetime(
    list(relevant_holidays.keys())
).to_period("M").to_timestamp("M")

DatetimeIndex(['2022-02-28', '2022-06-30', '2022-06-30', '2022-10-31',
               '2022-11-30'],
              dtype='datetime64[ns]', freq=None)
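If the period round trip feels opaque, the same month-end dates can probably be obtained by adding a `MonthEnd(0)` offset, which rolls each date forward to the end of its month and leaves dates already at month end alone. This is only a sketch of an alternative, with the holiday dates hard-coded from the output above rather than pulled from the `holidays` package.

```python
import datetime

import pandas as pd

# The five relevant holiday dates for 2022, copied from the listing above.
holiday_dates = [
    datetime.date(2022, 2, 21),
    datetime.date(2022, 6, 19),
    datetime.date(2022, 6, 20),
    datetime.date(2022, 10, 10),
    datetime.date(2022, 11, 11),
]

# MonthEnd(0) rolls each date forward to the last day of its own month.
holiday_month_end = pd.to_datetime(holiday_dates) + pd.offsets.MonthEnd(0)
print(holiday_month_end)
```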

Before joining, we count them per month and multiply by 0.5:

to_add = holiday_month_end.value_counts() * 0.5

2022-06-30    1.0
2022-02-28    0.5
2022-10-31    0.5
2022-11-30    0.5
dtype: float64

The index is now unique. To align it with the dataframe, use reindex:

predicted_df["business_days"] = predicted_df["business_days"] + to_add.reindex(
    pd.to_datetime(predicted_df["PredictionTargetDateEOM"])
).fillna(0).values

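fillna is necessary because to_add does not have an entry for every month. The .values is needed to drop the index; otherwise + would try to match index values instead of preserving row order.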
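To make the alignment step concrete, here is a small sketch on a toy frame. It also shows `Series.map` as one possible alternative that keeps the dataframe's own index, so no `.values` is needed. The toy data and the names `via_values` and `via_map` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for predicted_df (illustrative values only).
predicted_df = pd.DataFrame({
    "PredictionTargetDateEOM": pd.to_datetime(
        ["2022-06-30", "2022-06-30", "2022-07-31"]
    ),
    "business_days": [22.0, 22.0, 20.0],
})

# Amount to add per month end; only June has an entry, July does not.
to_add = pd.Series([1.0], index=pd.to_datetime(["2022-06-30"]))

# reindex looks up one value per row; months without an entry become NaN -> 0.
# .values drops the date index so the addition is purely positional.
via_values = predicted_df["business_days"] + to_add.reindex(
    predicted_df["PredictionTargetDateEOM"]
).fillna(0).values

# map performs the same lookup but returns a Series on the frame's own index.
via_map = predicted_df["business_days"] + (
    predicted_df["PredictionTargetDateEOM"].map(to_add).fillna(0)
)
print(via_values.tolist(), via_map.tolist())
```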

Answer 2

Here is a more modular, Pythonic solution:

my_list = [
    "New Year's Day",
    "Martin Luther King Jr. Day",
    "Memorial Day",
    "Independence Day",
    "Labor Day",
    "Thanksgiving",
    "Christmas Day",
    "New Year's Day (Observed)",
    "Martin Luther King Jr. Day (Observed)",
    "Memorial Day (Observed)",
    "Independence Day (Observed)",
    "Labor Day (Observed)",
    "Thanksgiving (Observed)",
    "Christmas Day (Observed)",
]

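import holidays
import numpy as np

# to speed up the search
my_set = set(my_list)

# PredictionTargetDateBOM (first day of the month) and DayAfterTargetDateEOM
# (day after the last day of the month) are datetime columns of predicted_df.

# Business days when every US bank holiday is excluded
predicted_df['business_days_bankers'] = predicted_df.apply(
    lambda x: np.busday_count(
        x['PredictionTargetDateBOM'].date(),
        x['DayAfterTargetDateEOM'].date(),
        holidays=[k for k, v in holidays.US(years=x['PredictionTargetDateBOM'].year).items()],
    ),
    axis=1,
)

# Business days when only the holidays in my_set are excluded
predicted_df['business_days_rocket'] = predicted_df.apply(
    lambda x: np.busday_count(
        x['PredictionTargetDateBOM'].date(),
        x['DayAfterTargetDateEOM'].date(),
        holidays=[k for k, v in holidays.US(years=x['PredictionTargetDateBOM'].year).items() if v in my_set],
    ),
    axis=1,
)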

cols = ['business_days_bankers', 'business_days_rocket']
predicted_df['business_days_final'] = predicted_df[cols].mean(axis = 1)
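To see what the averaging does for a single month, here is a small worked sketch for June 2022. The holiday dates are hard-coded from the listing in the question instead of being read from the `holidays` package, and the variable names are illustrative.

```python
import numpy as np

# June 2022 as a half-open range: [first of month, first of next month).
month_start, next_month_start = "2022-06-01", "2022-07-01"

# Bank holidays in June 2022 (Juneteenth on a Sunday, observed on Monday).
banker_holidays = np.array(["2022-06-19", "2022-06-20"], dtype="datetime64[D]")
# None of the seven company holidays falls in June.
rocket_holidays = np.array([], dtype="datetime64[D]")

business_days_bankers = np.busday_count(
    month_start, next_month_start, holidays=banker_holidays
)
business_days_rocket = np.busday_count(
    month_start, next_month_start, holidays=rocket_holidays
)

# Each weekday bank-only holiday counts as half a business day.
business_days_final = (business_days_bankers + business_days_rocket) / 2
print(business_days_bankers, business_days_rocket, business_days_final)
```

Under these assumptions the counts are 21 and 22, so the mean is 21.5. That is the company count minus 0.5, not the 22.5 shown in the question's sample output, so it is worth checking which direction the adjustment should go before adopting this approach.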

2 answers

Answer 1: 0, 2022-08-01 18:46:21

Answer 2: 0, accepted, 2022-08-08 19:34:13