resample() over MultiIndex
I have a DataFrame df with a three-level MultiIndex. The innermost level is a datetime.
                                   value    data_1 data_2  data_3  data_4
id_1     id_2  effective_date
ADH10685 CA1P0 2018-07-31       0.000048  17901701   3mra  Actual  198.00
               2018-08-31       0.000048  17901701   3mra  Actual  198.00
         CB0N0 2018-07-31       4.010784  17901701   3mra  Actual    0.01
               2018-08-31       2.044298  17901701   3mra  Actual    0.01
               2018-10-31      11.493831  17901701   3mra  Actual    0.01
               2018-11-30      13.929844  17901701   3mra  Actual    0.01
               2018-12-31      21.500490  17901701   3mra  Actual    0.01
         CB0P0 2018-07-31      22.389493  17901701   3mra  Actual    0.03
               2018-08-31      23.600726  17901701   3mra  Actual    0.03
               2018-09-30      45.105458  17901701   3mra  Actual    0.03
               2018-10-31      32.249056  17901701   3mra  Actual    0.03
               2018-11-30      60.790889  17901701   3mra  Actual    0.03
               2018-12-31      46.832914  17901701   3mra  Actual    0.03
You can recreate this DataFrame with the following code:
import pandas as pd

df = pd.DataFrame({
    'id_1': ['ADH10685'] * 13,
    'id_2': ['CA1P0'] * 2 + ['CB0N0'] * 5 + ['CB0P0'] * 6,
    'effective_date': ['2018-07-31', '2018-08-31',
                       '2018-07-31', '2018-08-31', '2018-10-31', '2018-11-30', '2018-12-31',
                       '2018-07-31', '2018-08-31', '2018-09-30', '2018-10-31', '2018-11-30', '2018-12-31'],
    'value': [0.000048, 0.000048,
              4.010784, 2.044298, 11.493831, 13.929844, 21.500490,
              22.389493, 23.600726, 45.105458, 32.249056, 60.790889, 46.832914],
    'data_1': [17901701] * 13,
    'data_2': ['3mra'] * 13,
    'data_3': ['Actual'] * 13,
    'data_4': [198.00] * 2 + [0.01] * 5 + [0.03] * 6,
})
# Parse the dates, then move the three key columns into the index
df.effective_date = pd.to_datetime(df.effective_date)
df = df.groupby(['id_1', 'id_2', 'effective_date']).first()
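The date range I am interested in is 2018-07-31 to 2018-12-31. For each combination of id_1 and id_2, I want to resample value.
For ('ADH10685', 'CA1P0'), I want to get 0 values for September through December. For CB0N0, I want to set September to 0, and for CB0P0 I do not want to change anything.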
                                   value    data_1 data_2  data_3  data_4
id_1     id_2  effective_date
ADH10685 CA1P0 2018-07-31       0.000048  17901701   3mra  Actual  198.00
               2018-08-31       0.000048  17901701   3mra  Actual  198.00
               2018-09-30       0.000000  17901701   3mra  Actual  198.00
               2018-10-31       0.000000  17901701   3mra  Actual  198.00
               2018-11-30       0.000000  17901701   3mra  Actual  198.00
               2018-12-31       0.000000  17901701   3mra  Actual  198.00
         CB0N0 2018-07-31       4.010784  17901701   3mra  Actual    0.01
               2018-08-31       2.044298  17901701   3mra  Actual    0.01
               2018-09-30       0.000000  17901701   3mra  Actual    0.01
               2018-10-31      11.493831  17901701   3mra  Actual    0.01
               2018-11-30      13.929844  17901701   3mra  Actual    0.01
               2018-12-31      21.500490  17901701   3mra  Actual    0.01
         CB0P0 2018-07-31      22.389493  17901701   3mra  Actual    0.03
               2018-08-31      23.600726  17901701   3mra  Actual    0.03
               2018-09-30      45.105458  17901701   3mra  Actual    0.03
               2018-10-31      32.249056  17901701   3mra  Actual    0.03
               2018-11-30      60.790889  17901701   3mra  Actual    0.03
               2018-12-31      46.832914  17901701   3mra  Actual    0.03
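I have already asked a couple of questions related to this topic [1] [2], so I know how to set the upper and lower bounds of the dates and how to resample while keeping the non-value columns intact.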
I developed the following code, which works if I hard-code each level.
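import numpy as np

min_date = '2018-07-31'
max_date = '2018-12-31'
# Slice to a specific combination of id_1 and id_2; copy so that the
# assignments below do not write into a view of df
s = df.loc[('ADH10685', 'CA1P0')].copy()
# Add all-NaN boundary rows so that the resample spans the full date range
if not s.index.isin([min_date]).any():
    s.loc[pd.to_datetime(min_date)] = np.nan
if not s.index.isin([max_date]).any():
    s.loc[pd.to_datetime(max_date)] = np.nan
# Month-end resample (newer pandas releases spell the 'M' alias 'ME'),
# zero-fill value, then fill the other columns forward and backward
s.resample('M').first().fillna({'value': 0}).ffill().bfill()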
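For context, here is a minimal sketch of why the boundary rows matter: resample() on its own only spans the dates between the first and last observation it is given. The two-row frame `demo` is made up for illustration, and `pd.offsets.MonthEnd()` is used instead of a string alias so that the sketch does not depend on how a particular pandas version spells it.

```python
import pandas as pd

# Hypothetical single-pair data: observations in August and October only
demo = pd.DataFrame(
    {'value': [1.0, 2.0]},
    index=pd.to_datetime(['2018-08-31', '2018-10-31']),
)
month_end = pd.offsets.MonthEnd()

# resample() fills the gap inside the observed range (September),
# but does not extend to July, November or December
inner = demo.resample(month_end).first()

# Reindexing against an explicit date range covers the whole window
window = pd.date_range('2018-07-31', '2018-12-31', freq=month_end)
full = demo.reindex(window)

print(len(inner), len(full))
```

Reindexing against an explicit range is one way to avoid appending the NaN boundary rows by hand.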
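I am looking for advice on how best to go through a large DataFrame and apply this logic to each (id_1, id_2) pair. I would also like to clean up the example code above to make it more efficient.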
First, reindex each (id_1, id_2) group by dt.
# Month-end dates covering the range of interest
dt = pd.date_range('2018-07-31', '2018-12-31', freq='M')
# Reindex every (id_1, id_2) group against dt, then flatten back to columns
df = (df.reset_index()
        .groupby(['id_1', 'id_2'])
        .apply(lambda x: x.set_index('effective_date').reindex(dt))
        .drop(columns=['id_1', 'id_2'])
        .reset_index()
        .rename(columns={'level_2': 'effective_date'}))
Then fill the missing values in the value column.
df['value'] = df['value'].fillna(0)
Fill the remaining missing values.
df = df.groupby(['id_1', 'id_2']).apply(lambda x: x.ffill(axis=0).bfill(axis=0))
Set id_1, id_2 and effective_date back as the index.
df.set_index(['id_1', 'id_2', 'effective_date'], inplace=True)
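The steps above can be checked end to end on a small frame. This is only a sketch under stated assumptions: `toy`, `fill_group` and `filled` are hypothetical names, the data is invented, and an explicit loop with pd.concat replaces groupby().apply() so the result does not depend on how a given pandas version treats the grouping columns.

```python
import pandas as pd

# Hypothetical data: ('A', 'x') lacks July and September, ('A', 'y') has July only
toy = pd.DataFrame({
    'id_1': ['A', 'A', 'A'],
    'id_2': ['x', 'x', 'y'],
    'effective_date': pd.to_datetime(['2018-08-31', '2018-10-31', '2018-07-31']),
    'value': [1.5, 2.5, 4.0],
    'data_2': ['3mra', '3mra', 'abc'],
}).set_index(['id_1', 'id_2', 'effective_date'])

# Month-end dates for the window of interest, built from an offset object
dt = pd.date_range('2018-07-31', '2018-10-31',
                   freq=pd.offsets.MonthEnd(), name='effective_date')

def fill_group(g):
    """g is one (id_1, id_2) group indexed by effective_date only."""
    g = g.reindex(dt)                  # add the missing months as NaN rows
    g['value'] = g['value'].fillna(0)  # missing values become 0
    return g.ffill().bfill()           # other columns copy their neighbours

pieces = {
    key: fill_group(g.droplevel(['id_1', 'id_2']))
    for key, g in toy.groupby(level=['id_1', 'id_2'])
}
# The dict keys become the two outer index levels again
filled = pd.concat(pieces, names=['id_1', 'id_2'])
print(filled)
```

Because ffill and bfill run inside each group here, values can never leak from one (id_1, id_2) pair into the next.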
You can use reindex() to get the missing months:
# create the MultiIndex based on the existing df.index.levels
midx = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
# run reindex() with the new index and then fix the NaNs in the `value` column
df1 = df.reindex(midx).fillna({'value':0})
df1
# Out[41]:
#                                    value      data_1 data_2  data_3  data_4
# id_1     id_2  effective_date
# ADH10685 CA1P0 2018-07-31       0.000048  17901701.0   3mra  Actual  198.00
#                2018-08-31       0.000048  17901701.0   3mra  Actual  198.00
#                2018-09-30       0.000000         NaN    NaN     NaN     NaN
#                2018-10-31       0.000000         NaN    NaN     NaN     NaN
#                2018-11-30       0.000000         NaN    NaN     NaN     NaN
#                2018-12-31       0.000000         NaN    NaN     NaN     NaN
#          CB0N0 2018-07-31       4.010784  17901701.0   3mra  Actual    0.01
#                2018-08-31       2.044298  17901701.0   3mra  Actual    0.01
#                2018-09-30       0.000000         NaN    NaN     NaN     NaN
#                2018-10-31      11.493831  17901701.0   3mra  Actual    0.01
#                2018-11-30      13.929844  17901701.0   3mra  Actual    0.01
#                2018-12-31      21.500490  17901701.0   3mra  Actual    0.01
#          CB0P0 2018-07-31      22.389493  17901701.0   3mra  Actual    0.03
#                2018-08-31      23.600726  17901701.0   3mra  Actual    0.03
#                2018-09-30      45.105458  17901701.0   3mra  Actual    0.03
#                2018-10-31      32.249056  17901701.0   3mra  Actual    0.03
#                2018-11-30      60.790889  17901701.0   3mra  Actual    0.03
#                2018-12-31      46.832914  17901701.0   3mra  Actual    0.03
# select columns except the 'value' column
cols = df1.columns.difference(['value'])
# forward-fill the selected columns within each group on level=[0,1]
df1.loc[:,cols] = df1.loc[:,cols].groupby(level=[0,1]).transform('ffill')
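One caveat with MultiIndex.from_product(df.index.levels): it can only produce dates that already occur somewhere in the data, and it pairs every id_1 with every id_2, whether or not that pair exists. The sketch below is one possible variation, not part of the answer above. It builds the target index from the pairs that actually occur and an explicit date range, and it also back-fills so that a group whose first month is missing still gets its other columns. The names `toy`, `pairs`, `dates` and `out` are hypothetical.

```python
import pandas as pd

# Hypothetical data: no group has September, and ('A', 'y') / ('B', 'x') never occur
toy = pd.DataFrame({
    'id_1': ['A', 'A', 'B'],
    'id_2': ['x', 'x', 'y'],
    'effective_date': pd.to_datetime(['2018-08-31', '2018-10-31', '2018-07-31']),
    'value': [1.5, 2.5, 4.0],
    'data_2': ['3mra', '3mra', 'abc'],
}).set_index(['id_1', 'id_2', 'effective_date'])

# Product of the existing levels: 2 * 2 * 3 rows, without September
naive = pd.MultiIndex.from_product(toy.index.levels, names=toy.index.names)

# Existing pairs crossed with an explicit month-end range
dates = pd.date_range('2018-07-31', '2018-10-31', freq=pd.offsets.MonthEnd())
pairs = toy.index.droplevel('effective_date').unique()
midx = pd.MultiIndex.from_tuples(
    [(id_1, id_2, d) for (id_1, id_2) in pairs for d in dates],
    names=toy.index.names,
)

out = toy.reindex(midx).fillna({'value': 0})
cols = list(out.columns.difference(['value']))
# Fill the non-value columns forward, then backward, within each pair
out[cols] = (out.groupby(level=['id_1', 'id_2'])[cols]
                .transform(lambda s: s.ffill().bfill()))
print(out)
```

On the sample data in the question this makes no difference, because CB0P0 already contains every month and there is only one id_1.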