Apply custom cumulative function to pandas dataframe
I have a dataframe sorted by date:
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
# DataFrame.sort was removed from pandas; sort_values replaces it
df = df.sort_values('date')
print(df)
date idx pct_val val
3 2016-04-30 2 NaN 10
0 2016-04-30 1 NaN 10
4 2016-05-31 2 -10 0
1 2016-05-31 1 -10 0
5 2016-06-31 2 -10 0
2 2016-06-31 1 NaN 5
And I want to group by idx, then apply a cumulative function with some simple logic. If pct_val is null, add val to the running total; otherwise multiply the running total by 1 + pct_val/100. 'cumsum' shows the result of df.groupby('idx').val.cumsum() and 'cumulative_func' is the result I want.
date idx pct_val val cumsum cumulative_func
3 2016-04-30 2 NaN 10 10 10
0 2016-04-30 1 NaN 10 10 10
4 2016-05-31 2 -10 0 10 9
1 2016-05-31 1 -10 0 10 9
5 2016-06-31 2 -10 0 10 8
2 2016-06-31 1 NaN 5 15 14
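To make the rule concrete, here is a minimal plain-Python sketch of it for a single group. The name running_total is a hypothetical helper, not part of pandas. Under the rule as stated, idx 1 comes out as roughly 10, 9, 14 and idx 2 as roughly 10, 9, 8.1, so the 8 in the table above is presumably 8.1 with the decimal dropped.

```python
import math

def running_total(vals, pcts):
    """Hypothetical helper: apply the rule row by row for one group."""
    total = 0.0
    out = []
    for v, p in zip(vals, pcts):
        if p is None or math.isnan(p):
            # pct_val is null: add val to the running total
            total += v
        else:
            # pct_val is set: scale the running total by 1 + pct_val/100
            total *= 1 + p / 100
        out.append(total)
    return out

group1 = running_total([10, 0, 5], [None, -10, None])   # rows with idx == 1
group2 = running_total([10, 0, 0], [None, -10, -10])    # rows with idx == 2
print(group1)
print(group2)
```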
Any idea if there is a way to apply a custom cumulative function to a dataframe, or a better way to achieve this?
I don't believe there is an easy way to accomplish your objective using vectorization. I would first try to get something working, and then optimize for speed if required.
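import numpy as np
import pandas as pd

def cumulative_func(df):
    results = []
    # .groups maps each idx value to the index labels of its rows
    for group in df.groupby('idx').groups.values():
        total = 0
        result = []
        # walk the group's rows in order, carrying the running total
        for p, v in df.loc[group, ['pct_val', 'val']].values:
            if np.isnan(p):
                total += v
            else:
                total *= (1 + .01 * p)
            result.append(total)
        results.append(pd.Series(result, index=group))
    # restore the original row order
    return pd.concat(results).reindex(df.index)

df['cumulative_func'] = cumulative_func(df)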
>>> df
date idx pct_val val cumulative_func
3 2016-04-30 2 NaN 10 10.0
0 2016-04-30 1 NaN 10 10.0
4 2016-05-31 2 -10 0 9.0
1 2016-05-31 1 -10 0 9.0
5 2016-06-31 2 -10 0 8.1
2 2016-06-31 1 NaN 5 14.0
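If the loop ever becomes a bottleneck, one possible route is to treat the rule as a linear recurrence, total = a * previous_total + b, where a is 1 + pct_val/100 (or 1 when pct_val is null) and b is val (or 0 when pct_val is set). Dividing through by the running product A = cumprod(a) turns it into a plain cumulative sum. The sketch below is only an illustration under that assumption: the column name cum_vectorized is made up here, the approach divides by A so it breaks down if any pct_val equals -100, and it can lose precision on long series.

```python
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])

# Per-row multiplier a and addend b, so that total = a * previous_total + b
a = (1 + df['pct_val'] / 100).fillna(1.0)      # 1 where pct_val is null
b = df['val'].where(df['pct_val'].isna(), 0)   # val only where pct_val is null

# Solve the recurrence within each idx group:
# total = A * cumsum(b / A), with A = cumprod(a)
A = a.groupby(df['idx']).cumprod()
df['cum_vectorized'] = A * (b / A).groupby(df['idx']).cumsum()
print(df)
```

The result should agree with the loop version up to floating-point rounding, so the loop remains the reference to check against.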
First I cleaned up your setup:
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])
print(df)
Looks like:
date idx pct_val val
0 2016-04-30 1 NaN 10
3 2016-04-30 2 NaN 10
1 2016-05-31 1 -10.0 0
4 2016-05-31 2 -10.0 0
2 2016-06-31 1 NaN 5
5 2016-06-31 2 -10.0 0
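def cumcustom(df):
    # work on a copy so the caller's group is not modified
    df = df.copy()
    running_total = 0
    for idx, row in df.iterrows():
        if pd.isnull(row['pct_val']):
            # no percentage given: add val to the running total
            running_total += row['val']
        else:
            # percentage given: scale the running total
            running_total *= row['pct_val'] / 100. + 1
        df.loc[idx, 'cumcustom'] = running_total
    return df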
Then apply:
df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])
Looks like:
date idx pct_val val cumcustom
0 2016-04-30 1 NaN 10 10.0
3 2016-04-30 2 NaN 10 10.0
1 2016-05-31 1 -10.0 0 9.0
4 2016-05-31 2 -10.0 0 9.0
2 2016-06-31 1 NaN 5 14.0
5 2016-06-31 2 -10.0 0 8.1
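Because the behaviour of groupby(...).apply with a function that returns a whole frame may differ between pandas versions (group keys in the index, handling of the grouping column), here is a sketch of the same idea that iterates over the groups and assigns the result by index label instead. The helper name group_running_total is hypothetical; the logic is the same running-total rule as cumcustom.

```python
import pandas as pd

def group_running_total(group):
    """Hypothetical helper: running total for one idx group, in row order."""
    total = 0.0
    out = []
    for pct, val in zip(group['pct_val'], group['val']):
        if pd.isna(pct):
            total += val               # null pct_val: add val
        else:
            total *= 1 + pct / 100.0   # otherwise scale the running total
        out.append(total)
    # keep the group's index labels so the result aligns with df
    return pd.Series(out, index=group.index)

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])

# One Series per group; column assignment aligns on the index labels
parts = [group_running_total(g) for _, g in df.groupby('idx')]
df['cumcustom'] = pd.concat(parts)
print(df)
```

Assigning by index label keeps the original row order and labels, so no reset_index or second sort is needed.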