Applying a custom cumulative function to a pandas DataFrame

Question

I have a dataframe sorted by date:

import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
# DataFrame.sort() was removed in pandas 0.20; sort_values() replaces it
df = df.sort_values('date')
print(df)

         date  idx  pct_val  val
3  2016-04-30    2      NaN   10
0  2016-04-30    1      NaN   10
4  2016-05-31    2      -10    0
1  2016-05-31    1      -10    0
5  2016-06-31    2      -10    0
2  2016-06-31    1      NaN    5

I want to group by idx and then apply a cumulative function with some simple logic: if pct_val is null, add val to the running total; otherwise, multiply the running total by 1 + pct_val/100. In the table below, 'cumsum' shows the result of df.groupby('idx').val.cumsum() and 'cumulative_func' is the result I want.

         date  idx  pct_val  val  cumsum  cumulative_func
3  2016-04-30    2      NaN   10      10               10
0  2016-04-30    1      NaN   10      10               10
4  2016-05-31    2      -10    0      10                9
1  2016-05-31    1      -10    0      10                9
5  2016-06-31    2      -10    0      10                8
2  2016-06-31    1      NaN    5      15               14

Is there a way to apply a custom cumulative function to a dataframe, or a better way to achieve this?

Answer 1

I don't believe there is an easy way to accomplish your objective using vectorization. I would first try to get something working, and then optimize for speed if required.

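import numpy as np
import pandas as pd

def cumulative_func(df):
    results = []
    # .groups maps each idx value to the index labels of its rows
    for group in df.groupby('idx').groups.values():
        total = 0
        result = []
        # .loc replaces the removed .ix indexer
        for p, v in df.loc[group, ['pct_val', 'val']].values:
            if np.isnan(p):
                total += v  # no percentage: add val to the running total
            else:
                total *= (1 + .01 * p)  # otherwise scale the running total
            result.append(total)
        results.append(pd.Series(result, index=group))
    # put the per-group results back in the original row order
    return pd.concat(results).reindex(df.index)

df['cumulative_func'] = cumulative_func(df)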

>>> df
         date  idx  pct_val  val  cumulative_func
3  2016-04-30    2      NaN   10             10.0
0  2016-04-30    1      NaN   10             10.0
4  2016-05-31    2      -10    0              9.0
1  2016-05-31    1      -10    0              9.0
5  2016-06-31    2      -10    0              8.1
2  2016-06-31    1      NaN    5             14.0
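
If speed does become a concern, one possible vectorized variant is sketched below. It is not part of the loop above, and it rests on an assumption: the rule can be written as total = a * previous_total + b, where a is 1 on "add" rows and 1 + pct_val/100 otherwise, and b is val on "add" rows and 0 otherwise. The names is_add, a, b and A are made up for this illustration. The sketch divides by the cumulative product of a, so it only applies when pct_val is never -100, and it may lose precision on long groups.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})

is_add = df['pct_val'].isna()

# a: factor applied to the previous total (1.0 on "add" rows)
a = (1 + df['pct_val'] / 100).where(~is_add, 1.0)
# b: amount added on this row (0.0 on "multiply" rows)
b = df['val'].astype(float).where(is_add, 0.0)

# Per-group cumulative product of the factors. Assumes no factor is zero.
A = a.groupby(df['idx']).cumprod()

# Closed form of the recurrence: total_t = A_t * sum_{s <= t} (b_s / A_s)
df['cumulative_func'] = A * (b / A).groupby(df['idx']).cumsum()
print(df)
```

Because groupby cumprod and cumsum return results aligned to the original index, no re-sorting is needed afterwards. It is worth checking the result against the loop version on real data before relying on it.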

Answer 2

First I cleaned up your setup.

Setup

import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
# sort by date, breaking ties by idx so the row order is deterministic
df = df.sort_values(['date', 'idx'])
print(df)

Looks like:

         date  idx  pct_val  val
0  2016-04-30    1      NaN   10
3  2016-04-30    2      NaN   10
1  2016-05-31    1    -10.0    0
4  2016-05-31    2    -10.0    0
2  2016-06-31    1      NaN    5
5  2016-06-31    2    -10.0    0

Solution

def cumcustom(df):
    df = df.copy()  # work on a copy so the caller's group is not modified
    running_total = 0
    for idx, row in df.iterrows():
        # plain label lookup replaces the removed .ix indexer
        if pd.isnull(row['pct_val']):
            running_total += row['val']
        else:
            running_total *= row['pct_val'] / 100. + 1
        df.loc[idx, 'cumcustom'] = running_total
    return df

Then apply

df.groupby('idx').apply(cumcustom).reset_index(drop=True).sort_values(['date', 'idx'])

Looks like:

         date  idx  pct_val  val  cumcustom
0  2016-04-30    1      NaN   10       10.0
3  2016-04-30    2      NaN   10       10.0
1  2016-05-31    1    -10.0    0        9.0
4  2016-05-31    2    -10.0    0        9.0
2  2016-06-31    1      NaN    5       14.0
5  2016-06-31    2    -10.0    0        8.1
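
A variant that avoids both iterrows and the index reset might look like the sketch below. It is only an illustration of the same running-total rule: the helper name running_total is hypothetical, and the approach assumes each group's rows are already in date order. Each group produces a Series keyed by the original row labels, so assigning the concatenated result back relies on index alignment and leaves the frame's index untouched.

```python
import pandas as pd

df = pd.DataFrame({'idx': [1, 1, 1, 2, 2, 2],
                   'date': ['2016-04-30', '2016-05-31', '2016-06-31',
                            '2016-04-30', '2016-05-31', '2016-06-31'],
                   'val': [10, 0, 5, 10, 0, 0],
                   'pct_val': [None, -10, None, None, -10, -10]})
df = df.sort_values(['date', 'idx'])


def running_total(group):
    """Return the custom running total for one idx group as a Series."""
    total = 0.0
    out = []
    # zip over the two columns instead of building a Series per row
    for p, v in zip(group['pct_val'], group['val']):
        if pd.isna(p):
            total += v            # no percentage: add val
        else:
            total *= 1 + p / 100.0  # percentage: scale the running total
        out.append(total)
    return pd.Series(out, index=group.index)


# One Series per group, keyed by the original row labels
pieces = [running_total(g) for _, g in df.groupby('idx')]

# Assignment aligns on the index, so no re-sorting is required
df['cumcustom'] = pd.concat(pieces)
print(df)
```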
