Apply a custom rolling function to a pandas DataFrame with a datetime index

Question

I have a pandas DataFrame on which I want to apply my own custom rolling function, as follows:

import numpy as np
import pandas as pd

def testms(x, field):
    mu = np.sum(x[field])
    si = np.sum(x[field])/len(x[field])
    x['mu'] = mu
    x['si'] = si
    return x

# random_dates is a helper that is not shown in the question
df2 = pd.concat([pd.DataFrame({'A':[1,1,1,1,1,2,2,2,2,2]}),
      pd.DataFrame({'B':random_dates(pd.to_datetime('2015-01-01'),
        pd.to_datetime('2018-01-01'), 10)}),
      pd.DataFrame({'C':np.random.rand(10)})],axis=1)

df2
   A                             B         C
0  1 2016-08-25 01:09:42.953011200  0.791725
1  1 2017-02-23 13:30:20.296310399  0.528895
2  1 2016-10-23 05:33:14.994806400  0.568045
3  1 2016-08-20 17:41:03.991027200  0.925597
4  1 2016-04-09 17:59:00.805200000  0.071036
5  2 2016-12-09 13:06:00.751737600  0.087129
6  2 2016-04-25 00:47:45.953232000  0.020218
7  2 2017-09-05 06:35:58.432531200  0.832620
8  2 2017-11-23 03:18:47.370528000  0.778157
9  2 2016-02-25 15:14:53.907532800  0.870012

tester = lambda x: testms(x, 'C')
df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1).apply(tester).reset_index()

However, when I apply the above code, I get the following error:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

Answer 1

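Rolling.apply works differently from GroupBy.apply: it processes each column separately, and the function must return a single scalar per window, so it cannot return multiple columns. The function receives only the values of one window (a NumPy array when raw=True), not a DataFrame, which is why x[field] in the question raises the IndexError.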
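
To make this concrete, here is a minimal sketch on hypothetical toy data (the names toy, window_sum and seen_types are mine, not from the question). It records what Rolling.apply hands to the function for raw=False and raw=True, and shows that indexing a NumPy array with a column name fails with an IndexError like the one in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column on a sorted daily datetime index.
idx = pd.date_range("2020-01-01", periods=4, freq="D")
toy = pd.DataFrame({"C": [1.0, 2.0, 3.0, 4.0]}, index=idx)

seen_types = []

def window_sum(window):
    # Record what pandas passes in, then return exactly one scalar.
    seen_types.append(type(window).__name__)
    return float(np.sum(window))

# raw=False passes each window as a Series, raw=True as a NumPy array.
as_series = toy["C"].rolling("2D", min_periods=1).apply(window_sum, raw=False)
as_array = toy["C"].rolling("2D", min_periods=1).apply(window_sum, raw=True)

# Indexing a plain array with a column name fails, as in the question.
try:
    np.array([1.0, 2.0])["C"]
    string_index_error = None
except IndexError as exc:
    string_index_error = type(exc).__name__

print(sorted(set(seen_types)))
print(as_series.tolist())
print(string_index_error)
```

In both modes the function sees the values of a single column only, so there is no 'C' key to look up and no way to attach extra columns to the result.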

So your solution needs two functions. The column cannot be specified inside the function; instead, the column to process is selected after groupby:

def testms1(x):
    mu = np.sum(x)
    return mu

def testms2(x):
    # same as the mean:
    # si = np.sum(x) / len(x)
    si = np.mean(x)
    return si


tester1 = lambda x: testms1(x)
tester2 = lambda x: testms2(x)
r = df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1)
s1 = r.apply(tester1, raw=False).rename('mu')
s2 = r.apply(tester2, raw=False).rename('si')
df = pd.concat([s1, s2], axis=1).reset_index()
print (df)
   A                             B        mu        si
0  1 2016-08-25 01:09:42.953011200  0.791725  0.791725
1  1 2017-02-23 13:30:20.296310399  0.528895  0.528895
2  1 2016-10-23 05:33:14.994806400  1.096940  0.548470
3  1 2016-08-20 17:41:03.991027200  2.022537  0.674179
4  1 2016-04-09 17:59:00.805200000  2.093573  0.523393
5  2 2016-12-09 13:06:00.751737600  0.087129  0.087129
6  2 2016-04-25 00:47:45.953232000  0.107347  0.053673
7  2 2017-09-05 06:35:58.432531200  0.832620  0.832620
8  2 2017-11-23 03:18:47.370528000  1.610777  0.805389
9  2 2016-02-25 15:14:53.907532800  2.480789  0.826930

Alternative solution with Rolling.aggregate, which computes both statistics in one call:

r = df2.set_index('B').groupby('A')['C'].rolling('90D', min_periods=1)
df1 = r.agg(['sum','mean']).rename(columns={'sum':'mu', 'mean':'si'}).reset_index()
print (df1)
   A                             B        mu        si
0  1 2016-08-25 01:09:42.953011200  0.791725  0.791725
1  1 2017-02-23 13:30:20.296310399  0.528895  0.528895
2  1 2016-10-23 05:33:14.994806400  1.096940  0.548470
3  1 2016-08-20 17:41:03.991027200  2.022537  0.674179
4  1 2016-04-09 17:59:00.805200000  2.093573  0.523393
5  2 2016-12-09 13:06:00.751737600  0.087129  0.087129
6  2 2016-04-25 00:47:45.953232000  0.107347  0.053673
7  2 2017-09-05 06:35:58.432531200  0.832620  0.832620
8  2 2017-11-23 03:18:47.370528000  1.610777  0.805389
9  2 2016-02-25 15:14:53.907532800  2.480789  0.826930

Answer 1
1 2019-11-10 06:59:40