Rolling sum in subgroups of a dataframe (pandas)

I have a sessions dataframe that contains E-mail and Sessions (int) columns.

I need to calculate a rolling sum of sessions per email (i.e., not globally).

Now, the following works, but it's painfully slow:

emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
    email_sessions = sessions[sessions['E-mail'] == em]
    email_sessions.is_copy = False
    email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
    ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)

Is there a way of achieving the same in pandas, but using pandas operators on the dataframe instead of creating a separate dataframe for each email and then concatenating them?

(Either that, or some other way of making this faster.)

Say you start with

In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})

In [59]: df
Out[59]: 
  E-Mail  Session
0    foo        0
1    foo        1
2    foo        2
3    bar        3
4    bar        4
5    bar        5
6    foo        6
7    foo        7
8    foo        8

In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]: 
          Session
E-Mail           
bar    3      NaN
       4      NaN
       5     12.0
foo    0      NaN
       1      NaN
       2      3.0
       6      9.0
       7     15.0
       8     21.0

Incidentally, note that I just rearranged your rolling_sum call, but that function has been deprecated (and was later removed from pandas); you should now use rolling:

df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())
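If the goal is a result that lines up row for row with the original frame (as in the question's loop), one option is groupby(...).transform. The following is a minimal sketch on the same toy data; the variable name rolled is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3,
                   'Session': range(9)})

# transform applies the function to each group and returns a Series
# indexed like the original frame, so the rows stay in their original order
rolled = df.groupby('E-Mail')['Session'].transform(lambda s: s.rolling(3).sum())

print(rolled)
```

Because the index is unchanged, rolled can be assigned straight back to df as a new column.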

Setup

import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
                   'Session': np.random.randint(1, 10, 20)})

Solution

The current, proper way to do this is with rolling(...).sum(), which can be called on a pd.Series group-by object.

#      Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
#                            \--------------/
#                             Method you want

E-Mail    
A       0      NaN
        2      NaN
        4     11.0
        5      7.0
        7     10.0
        12    16.0
        15    16.0
        17    16.0
        18    17.0
        19    18.0
B       1      NaN
        3      NaN
        6     18.0
        8     14.0
        9     16.0
        10    12.0
        11    13.0
        13    16.0
        14    20.0
        16    22.0
Name: Session, dtype: float64
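The result above carries a two-level index (group key, then original row label). To store it as a column on the original frame, as the question's loop did, the group level can be dropped first. A minimal sketch, assuming the small foo/bar frame from the first answer and the question's column name Session_Rolling_Sum:

```python
import pandas as pd

df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3,
                   'Session': range(9)})

# Rolling sum per group; the result is indexed by (E-Mail, original row label)
rolled = df.groupby('E-Mail')['Session'].rolling(3).sum()

# Drop the E-Mail level so the index matches df again, then fill the
# incomplete windows with 0, as the question's code did with fillna(0)
df['Session_Rolling_Sum'] = rolled.reset_index(level=0, drop=True).fillna(0)

print(df)
```

Assignment aligns on the index, so each row receives the rolling sum computed within its own E-Mail group, regardless of how the grouped result is ordered.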

Details

df

   E-Mail  Session
0       A        9
1       B        7
2       A        1
3       B        3
4       A        1
5       A        5
6       B        8
7       A        4
8       B        3
9       B        5
10      B        4
11      B        4
12      A        7
13      B        8
14      B        8
15      A        5
16      B        6
17      A        4
18      A        8
19      A        6
