Rolling sum in subgroups of a dataframe (pandas)
I have a sessions dataframe that contains E-mail and Sessions (int) columns.
I need to calculate a rolling sum of sessions per email (i.e., not globally).
Now, the following works, but it's painfully slow:
emails = set(list(sessions['E-mail']))
ses_sums = []
for em in emails:
    email_sessions = sessions[sessions['E-mail'] == em]
    email_sessions.is_copy = False
    email_sessions['Session_Rolling_Sum'] = pd.rolling_sum(email_sessions['Sessions'], window=self.window).fillna(0)
    ses_sums.append(email_sessions)
df = pd.concat(ses_sums, ignore_index=True)
Is there a way of achieving the same in pandas, but using pandas operators on a dataframe instead of creating separate dataframes for each email and then concatenating them?
(Either that, or some other way of making this faster.)
Say you start with:
In [58]: df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3, 'Session': range(9)})
In [59]: df
Out[59]:
  E-Mail  Session
0    foo        0
1    foo        1
2    foo        2
3    bar        3
4    bar        4
5    bar        5
6    foo        6
7    foo        7
8    foo        8
In [60]: df[['Session']].groupby(df['E-Mail']).apply(pd.rolling_sum, 3)
Out[60]:
          Session
E-Mail
bar    3      NaN
       4      NaN
       5     12.0
foo    0      NaN
       1      NaN
       2      3.0
       6      9.0
       7     15.0
       8     21.0
Incidentally, note that I just rearranged your rolling_sum call, but that function has been deprecated (and removed from later pandas releases); you should now use rolling:
df[['Session']].groupby(df['E-Mail']).apply(lambda g: g.rolling(3).sum())
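If the goal is a new column on the original frame, as in the question, one possible sketch is to use transform, which returns a result aligned with the original index. The column name Session_Rolling_Sum and the fillna(0) are taken from the question; the window of 3 is just the value used in this example.

```python
import pandas as pd

# Same toy frame as in the example above
df = pd.DataFrame({'E-Mail': ['foo'] * 3 + ['bar'] * 3 + ['foo'] * 3,
                   'Session': range(9)})

# transform returns a Series aligned with df's original index,
# so it can be assigned directly as a new column
df['Session_Rolling_Sum'] = (
    df.groupby('E-Mail')['Session']
      .transform(lambda s: s.rolling(3).sum())
      .fillna(0)  # mirrors the fillna(0) used in the question
)

print(df)
```

Rows keep their original order here, so no concatenation or re-sorting step is needed.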
Setup
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame({'E-Mail': np.random.choice(list('AB'), 20),
                   'Session': np.random.randint(1, 10, 20)})
Solution
The current and proper way to do this is with rolling.sum, which can be used on a pd.Series group by object.
#      Series Group By
# /------------------------\
df.groupby('E-Mail').Session.rolling(3).sum()
#                            \--------------/
#                             Method you want
E-Mail
A       0      NaN
        2      NaN
        4     11.0
        5      7.0
        7     10.0
        12    16.0
        15    16.0
        17    16.0
        18    17.0
        19    18.0
B       1      NaN
        3      NaN
        6     18.0
        8     14.0
        9     16.0
        10    12.0
        11    13.0
        13    16.0
        14    20.0
        16    22.0
Name: Session, dtype: float64
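The result above is indexed by (E-Mail, original row label). To attach it to the original frame as a column, one option is to drop the group level so the index lines up again. This is a minimal sketch on a small made-up frame; the data, the window of 2, and the column name Session_Rolling_Sum are only for illustration.

```python
import pandas as pd

# Small illustrative frame with interleaved groups (made-up values)
df = pd.DataFrame({'E-Mail': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'Session': [1, 10, 2, 20, 3, 30]})

rolled = df.groupby('E-Mail').Session.rolling(2).sum()

# rolled has a two-level index (E-Mail, original row label);
# dropping the first level leaves the original row labels,
# so the assignment aligns row by row
df['Session_Rolling_Sum'] = rolled.reset_index(level=0, drop=True)

print(df)
```

The first row of each group stays NaN because its window is incomplete; chain .fillna(0) if zeros are preferred, as in the question.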
Details
df
   E-Mail  Session
0       A        9
1       B        7
2       A        1
3       B        3
4       A        1
5       A        5
6       B        8
7       A        4
8       B        3
9       B        5
10      B        4
11      B        4
12      A        7
13      B        8
14      B        8
15      A        5
16      B        6
17      A        4
18      A        8
19      A        6
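As a sanity check, the grouped rolling sums can be verified by hand against this frame. The sketch below rebuilds the frame from the values printed above, rather than relying on the random seed, and looks up the first complete window of each group.

```python
import pandas as pd

# Values copied from the frame printed above
emails = list('ABABAABABBBBABBABAAA')
sessions = [9, 7, 1, 3, 1, 5, 8, 4, 3, 5, 4, 4, 7, 8, 8, 5, 6, 4, 8, 6]
df = pd.DataFrame({'E-Mail': emails, 'Session': sessions})

result = df.groupby('E-Mail').Session.rolling(3).sum()

# Group A's first full window covers rows 0, 2, 4: 9 + 1 + 1
print(result.loc[('A', 4)])
# Group B's first full window covers rows 1, 3, 6: 7 + 3 + 8
print(result.loc[('B', 6)])
```

Each window only ever spans rows from the same E-Mail group, which is the per-email behaviour the question asks for.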