How to speed up a Python for loop
I have the following function, where df is a pandas DataFrame with 159538 rows x 3 columns:
dfs = []
for i in df['email_address']:
    data = df[df['email_address'] == i]
    data['difference'] = data['ts_placed'].diff().astype('timedelta64[D]')
    repeat = []
    for a in data['difference']:
        if a > 10:
            repeat.append(0)
        elif a <= 10:
            repeat.append(1)
        else:
            repeat.append(0)
    data['repeat'] = repeat
    dfs.append(data)
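This function runs extremely slowly. I would like to speed it up using multiprocessing. This SO question shows how to do it in R. What is the equivalent code in Python?
Here is a sample of the data after running: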
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
df
Out[6]:
email_address ts_placed difference
0 aaaaaaaaaaaaa@sky.com 2015-08-06 00:00:34 NaT
1 dfdfdfdfdfd@babcock.co.uk 2015-08-06 00:05:38 NaT
2 littlemifddreen85@hotmail.co.uk 2015-08-06 00:09:20 NaT
3 smifdfddfms@aol.com 2015-08-06 00:10:01 NaT
4 terry.wfdfdfdfdfy-holdings.co.uk 2015-08-06 00:14:00 NaT
5 r.dfdfdfdfd16@hotmail.com 2015-08-06 00:14:00 NaT
6 kdfdfdf979@outlook.com 2015-08-06 00:14:00 NaT
7 dd@ggggggggggg.eclipse.co.uk 2015-08-06 00:14:20 NaT
8 gggz45@hotmail.co.uk 2015-08-06 00:14:43 NaT
9 gggggggggi@hotmail.co.uk 2015-08-06 00:17:03 NaT
10 mggggggggyke1@hotmail.com 2015-08-06 00:17:58 NaT
...
22 ffdddfddd@yahoo.com 2015-08-06 00:46:12 0 days 00:04:15
IIUC, you can do the following:
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
# 1 if the gap since the same address's previous row is at most 10 whole days, else 0 (NaT gives 0)
df['repeat'] = (df['difference'].dt.days <= 10).astype(int)
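To make the groupby approach concrete, here is a minimal, self-contained sketch. The sample rows are hypothetical and only the column names come from the question. It assumes the rows are already in time order, as in the sample output above, and that the intended rule is the one in the question's loop: `repeat` is 1 when the gap since the same address's previous row is at most 10 whole days, and 0 otherwise (including the first row per address). Note that the original loop iterates over every row rather than every unique address and filters the whole frame each time, which is likely why it is so slow; a single vectorized pass may remove the need for multiprocessing.

```python
import pandas as pd

# Hypothetical sample data; only the column names are taken from the question.
df = pd.DataFrame({
    "email_address": ["a@x.com", "b@x.com", "a@x.com", "b@x.com", "a@x.com"],
    "ts_placed": pd.to_datetime([
        "2015-08-01 00:00:00",
        "2015-08-01 00:00:00",
        "2015-08-03 12:00:00",
        "2015-08-20 00:00:00",
        "2015-08-30 00:00:00",
    ]),
})

# Time since the previous row with the same email address (NaT for the first one).
df["difference"] = df.groupby("email_address")["ts_placed"].diff()

# .dt.days gives whole days (NaN for NaT); NaN <= 10 is False, so first rows get 0.
df["repeat"] = (df["difference"].dt.days <= 10).astype(int)

print(df)
```

If the rows are not guaranteed to be in chronological order, sorting by `ts_placed` before calling `diff()` would be needed for the gaps to be meaningful.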