
Pandas groupby and transform takes a long time

Given a DataFrame similar to this (but with over a million rows and about 140000 different groups):

df_test = pd.DataFrame({'group': {1:'A', 2:'A', 3:'A', 4:'A', 5:'B', 6:'B'},
                        'time' : {1:1,   2:3,   3:5,   4:23,  5: 7,  6: 12}})

For each group, I want to find the difference between the time (which is actually a dtype('<M8[ns]') in my real df) and the minimum time for that group.

I have managed it using groupby and transform as follows:

df_test['time_since'] = df_test.groupby('group')['time'].transform(lambda d: d - d.min())

which correctly produces:

    group   time    time_since
1   A       1       0
2   A       3       2
3   A       5       4
4   A       23      22
5   B       7       0
6   B       12      5

but it takes almost a minute to compute. Is there a faster / smarter way to do this?
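For reference, a minimal sketch with a hypothetical datetime64[ns] column (the dtype mentioned for the real DataFrame); the same transform pattern applies, but the result comes back as timedelta64[ns]:

import pandas as pd

# Hypothetical example data with a real datetime column (dates chosen arbitrarily)
df_dt = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B'],
    'time': pd.to_datetime(['2020-01-01', '2020-01-03', '2020-01-05',
                            '2020-01-23', '2020-01-07', '2020-01-12']),
})

# Same pattern as above; with datetime64 input the result dtype is timedelta64[ns]
df_dt['time_since'] = df_dt.groupby('group')['time'].transform(lambda d: d - d.min())
print(df_dt)

If a numeric column is needed afterwards, the timedelta result can be converted, e.g. with .dt.total_seconds().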

My suggestion: do the calculation outside the transform, so no lambda is needed. With the lambda, the calculation is run once per group in Python, so the cost grows with the number of groups; without it, the subtraction becomes a single vectorized operation over the whole column.

df_test=pd.concat([df_test]*1000)
%timeit df_test['time']-df_test.groupby('group')['time'].transform(min)
1000 loops, best of 3: 1.11 ms per loop
%timeit df_test.groupby('group')['time'].transform(lambda d: d - d.min())
The slowest run took 7.20 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.3 ms per loop
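Put together, a hedged sketch of this approach (column names taken from the question): use the built-in groupwise min via transform, then subtract outside the groupby in one vectorized step.

import pandas as pd

df_test = pd.DataFrame({'group': {1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B'},
                        'time':  {1: 1,   2: 3,   3: 5,   4: 23,  5: 7,   6: 12}})

# transform('min') uses pandas' built-in groupwise min (no Python-level lambda per group);
# the subtraction is then a single vectorized operation over the whole column.
group_min = df_test.groupby('group')['time'].transform('min')
df_test['time_since'] = df_test['time'] - group_min
print(df_test)

With a datetime64[ns] time column, group_min is also datetime64[ns] and the difference is timedelta64[ns], as in the earlier sketch.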
