
Pandas dataframe join groupby speed up

I am adding some columns to a dataframe based on the grouping of other columns. I do some grouping, counting, and finally join the results back to the original dataframe.

The full data has 1M rows; I first tried the approach with 20k rows and it worked OK. The data has one entry for each item a customer added to an order.

Here is some sample data:

import numpy as np
import pandas as pd
data = np.array([[101,201,301],[101,201,302],[101,201,303],[101,202,301],[101,202,302],[101,203,301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id','item_id'])
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']

For the sample data above, the desired output is:

| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
|-------------|----------|---------|----------------------------|-----------------|
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |

The piece of the code that works relatively fast even with 1M rows is:

df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
          ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']

But a similar join takes quite some time (~a couple of hours):

df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
       ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']

I am hoping that there is a smarter way to get the same aggregate value. I understand why the second case takes longer, since the number of groups increases quite a bit. Thank you.

OK, I can see what you are trying to achieve. On this sample size the following is over 2x faster, and I think it is likely to scale much better too: instead of joining/merging the result of your groupby back to your original df, just call transform:

In [24]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1
In [26]:


%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
   customer_id  order_id  item_id  total_nitems_user_lifetime  nitems_in_order
0          101       201      301                           6                3
1          101       201      302                           6                3
2          101       201      303                           6                3
3          101       202      301                           6                2
4          101       202      302                           6                2
5          101       203      301                           6                1

Interestingly when I try this on a 600,000 row df:

In [34]:

%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:

%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop

My first transform is about 25% faster than the corresponding join, but the second transform is actually slower than your method. I think it's worth trying on your real data to see if it yields any speed improvement.
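
For reference, a synthetic frame of roughly that size can be built along these lines; the id ranges below are my own assumptions (chosen so each customer has many rows and each order has a few), not the data actually used above:

import numpy as np
import pandas as pd

n = 600000
np.random.seed(0)
# assumed cardinalities: ~50k customers, ~200k orders, ~1k items
big = pd.DataFrame({'customer_id': np.random.randint(0, 50000, n),
                    'order_id': np.random.randint(0, 200000, n),
                    'item_id': np.random.randint(0, 1000, n)})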

If we combine the column creations so that it's on a single line:

In [40]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'),  df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:

%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
      ['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
   ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop

We can see that my combined code is marginally faster than yours, so there's not much saved by doing this. Normally you can apply multiple aggregation functions to a single groupby so that you get back multiple columns, but the problem here is that you are grouping by different columns, so we have to perform two expensive groupby operations.
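
For illustration, this is what multiple aggregations from a single groupby would look like if both quantities shared the same key; the nunique aggregate below is just an illustrative choice, not something the question asks for:

# one pass over the data, several aggregates per customer
per_customer = df.groupby('customer_id')['order_id'].agg(['count', 'nunique'])
# -> one row per customer_id with columns 'count' (items bought)
#    and 'nunique' (distinct orders)

Here that trick does not apply, because nitems_in_order needs a groupby keyed on order_id instead of customer_id.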

Original approach, with 1M rows:

df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
                       ['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
time:  0:00:02.422288

Transform suggestion by @EdChum:

df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601

Use groupby, then select one column, then count, convert back to a dataframe, and finally join. The result is much faster:

df = df.join(df.groupby(['order_id'])['order_id'].count().to_frame('nitems_in_order'),on='order_id')
time: 0:00:00.406383
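
The same pattern should carry over to the customer-level column. This is a sketch assuming the column names from the question and that total_nitems_user_lifetime has not already been added to df (otherwise the join would complain about overlapping columns):

# count rows per customer, name the result, and join it back on customer_id
df = df.join(df.groupby('customer_id')['customer_id'].count().to_frame('total_nitems_user_lifetime'), on='customer_id')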

Thanks.
