I am adding columns to a dataframe based on groupings of other columns: I group, count, and finally join the results back to the original dataframe.
The full data has 1M rows; I first tried the approach on 20k rows and it worked fine. The data has one entry for each item a customer added to an order.
Here is a sample data:
import numpy as np
import pandas as pd
data = np.array([[101,201,301],[101,201,302],[101,201,303],[101,202,301],[101,202,302],[101,203,301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id','item_id'])
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
For the sample data above the desired output is:
| customer_id | order_id | item_id | total_nitems_user_lifetime | nitems_in_order |
|-------------|----------|---------|----------------------------|-----------------|
| 101         | 201      | 301     | 6                          | 3               |
| 101         | 201      | 302     | 6                          | 3               |
| 101         | 201      | 303     | 6                          | 3               |
| 101         | 202      | 301     | 6                          | 2               |
| 101         | 202      | 302     | 6                          | 2               |
| 101         | 203      | 301     | 6                          | 1               |
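For completeness, the setup and both joins can be run end-to-end to verify the desired output (a minimal repro; `np.array` is used here since `np.matrix` is deprecated):

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question
data = np.array([[101, 201, 301], [101, 201, 302], [101, 201, 303],
                 [101, 202, 301], [101, 202, 302], [101, 203, 301]])
df = pd.DataFrame(data, columns=['customer_id', 'order_id', 'item_id'])

# Items per customer over their lifetime: count rows per customer_id,
# then join the count back onto each row
df['total_nitems_user_lifetime'] = df.join(
    df.groupby('customer_id').count()['order_id'],
    on='customer_id', rsuffix='_x')['order_id_x']

# Items per order: same pattern, keyed on order_id
df['nitems_in_order'] = df.join(
    df.groupby('order_id').count()['customer_id'],
    on='order_id', rsuffix='_x')['customer_id_x']

print(df)
```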
The piece of the code that works relatively fast even with 1M rows is:
df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
But a similar join, takes quite some time ~couple hours:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
I am hoping there is a smarter way to get the same aggregate values. I understand why the second case takes so long: the number of groups increases quite a bit. Thank you.
OK, I can see what you are trying to achieve. On this sample size it's over 2x faster, and I think it is likely to scale much better too: instead of joining/merging the result of your groupby back to your original df, just call transform:
In [24]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
df
100 loops, best of 3: 2.66 ms per loop
100 loops, best of 3: 2.85 ms per loop
Out[24]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
In [26]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
df
100 loops, best of 3: 6.4 ms per loop
100 loops, best of 3: 6.46 ms per loop
Out[26]:
customer_id order_id item_id total_nitems_user_lifetime nitems_in_order
0 101 201 301 6 3
1 101 201 302 6 3
2 101 201 303 6 3
3 101 202 301 6 2
4 101 202 302 6 2
5 101 203 301 6 1
Interestingly when I try this on a 600,000 row df:
In [34]:
%timeit df['total_nitems_user_lifetime'] = df.groupby('customer_id')['order_id'].transform('count')
%timeit df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
10 loops, best of 3: 160 ms per loop
1 loops, best of 3: 231 ms per loop
In [36]:
%timeit df['total_nitems_user_lifetime'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x']
%timeit df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
10 loops, best of 3: 208 ms per loop
10 loops, best of 3: 215 ms per loop
My first transform is about 25% faster, but the second is actually slower than your join, so overall the gap narrows; I think it's worth trying on your real data to see if it yields any speed improvement.
If we combine the column creations onto a single line:
In [40]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.groupby('customer_id')['order_id'].transform('count'), df.groupby('order_id')['customer_id'].transform('count')
1 loops, best of 3: 425 ms per loop
In [42]:
%timeit df['total_nitems_user_lifetime'], df['nitems_in_order'] = df.join(df.groupby('customer_id').count()\
['order_id'],on='customer_id',rsuffix="_x")['order_id_x'] , df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
1 loops, best of 3: 447 ms per loop
We can see that my combined code is only marginally faster than yours, so there isn't much saved by doing this. Normally you could apply multiple aggregation functions to return multiple columns at once, but the problem here is that you are grouping by different columns, so we have to perform two expensive groupby operations.
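To illustrate that multiple-aggregation point: when every derived column shares the same grouping key, a single groupby can produce all of them in one pass (a minimal sketch on the sample data; it doesn't apply directly here because the question's two counts are keyed on different columns):

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({'customer_id': [101] * 6,
                   'order_id':    [201, 201, 201, 202, 202, 203],
                   'item_id':     [301, 302, 303, 301, 302, 301]})

# One groupby, several aggregations in a single pass -- this only helps
# when every result column uses the *same* grouping key
per_order = df.groupby('order_id')['item_id'].agg(['count', 'nunique'])
print(per_order)
```

Since `total_nitems_user_lifetime` is keyed on `customer_id` and `nitems_in_order` on `order_id`, this trick cannot merge the two groupbys in the question.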
Original approach, with 1M rows:
df['nitems_in_order'] = df.join(df.groupby('order_id').count()\
['customer_id'],on='order_id',rsuffix="_x")['customer_id_x']
time: 0:00:02.422288
Transform suggestion by @EdChum:
df['nitems_in_order'] = df.groupby('order_id')['customer_id'].transform('count')
time: 0:00:04.713601
Using groupby, selecting a single column, counting, converting back to a dataframe with to_frame, and finally joining turns out to be much faster:
df = df.join(df.groupby(['order_id'])['order_id'].count().to_frame('nitems_in_order'),on='order_id')
time: 0:00:00.406383
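Assuming the same to_frame-then-join pattern is equally effective for the other column (I have not timed this variant on the 1M-row data), the customer-level count can be written the same way:

```python
import pandas as pd

# Same sample data as in the question
df = pd.DataFrame({'customer_id': [101] * 6,
                   'order_id':    [201, 201, 201, 202, 202, 203],
                   'item_id':     [301, 302, 303, 301, 302, 301]})

# Count rows per customer, name the result column explicitly with
# to_frame, then join it back on the grouping key -- no suffix juggling
df = df.join(df.groupby('customer_id')['customer_id'].count()
               .to_frame('total_nitems_user_lifetime'),
             on='customer_id')
print(df)
```

Naming the count with `to_frame` avoids the column-name collision that forced the `rsuffix="_x"` workaround in the original joins.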
Thanks.