What's an efficient way to groupby, filter and count the occurrence of a particular value within each group in Pandas?

Question

Hi how's it going? I have a giant dataframe and am trying to do a groupby, filter, then count within each group the occurrence of a particular event. The code I have works but doesn't scale well at all, it takes forever to run. Can someone help me with a fast way to perform the same computation? Below is what I have so far reproduced in a dummy example:

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

def sneeze_by_weather(df):
    num_sneeze = df[df['event']=='sneeze'].shape[0] 
    if num_sneeze==0:
        return 0
    else:
        return num_sneeze

df_transformed = df.groupby(['date','person','weather']).apply(lambda x: sneeze_by_weather(x)).reset_index()

Link to resulting dataframe

Is there any way to perform this computation much faster so that it scales when I have millions of rows?

Answer 1

This should be faster

idx_cols = ['date','person','weather']
idx = pd.MultiIndex.from_frame(df[idx_cols])

df_transformed = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reindex(idx, fill_value=0)
      .reset_index()
)

Output:

>>> df_transformed

         date person weather  0
0  2012-03-30   dave   rainy  1
1  2012-03-30   mike   sunny  0
2  2012-03-30   mike  cloudy  1
3  2012-03-30   dave  cloudy  1
4  2012-03-30   mike   rainy  0
5  2012-03-31   dave   sunny  0
6  2012-03-31   dave  cloudy  1
7  2012-03-31   dave   sunny  0
8  2012-03-31   mike  cloudy  1
9  2012-03-31   mike   rainy  1

Other option is to use merge

idx_cols = ['date','person','weather']

counts = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reset_index()
)

df_transformed = ( 
    df[idx_cols].merge(counts, on=idx_cols, how='left')
                .fillna(0)
                .astype({0: int})  # convert the type of the new column (labeled 0) to int. It was float due to NaNs 
)

I was overcomplicating... actually, this way is much more straightforward

idx_cols = ['date','person','weather']

df_transformed = (
    df.assign(is_sneeze=(df.event == 'sneeze'))
      .groupby(idx_cols, as_index=False)['is_sneeze']
      .sum()
)

Answer 2

%%timeit for your provided dataframe with my code below:

The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached. 1000 loops, best of 5: 495 µs per loop

df['event'].loc[(df['event'] == 'sneeze')] = 1
df['event'].loc[(df['event'] != 1)] = 0
df

index	date	person	weather	event
0	2012-03-30	dave	rainy	1
1	2012-03-30	mike	sunny	0
2	2012-03-30	mike	cloudy	1
3	2012-03-30	dave	cloudy	1
4	2012-03-30	mike	rainy	0
5	2012-03-31	dave	sunny	0
6	2012-03-31	dave	cloudy	1
7	2012-03-31	dave	sunny	0
8	2012-03-31	mike	cloudy	1
9	2012-03-31	mike	rainy	1

Answer 3

import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

df_transformed = pd.DataFrame(df.groupby(['date','person','weather','event'])['event'].count()).rename(columns = {'event':'count'}).reset_index()
df_transformed['0'] = np.where((df_transformed['event'] == 'sneeze') & (df_transformed['count'] != 0),
                              df_transformed['count'],
                              0)

df_transformed = df_transformed.drop(labels = ['event','count'], axis = 1)

df_transformed

What's an efficient way to groupby, filter and count the occurrence of a particular value within each group in Pandas?

Question

3 answers

solution1
0 2022-05-29 20:41:01

solution2
0 2022-05-29 20:48:42

solution3
0 2022-05-29 20:48:52

What's an efficient way to groupby, filter and count the occurrence of a particular value within each group in Pandas?

Question

3 answers

solution1 0 2022-05-29 20:41:01

solution2 0 2022-05-29 20:48:42

solution3 0 2022-05-29 20:48:52

solution1
0 2022-05-29 20:41:01

solution2
0 2022-05-29 20:48:42

solution3
0 2022-05-29 20:48:52