Manipulate A Group Column in Pandas

Question

I have a data set with columns Dist, Class, and Count.

I want to group that data set by Dist and divide the Count column of each group by the sum of the counts for that group (that is, normalize each group so its counts sum to one).

The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?

import pandas as pd
import numpy as np

a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])

def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x/csum)
    return x

s.groupby('Dist').apply(manipcolumn)

Answer 1

One alternative way to get the normalised 'Count' column is to use groupby and transform to get the sum for each group, and then divide the 'Count' column by the returned Series. You can then assign the result back to your DataFrame:

s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')

This avoids the need for a bespoke Python function and the use of apply. Testing it on the small example DataFrame in your question showed that it was around 8 times faster.
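
To make the transform step easier to follow, here is a minimal sketch on a small fixed DataFrame. The data values and the names group_totals and check are made up for illustration and are not from the question; the point is only that transform returns one total per row, aligned with the original index, so the division works row by row.

```python
import pandas as pd

# Hypothetical fixed data (not from the question) so the result is reproducible.
s = pd.DataFrame({
    'Dist':  [0, 0, 1, 1, 1],
    'Class': [1, 2, 0, 3, 1],
    'Count': [1, 3, 2, 2, 4],
})

# One total per row: each row gets the sum of 'Count' for its own 'Dist' group.
group_totals = s.groupby('Dist')['Count'].transform('sum')

# Divide each count by its group's total and assign the result back.
s['Count'] = s['Count'] / group_totals

# Sanity check: the normalised counts within each group should sum to 1.
check = s.groupby('Dist')['Count'].sum()
print(s)
print(check)
```

One caveat to keep in mind: if every count in a group is zero, the group total is zero and the division yields NaN for that group. This can happen with the random data in the question, since np.random.randint(0,4,...) may produce zeros.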
