Manipulate A Group Column in Pandas

I have a data set with columns Dist, Class, and Count.

I want to group that data set by Dist and divide the Count column of each group by the sum of the counts for that group (so that the counts in each group sum to one).

The following MWE demonstrates my approach thus far. But I wonder: is there a more compact/pandaific way of writing this?

import pandas as pd
import numpy as np

a = np.random.randint(0,4,(10,3))
s = pd.DataFrame(a,columns=['Dist','Class','Count'])

def manipcolumn(x):
    csum = x['Count'].sum()
    x['Count'] = x['Count'].apply(lambda x: x/csum)
    return x

s.groupby('Dist').apply(manipcolumn)

One alternative way to get the normalised 'Count' column is to use groupby and transform to get the sum for each group, and then divide the 'Count' column by the returned Series. Because transform returns a Series aligned with the original index, you can assign the result straight back to your DataFrame:

s['Count'] = s['Count'] / s.groupby('Dist')['Count'].transform('sum')

This avoids the need for a bespoke Python function and the use of apply. Testing it on the small example DataFrame in your question showed that it was around 8 times faster.
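To see what transform is doing here, a minimal sketch on a small hand-made frame may help. The values below are made up for illustration (they are not from the question), and the names group_totals, normalized and per_group_sum are just placeholders:

```python
import pandas as pd

# Illustrative data only: two Dist groups with known totals (4 and 8)
s = pd.DataFrame({
    'Dist':  [0, 0, 1, 1, 1],
    'Class': [1, 2, 1, 2, 3],
    'Count': [1, 3, 2, 2, 4],
})

# transform('sum') returns a Series with the same index as s,
# where every row holds the total Count of that row's Dist group
group_totals = s.groupby('Dist')['Count'].transform('sum')

# Element-wise division normalises each group so that it sums to one
normalized = s['Count'] / group_totals

# Sanity check: the normalised values add up to 1 within every group
per_group_sum = normalized.groupby(s['Dist']).sum()
print(per_group_sum)
```

One thing to watch for with random data like the question's: if every Count in a group is 0, the group total is 0 and the division yields NaN for those rows.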
