简体   繁体   中英

Pandas calculate common occurrence percentage of a group by

I would think this is a simple task, but have had a difficult time getting it working. I have a pandas dataframe

IDX     Name    Fruit
 0      Mary    Grapes
 1      Tom     Apple
 2      Mary    Grapes
 3      Mary    Apple
 4      Tom     Apple
 5      Mary    Grapes
 6      Tom     Grapes
 7      Tom     Apple

My expected result

IDX     Name    Occurrence     Perc
 0      Mary    Grapes         75%
 1      Tom     Apple          75%

So my end result would be a data frame grouped by name, with the most common occurrence for fruit and the percentage of that occurrence

So to get the most common occurrence i use

df["Occurrence"] = (df.groupby("Name")["Fruit"].transform(lambda x: x.value_counts().index[0])) 

How would i go about populating the perc column?

Let us do crosstab then sort_values with tail

out = pd.crosstab(df.Name,df.Fruit,normalize='index').stack().sort_values().groupby(level=0).tail(1)
Out[65]: 
Name  Fruit 
Mary  Grapes    0.75
Tom   Apple     0.75
dtype: float64

Use Series.value_counts with normalize=True in lambda function and return first row by DataFrame.head , last convert Multiindex Series to DataFrame :

df1 = (df.groupby("Name")["Fruit"]
         .apply(lambda x: x.value_counts(normalize=True).head(1))
         .mul(100)
         .rename_axis(['Name','Occurrence'])
         .reset_index(name='Perc'))
print (df1)
   Name Occurrence  Perc
0  Mary     Grapes  75.0
1   Tom      Apple  75.0

Here is another way.

df['pcts'] = df.groupby(['name','fruit'])['fruit'].transform('size').div(df['fruit'].groupby(df['name']).transform('size'))

df.loc[df.groupby('name')['pcts'].idxmax()]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM