
pandas - unstack with most frequent values in MultiIndex DataFrame

I've got this sample DataFrame df:

GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca
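
To make the sample reproducible, it can be loaded like this. This is a minimal sketch that assumes pandas is imported as pd and the data is pasted inline as CSV text; the variable name csv_text is only illustrative.

```python
import io
import pandas as pd

# Sample data from the question, pasted inline as CSV text.
csv_text = """GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca
"""

# Parse the text into a DataFrame with columns GridCode, User, DLang.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```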

I would like to group by the User field, keeping only each user's most frequent DLang code, then unstack and count the number of users in each GridCode. So far I have:

d = df.groupby(['GridCode','DLang']).size().unstack().fillna(0)

which correctly returns:

DLang     ca  en  es  ja  zh
GridCode                    
3          0   0   1   1   1
4          0   0   1   0   0
6          0   1   0   0   0
9          1   0   0   0   0

However, as you can see in df, some users have multiple DLang entries (e.g. User 224591119), but I only want to count their most frequent DLang code (for that user, es). The resulting DataFrame would be:

DLang     ca  en  es
GridCode                    
3          0   0   1
4          0   0   1
6          0   1   0
9          1   0   0

First, count how many times each user used each DLang, summed across all GridCode values.

g = df.groupby(['User','DLang']).count().reset_index()
g = g.rename(columns={'GridCode':'occurrences'})

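Then, for each user, select the row with the highest occurrence count. idxmax() returns the index label of the first maximum, so if a user has a tie, the DLang that sorts first alphabetically wins.

h = g.loc[g.groupby('User')['occurrences'].idxmax()]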
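
To check this step in isolation, here is a small sketch with hypothetical data (not from the question) in which a user's most frequent code is not the one that sorts first alphabetically. It mirrors the occurrences column used above and picks the per-user maximum explicitly with idxmax(); the names toy, counts and top are illustrative.

```python
import pandas as pd

# Hypothetical data: user 1 used 'zh' twice and 'es' once.
toy = pd.DataFrame({
    'GridCode': [1, 2, 3, 4],
    'User': [1, 1, 1, 2],
    'DLang': ['es', 'zh', 'zh', 'en'],
})

# Count occurrences of each (User, DLang) pair.
counts = toy.groupby(['User', 'DLang']).size().reset_index(name='occurrences')

# Keep, for each user, the row with the largest count.
# idxmax() returns the label of the first maximum, so ties go to the
# DLang that sorts first.
top = counts.loc[counts.groupby('User')['occurrences'].idxmax()]
print(top)
```

For user 1 this keeps zh, whereas taking the first row per user without ordering by count would keep es.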

Merge the per-user most frequent rows with the original input. Because pd.merge performs an inner join by default, this drops every row where a user used a DLang other than their most frequent one.

j = pd.merge(df,h, on=['User','DLang'])

Finally, count the remaining rows per GridCode and DLang, then unstack to get the final counts.

final_df = j.groupby(['GridCode','DLang']).size().unstack(fill_value=0)

DLang     ca  en  es
GridCode            
3          0   0   1
4          0   0   1
6          0   1   0
9          1   0   0
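
Putting the steps together, the whole pipeline can be sketched as one function. The function name count_top_lang is hypothetical. The sketch assumes the explicit per-user maximum (via idxmax()) is what is wanted, with ties resolved in favour of the code that sorts first, and it counts rows rather than distinct users, as the steps above do.

```python
import pandas as pd

def count_top_lang(df):
    """Cross-tabulate GridCode by DLang, keeping only each user's most frequent DLang."""
    # Occurrences of each (User, DLang) pair across all grid cells.
    counts = df.groupby(['User', 'DLang']).size().reset_index(name='occurrences')
    # One row per user: the DLang with the highest count (first one on ties).
    top = counts.loc[counts.groupby('User')['occurrences'].idxmax(), ['User', 'DLang']]
    # Inner merge drops rows where the user used any other DLang.
    kept = df.merge(top, on=['User', 'DLang'])
    # Count rows per (GridCode, DLang); fill_value=0 keeps the counts as integers.
    return kept.groupby(['GridCode', 'DLang']).size().unstack(fill_value=0)

# Sample data from the question.
df = pd.DataFrame({
    'GridCode': [3, 3, 3, 4, 6, 9],
    'User': [224591119, 224591119, 224591119, 224591119, 146381773, 17925282],
    'DLang': ['es', 'ja', 'zh', 'es', 'en', 'ca'],
})
result = count_top_lang(df)
print(result)
```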
