I've got this sample DataFrame df:
GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca
I would like to group by the User field, keeping only each user's most frequent DLang code, then unstack and count the number of Users in each GridCode. So far I did:
d = df.groupby(['GridCode','DLang']).size().unstack().fillna(0)
which correctly returns:
DLang ca en es ja zh
GridCode
3 0 0 1 1 1
4 0 0 1 0 0
6 0 1 0 0 0
9 1 0 0 0 0
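For reference, that table can be reproduced with a self-contained sketch built from the sample data above:

```python
import pandas as pd
from io import StringIO

csv = """GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca"""
df = pd.read_csv(StringIO(csv))

# Count rows per (GridCode, DLang) pair, pivot DLang into columns
d = df.groupby(['GridCode', 'DLang']).size().unstack().fillna(0)
print(d)
```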
However, as you can see in df, some users have multiple DLang entries (e.g. User 224591119), but I only want to count their most frequent DLang code (for that user, it is es). The resulting dataframe would be:
DLang ca en es
GridCode
3 0 0 1
4 0 0 1
6 0 1 0
9 1 0 0
First, count how many times each DLang occurred for each user (the GridCode column simply provides values to count):
g = df.groupby(['User','DLang']).count().reset_index()
g = g.rename(columns={'GridCode':'occurrences'})
Then sort by the occurrence count and use first() to keep the most frequent DLang for each user. The sort is essential: first() alone just takes the first row of each group (which happens to be es here only because it sorts alphabetically before ja and zh), not the one with the max count.
g = g.sort_values('occurrences', ascending=False)
h = g.groupby('User').first().reset_index()
Merge the most-frequent-occurrence dataframe with the original input. Since pd.merge defaults to an inner join, this drops the rows where a user used a DLang other than their most frequent one:
j = pd.merge(df,h, on=['User','DLang'])
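The inner-join behavior can be checked in isolation: rows of df whose (User, DLang) pair is absent from h simply disappear. A small sketch, with h written out by hand for the sample data:

```python
import pandas as pd

df = pd.DataFrame({'GridCode': [3, 3, 3, 4, 6, 9],
                   'User': [224591119] * 4 + [146381773, 17925282],
                   'DLang': ['es', 'ja', 'zh', 'es', 'en', 'ca']})

# One row per user: their most frequent DLang (hand-built for illustration)
h = pd.DataFrame({'User': [17925282, 146381773, 224591119],
                  'DLang': ['ca', 'en', 'es']})

# Inner join by default: the ja and zh rows for 224591119 are dropped
j = pd.merge(df, h, on=['User', 'DLang'])
print(j)
```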
Finally, group and count again across GridCode and DLang to get your final table:
final_df = j.groupby(['GridCode','DLang']).size().unstack().fillna(0)
DLang ca en es
GridCode
3 0 0 1
4 0 0 1
6 0 1 0
9 1 0 0
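Putting the steps together, a complete runnable version (with the sort step included so first() really picks the max occurrence rather than the first alphabetical code):

```python
import pandas as pd
from io import StringIO

csv = """GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca"""
df = pd.read_csv(StringIO(csv))

# Step 1: count occurrences of each DLang per user
g = (df.groupby(['User', 'DLang']).count()
       .reset_index()
       .rename(columns={'GridCode': 'occurrences'}))

# Step 2: sort so each user's most frequent DLang comes first, then keep it
h = (g.sort_values('occurrences', ascending=False)
      .groupby('User').first()
      .reset_index())

# Step 3: inner merge drops each user's non-dominant DLang rows
j = pd.merge(df, h, on=['User', 'DLang'])

# Step 4: count users per GridCode/DLang and pivot
final_df = j.groupby(['GridCode', 'DLang']).size().unstack().fillna(0)
print(final_df)
```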