pandas - unstack with most frequent values in MultiIndex DataFrame
I've got this sample DataFrame df:
GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca
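To reproduce the problem, the sample can be loaded as below. This is a minimal sketch, assuming pandas is installed and that the rows above are the complete data.

```python
import io

import pandas as pd

# The sample rows from the question, as CSV text.
csv_text = """GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca
"""

# GridCode and User are parsed as integers, DLang as strings.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```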
I would like to group by the User field, keeping only each user's most frequent DLang code, then unstack and count the number of User entries in each GridCode. So far I did:
d = df.groupby(['GridCode','DLang']).size().unstack().fillna(0)
which correctly returns:
DLang ca en es ja zh
GridCode
3 0 0 1 1 1
4 0 0 1 0 0
6 0 1 0 0 0
9 1 0 0 0 0
However, as you can see in df, some users have multiple DLang entries (e.g. User 224591119), but I only want to count their most frequent DLang code (e.g. for that user, it is es). The resulting DataFrame would be:
DLang ca en es
GridCode
3 0 0 1
4 0 0 1
6 0 1 0
9 1 0 0
First, count how many times each user used each DLang, across all GridCode values.
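import pandas as pd
# Count rows per (User, DLang); the remaining GridCode column holds the counts.
g = df.groupby(['User','DLang']).count().reset_index()
g = g.rename(columns={'GridCode':'occurrences'})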
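Then, sort by occurrences in descending order and use the first() function to keep the most frequent DLang for each user. Without the sort, first() just returns each user's alphabetically first DLang, which matches the expected answer for this sample only by coincidence.
h = g.sort_values('occurrences', ascending=False).groupby('User').first().reset_index()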
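An alternative sketch for this step, assuming a table with the columns User, DLang and occurrences like g above: idxmax looks up the row with the largest count for each user directly, so the result does not depend on how the rows are ordered. When two languages tie, the row that appears first wins. The sample data is rebuilt inline so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "GridCode": [3, 3, 3, 4, 6, 9],
    "User": [224591119, 224591119, 224591119, 224591119, 146381773, 17925282],
    "DLang": ["es", "ja", "zh", "es", "en", "ca"],
})

# Occurrences of each DLang per user.
g = df.groupby(["User", "DLang"]).size().reset_index(name="occurrences")

# For each user, idxmax gives the index label of the row with the largest count;
# .loc then pulls exactly those rows out of g.
h = g.loc[g.groupby("User")["occurrences"].idxmax()].reset_index(drop=True)
print(h)
```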
Merge just the most frequent DLang per user (h) with the original input. Because pd.merge performs an inner join by default, this drops rows where a user used a DLang other than their most frequent one.
j = pd.merge(df,h, on=['User','DLang'])
Finally, count the rows for each GridCode and DLang pair and unstack to get your final counts.
final_df = j.groupby(['GridCode','DLang']).size().unstack().fillna(0)
DLang ca en es
GridCode
3 0 0 1
4 0 0 1
6 0 1 0
9 1 0 0
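For comparison, a more compact sketch of the same idea, not the method used in the steps above. It assumes each (GridCode, User, DLang) row appears at most once, so counting rows is the same as counting users. Series.mode() returns the most frequent values in sorted order, so ties resolve to the alphabetically first language. The names top and kept are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "GridCode": [3, 3, 3, 4, 6, 9],
    "User": [224591119, 224591119, 224591119, 224591119, 146381773, 17925282],
    "DLang": ["es", "ja", "zh", "es", "en", "ca"],
})

# Broadcast each user's most frequent DLang onto every row of that user.
top = df.groupby("User")["DLang"].transform(lambda s: s.mode().iloc[0])

# Keep only the rows where the user is using their most frequent DLang.
kept = df[df["DLang"] == top]

# Cross-tabulate GridCode against DLang; missing combinations become 0.
result = pd.crosstab(kept["GridCode"], kept["DLang"])
print(result)
```

Unlike the unstack().fillna(0) version, pd.crosstab fills the missing combinations itself and keeps the counts as integers.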