Pandas: unstacking a MultiIndex DataFrame using the most frequent value

Question

I've got this sample DataFrame df:

GridCode,User,DLang
3,224591119,es
3,224591119,ja
3,224591119,zh
4,224591119,es
6,146381773,en
9,17925282,ca

I would like to group by the User field, keeping only each user's most frequent DLang code, then unstack and count the number of users in each GridCode. So far I have:

d = df.groupby(['GridCode','DLang']).size().unstack().fillna(0)

which correctly returns:

DLang     ca  en  es  ja  zh
GridCode                    
3          0   0   1   1   1
4          0   0   1   0   0
6          0   1   0   0   0
9          1   0   0   0   0

However, as you can see in df, some users have multiple DLang entries (e.g. User 224591119), but I only want to count their most frequent DLang code (for that user, es). The resulting DataFrame would be:

DLang     ca  en  es
GridCode                    
3          0   0   1
4          0   0   1
6          0   1   0
9          1   0   0

Answer 1

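First, count how many times each user used each DLang, across all GridCode values. After count(), the GridCode column holds that row count, so rename it.

import pandas as pd

# one row per (User, DLang) pair; GridCode now holds the number of rows for that pair
g = df.groupby(['User','DLang']).count().reset_index()
g = g.rename(columns={'GridCode':'occurrences'})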
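If it helps, here is a small self-contained sketch of the same counting step on the sample data from the question. It uses size() with reset_index(name=...) instead of count() plus rename; that swap is my own choice, not something the step above requires.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "GridCode": [3, 3, 3, 4, 6, 9],
    "User": [224591119, 224591119, 224591119, 224591119, 146381773, 17925282],
    "DLang": ["es", "ja", "zh", "es", "en", "ca"],
})

# size() counts rows per (User, DLang) pair;
# reset_index(name=...) turns the result into a DataFrame with a named count column
g = df.groupby(["User", "DLang"]).size().reset_index(name="occurrences")
print(g)
```

User 224591119 ends up with es counted twice and ja and zh once each.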

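Then pick the row with the highest occurrences for each user. On its own, first() just returns the first row of each group, and since the groups are sorted by DLang that only matches the most frequent language by coincidence in this sample (es sorts before ja and zh). Sort by occurrences in descending order first, so that first() really returns the maximum.

h = g.sort_values('occurrences', ascending=False).groupby('User').first().reset_index()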
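To see why row order matters in this step, here is a sketch with made-up counts (the users and numbers are hypothetical, not from the question) where the most frequent language is not the alphabetically first one. It compares a plain first() with selecting by idxmax() on the count column, which is one alternative way to pick the top row per user.

```python
import pandas as pd

# Hypothetical per-user counts: user 1 used "zh" more often than "en"
g = pd.DataFrame({
    "User": [1, 1, 2],
    "DLang": ["en", "zh", "es"],
    "occurrences": [1, 3, 2],
})

# first() takes the first row of each group as the rows are currently ordered
by_position = g.groupby("User").first().reset_index()

# idxmax() returns the index label of the largest count within each group
by_count = g.loc[g.groupby("User")["occurrences"].idxmax()].reset_index(drop=True)

print(by_position)
print(by_count)
```

For user 1, by_position keeps en while by_count keeps zh. If two languages tie, idxmax() returns the first one it encounters, so you may want an explicit tie-break rule.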

Merge the per-user most frequent rows with the original input. This drops rows where a user used a DLang other than their most frequent one.

j = pd.merge(df,h, on=['User','DLang'])

Finally, count the remaining rows per GridCode and DLang, and unstack to get your final counts.

final_df = j.groupby(['GridCode','DLang']).size().unstack().fillna(0)

DLang     ca  en  es
GridCode            
3          0   0   1
4          0   0   1
6          0   1   0
9          1   0   0
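For reference, the whole thing can be wrapped into one function. This is only a sketch under a few assumptions: the function name top_language_counts is made up, ties between equally frequent languages go to whichever idxmax() sees first, and unstack(fill_value=0) is used in place of fillna(0) so the counts stay integers.

```python
import pandas as pd

def top_language_counts(df):
    """Count rows per GridCode, keeping only each user's most frequent DLang."""
    # occurrences of each language per user
    counts = df.groupby(["User", "DLang"]).size().reset_index(name="occurrences")
    # one (User, DLang) row per user: the language with the highest count
    top = counts.loc[counts.groupby("User")["occurrences"].idxmax(), ["User", "DLang"]]
    # inner merge drops rows whose language is not the user's top language
    kept = df.merge(top, on=["User", "DLang"])
    return kept.groupby(["GridCode", "DLang"]).size().unstack(fill_value=0)

# Sample data copied from the question
df = pd.DataFrame({
    "GridCode": [3, 3, 3, 4, 6, 9],
    "User": [224591119, 224591119, 224591119, 224591119, 146381773, 17925282],
    "DLang": ["es", "ja", "zh", "es", "en", "ca"],
})

result = top_language_counts(df)
print(result)
```

On the sample data this should give the same table as above, with only the ca, en and es columns left.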

Answer 1
0 accepted 2015-04-21 18:48:08
