
Pandas get most frequent values used together in the same column

I have a dataset that contains only two columns, user_id and Channel. The Channel column takes values from a pre-defined list [a, b, c, d]. Multiple rows can share the same user_id, and each row can contain any of the above channels.

If I consider the unique channels that each user visited, which set occurs most frequently?

Example dataframe:

>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1,'b'], [2,'c'], [2,'a'], [2,'a'], [2,'b'], [3,'a'], [3,'b']], columns=['user_id', 'Channel'])
>>> df
   user_id Channel
0        1       a
1        1       b
2        1       b
3        1       b
4        2       c
5        2       a
6        2       a
7        2       b
8        3       a
9        3       b

The expected solution for the above example would be something like:

  • For user_id == 1 the set of unique Channels is {a, b}, and that counts once for that combination.
  • For user_id == 2 the set of unique Channels is {a, b, c}, and that counts once for that combination. Note that this does not count toward any subset of these unique Channels.
  • For user_id == 3 the set of unique Channels is {a, b}, and that counts once for that combination.

If we count the one combination of unique Channels for each user_id, we should get:

>>> df_result = pd.DataFrame([['a,b', 2], ['a,b,c', 1]], columns=['Channels_together', 'n'])
>>> df_result
  Channels_together  n
0               a,b  2
1             a,b,c  1

I have come up with a solution, which is to pivot the table so that I get user_id plus the columns a, b, c, d, then assign an integer to each Channel column if it is not NA, then sum across columns and convert the results back to each combination.
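For reference, the pivot-and-encode idea described above could be sketched roughly as follows. This is only one reading of that description: the use of pd.crosstab, the powers-of-two weights, and the helper name decode are assumptions made for illustration, not the asker's actual code.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'],
     [2, 'a'], [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
    columns=['user_id', 'Channel'],
)

channels = ['a', 'b', 'c', 'd']  # the pre-defined channel list from the question

# One row per user, one boolean column per channel (True if visited at least once).
visited = (
    pd.crosstab(df['user_id'], df['Channel'])
    .reindex(columns=channels, fill_value=0)
    > 0
)

# Assumption: give each channel a distinct power of two so that every
# combination of channels sums to a unique integer code.
weights = pd.Series([2 ** i for i in range(len(channels))], index=channels)
codes = visited.astype(int).mul(weights, axis=1).sum(axis=1)

def decode(code):
    """Hypothetical helper: turn an integer code back into 'a,b,...'."""
    return ','.join(ch for i, ch in enumerate(channels) if (code >> i) & 1)

result = (
    codes.map(decode)
    .value_counts()
    .rename_axis('Channels_together')
    .reset_index(name='n')
)
print(result)
```

The encoding step works, but it is mostly bookkeeping: it only exists to turn a set of channels into something that can be counted.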

I'm sure there is a better way to do this, but I can't seem to find out how.

You can use groupby.apply(set) and then count the values with .value_counts:

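# Note: set objects are unhashable; if value_counts raises a TypeError on
# your pandas version, use frozenset instead of set.
df.groupby('user_id')['Channel'].apply(set).value_counts()\
  .rename_axis('Channels_together')\
  .reset_index(name='n')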

Output:

  Channels_together  n
0            {a, b}  2
1         {a, c, b}  1

If you want the values as strings, we can write a lambda function that sorts the set and joins it into a string:

df.groupby('user_id')['Channel'].apply(lambda x: ', '.join(sorted(set(x)))).value_counts()\
  .rename_axis('Channels_together')\
  .reset_index(name='n')

Output:

  Channels_together  n
0              a, b  2
1           a, b, c  1

frozenset

Is hashable and can be counted:

df.groupby('user_id').Channel.apply(frozenset).value_counts()

(a, b)       2
(a, b, c)    1
Name: Channel, dtype: int64
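A small plain-Python sketch of why frozenset can be counted while set cannot. Counting with a dict, or with value_counts, needs hashable keys; the variable names here are illustrative only.

```python
channels = ['a', 'b', 'b']

# A mutable set cannot be hashed, so it cannot serve as a counting key.
try:
    hash(set(channels))
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# A frozenset is immutable and therefore hashable.
frozen_is_hashable = isinstance(hash(frozenset(channels)), int)

# Equal frozensets land on the same key regardless of insertion order
# or repeated elements.
counts = {}
for key in (frozenset(['a', 'b']), frozenset(['b', 'a', 'b'])):
    counts[key] = counts.get(key, 0) + 1

print(set_is_hashable, frozen_is_hashable, counts[frozenset('ab')])
# prints: False True 2
```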

And we can tailor this to precisely what the OP wants with:

c = df.groupby('user_id').Channel.apply(frozenset).value_counts()
pd.DataFrame({'Channels_together': c.index.str.join(', '), 'n': c.values})

  Channels_together  n
0              a, b  2
1           a, b, c  1

Alternatively:

df.groupby('user_id').Channel.apply(frozenset).str.join(', ') \
  .value_counts().rename_axis('Channels_together').reset_index(name='n')

  Channels_together  n
0              a, b  2
1           a, b, c  1
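One caveat with joining a frozenset directly is that set iteration order is not guaranteed, so a label could in principle come out as 'b, a'. The sketch below sorts the channels before joining and then picks out the most frequent combination, which is what the question literally asks for. The names labels, top_combo, top_count and tied are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'],
     [2, 'a'], [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
    columns=['user_id', 'Channel'],
)

# Drop repeated (user_id, Channel) pairs, then build one sorted,
# order-stable label per user.
labels = (
    df.drop_duplicates()
    .groupby('user_id')['Channel']
    .agg(lambda s: ', '.join(sorted(s)))
)

counts = labels.value_counts()

top_combo = counts.idxmax()      # label of the most frequent combination
top_count = int(counts.max())    # how many users have exactly that combination

# If several combinations share the top count, list all of them.
tied = counts[counts == counts.max()].index.tolist()

print(top_combo, top_count, tied)
```

Sorting before joining keeps the labels deterministic, at the cost of turning the sets into plain strings.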
