Pandas get most frequent values used together in the same column
I have a dataset that contains only two columns, user_id and channel. The channel column can take values from a pre-defined list [a, b, c, d]. There are multiple rows with the same user_id, and each row can contain any of the above channels.
If I consider the unique channels that each user visited, what set occurs most frequently?
Example dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 'a'], [1, 'b'], [1, 'b'], [1,'b'], [2,'c'], [2,'a'], [2,'a'], [2,'b'], [3,'a'], [3,'b']], columns=['user_id', 'Channel'])
>>> df
   user_id Channel
0        1       a
1        1       b
2        1       b
3        1       b
4        2       c
5        2       a
6        2       a
7        2       b
8        3       a
9        3       b
Expected solution:
For the above example, it would be something like:
user_id == 1: the set of unique Channels is {a, b}, and that counts once for that combination.
user_id == 2: the set of unique Channels is {a, b, c}, and that counts once for that combination. Note that this does not count for any subsets of these unique Channels.
user_id == 3: the set of unique Channels is {a, b}, and that counts once for that combination.
If we count the one combination of unique Channels for each user_id, we should get:
>>> df_result = pd.DataFrame([['a,b', 2], ['a,b,c', 1]], columns=['Channels_together', 'n'])
>>> df_result
  Channels_together  n
0               a,b  2
1             a,b,c  1
I have come up with a solution, which is to pivot the table so that I get user_id and columns a, b, c, d, then assign an integer to each Channel column if it is not NA, then sum across columns and convert the results back to each combination.
I'm sure there is a better way to do this, but I can't seem to find out how.
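For reference, the pivot-and-encode idea described above could look roughly like the sketch below. It assumes pd.crosstab for the pivot and a power-of-two weight per channel column; the names visited, weights, codes and decode are made up for illustration and are not from the original post.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'],
     [2, 'a'], [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
    columns=['user_id', 'Channel'],
)

# One row per user, one boolean column per channel that appears in the data.
visited = pd.crosstab(df['user_id'], df['Channel']) > 0

# Assumed encoding: the i-th channel column gets the weight 2**i.
weights = {ch: 1 << i for i, ch in enumerate(visited.columns)}

# Sum the weights of the visited channels to get one integer code per user.
codes = sum(visited[ch].astype(int) * w for ch, w in weights.items())

def decode(code):
    """Turn an integer code back into a comma-separated channel list."""
    return ','.join(ch for ch, w in weights.items() if code & w)

# Count identical codes, then translate each code back to its channels.
counts = codes.value_counts()
result = pd.DataFrame({
    'Channels_together': [decode(c) for c in counts.index],
    'n': counts.to_numpy(),
})
print(result)
```

This works, but it carries the encode/decode bookkeeping that makes the approach feel heavier than it needs to be.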
You can use groupby.apply(set) and then count the values with .value_counts:
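# rename_axis names the counted values explicitly, so the result does not
# depend on the default column name that reset_index would otherwise produce
df.groupby('user_id')['Channel'].apply(set).value_counts()\
    .rename_axis('Channels_together')\
    .reset_index(name='n')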
Output
  Channels_together  n
0            {a, b}  2
1         {a, c, b}  1
If you want your values in str format, we can write a lambda function to sort our set and convert it to a string:
# sort each user's unique channels, join them into one string, then count
df.groupby('user_id')['Channel'].apply(lambda x: ', '.join(sorted(set(x)))).value_counts()\
    .rename_axis('Channels_together')\
    .reset_index(name='n')
Output
  Channels_together  n
0              a, b  2
1           a, b, c  1
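Putting the string variant together as a self-contained sketch on the example data (the variable names combos and result are illustrative, and ',' is used as the separator to match the expected df_result from the question):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'],
     [2, 'a'], [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
    columns=['user_id', 'Channel'],
)

# One string per user: the sorted unique channels, comma-separated.
combos = df.groupby('user_id')['Channel'].apply(
    lambda s: ','.join(sorted(set(s)))
)

# Count identical strings and shape the result like df_result.
result = (
    combos.value_counts()
    .rename_axis('Channels_together')
    .reset_index(name='n')
)
print(result)
```

Sorting before joining matters here: it makes {a, b} and {b, a} map to the same string, so they are counted as one combination.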
frozenset is hashable and can be counted:
df.groupby('user_id').Channel.apply(frozenset).value_counts()
(a, b) 2
(a, b, c) 1
Name: Channel, dtype: int64
And we can tailor this to precisely what OP has with:
c = df.groupby('user_id').Channel.apply(frozenset).value_counts()
pd.DataFrame({'Channels_together': c.index.str.join(', '), 'n': c.values})
  Channels_together  n
0              a, b  2
1           a, b, c  1
Alternatively:
df.groupby('user_id').Channel.apply(frozenset).str.join(', ') \
.value_counts().rename_axis('Channels_together').reset_index(name='n')
  Channels_together  n
0              a, b  2
1           a, b, c  1
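One caveat, offered as a sketch rather than a correction: a frozenset has no guaranteed iteration order, so joining it directly may not always give alphabetical labels. Assuming sorted labels are wanted, the members can be sorted while building the label column (counts and result are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'a'], [1, 'b'], [1, 'b'], [1, 'b'], [2, 'c'],
     [2, 'a'], [2, 'a'], [2, 'b'], [3, 'a'], [3, 'b']],
    columns=['user_id', 'Channel'],
)

# frozenset is hashable, so identical channel sets fall into the same bucket.
counts = df.groupby('user_id')['Channel'].apply(frozenset).value_counts()

# Sort each set's members before joining so the labels are deterministic.
result = pd.DataFrame({
    'Channels_together': [', '.join(sorted(fs)) for fs in counts.index],
    'n': counts.to_numpy(),
})
print(result)
```

Counting on the frozenset and formatting only at the end keeps the grouping key independent of how the label is rendered.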