Pandas: group by two columns and count shared values in a third column

Question

In Pandas, I would like to group by two columns and count how many values in a third column are shared, preferring the larger group whenever sharing overlaps.

In the dataframe below, group by col1, then by col2, and count how many col2 values share each col3 value.

The expected result: ID1 and ID2 share a col3 value (count 2). ID3 shares with none (count 1). However, ID1, ID2 and ID4 also share a value (count 3). Because ID1 and ID2 already share a value, take the value that is shared by both of them and by more IDs (count 3). The answer is therefore 3, 1. The counts must always sum to the number of unique col2 values (here 3 + 1 = 4).

col1  col2  col3
A     ID1   15
A     ID1   16
A     ID1   12
A     ID2   15
A     ID2   12
A     ID3   18
A     ID4   19
A     ID4   12
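
To reproduce the table without copying it by hand, a minimal sketch along these lines should work. The variable names `df` and `n_ids` are assumptions for illustration; the values come straight from the table above, and the final check reads the question's last sentence as "the counts should sum to the number of unique col2 values".

```python
import pandas as pd

# Sample data copied from the question's table
df = pd.DataFrame({
    "col1": ["A"] * 8,
    "col2": ["ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID4", "ID4"],
    "col3": [15, 16, 12, 15, 12, 18, 19, 12],
})

# Number of distinct IDs; the expected answer (3, 1) adds up to this
n_ids = df["col2"].nunique()
print(df)
print(n_ids)
```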

Answer 1

If I understand you correctly, I think you want to group by col3 instead of col2:

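import numpy as np
import pandas as pd

df = pd.read_html('https://stackoverflow.com/q/69419264/14277722')[0]

# Collect the col2 IDs that share each (col1, col3) value, then count them
df = df.groupby(['col1','col3'])['col2'].apply(list).reset_index()
df['count'] = df['col2'].apply(len)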
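
To see what this grouping step produces without fetching the page, the sketch below applies the same two lines to a hand-built copy of the question's table. The inline DataFrame and the name `grouped` are assumptions made for illustration only.

```python
import pandas as pd

# Hand-built copy of the question's table (stands in for the read_html call)
df = pd.DataFrame({
    "col1": ["A"] * 8,
    "col2": ["ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID4", "ID4"],
    "col3": [15, 16, 12, 15, 12, 18, 19, 12],
})

# One row per (col1, col3) value, listing the col2 IDs that share it
grouped = df.groupby(["col1", "col3"])["col2"].apply(list).reset_index()
grouped["count"] = grouped["col2"].apply(len)
print(grouped)
```

On the sample data this gives five candidate groups: col3 = 12 is shared by ID1, ID2 and ID4, col3 = 15 by ID1 and ID2, and each remaining value belongs to a single ID.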

You can then remove the rows whose col2 list is a subset of another row's col2 list:

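# Indicator matrix: one row per group, one column per ID (1 = ID is in the group)
arr = pd.get_dummies(df['col2'].explode(), dtype=int).groupby(level=0).max().to_numpy()
# subsets[i, j] = number of IDs that groups i and j have in common
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)
# Group j is a subset of group i when their overlap equals the size of group j
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)

df = df[mask]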

   col1 col3             col2  count
0     A   12  [ID1, ID2, ID4]      3
3     A   18            [ID3]      1
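
As a cross-check on the matrix approach, the same filter can be sketched with plain Python sets. This is an alternative formulation, not the code from the answer: the helper name `drop_subsets` is hypothetical, the DataFrame is a hand-built copy of the question's table, and the helper uses a strict-subset test, so two groups with identical members would both be kept, whereas the matrix mask would drop both.

```python
import pandas as pd

def drop_subsets(groups):
    """Return a keep-mask: False where a group is a strict subset of another group."""
    sets = [set(g) for g in groups]
    return [not any(s < other for other in sets) for s in sets]

# Hand-built copy of the question's table
df = pd.DataFrame({
    "col1": ["A"] * 8,
    "col2": ["ID1", "ID1", "ID1", "ID2", "ID2", "ID3", "ID4", "ID4"],
    "col3": [15, 16, 12, 15, 12, 18, 19, 12],
})

grouped = df.groupby(["col1", "col3"])["col2"].apply(list).reset_index()
grouped["count"] = grouped["col2"].apply(len)

# Keep only the groups that are not contained in a larger group
result = grouped.loc[drop_subsets(grouped["col2"])]
counts = result["count"].tolist()
print(counts)  # [3, 1]
```

The pairwise set comparison is quadratic in the number of groups, which is fine for small tables; the indicator-matrix version is likely the better choice for large ones.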

Answer 1: 2 votes, accepted, 2021-10-02 22:14:16