简体   繁体   中英

Pandas groupby two columns and count shared values in third

In Pandas I would like to groupby two columns and calculate how many third column values are shared. With the addition of preference for greater sharing.

In the dataframe below, group col1 values, group col2 values and count how often col3 values are shared by col2 values.

The result is: ID1 & ID2 share a col3 value (2). ID3 shares with none (1). However, ID1, ID2 and ID4 also share a value (3). As ID1 & ID2 already share a value take the value that is shared by both IDs and more (3). Therefore the answer is 3,1. The list of counts must always = the nunique col2 values.

col1 col2 col3
A ID1 15
A ID1 16
A ID1 12
A ID2 15
A ID2 12
A ID3 18
A ID4 19
A ID4 12

If I am understanding you correctly, I think you want to group by col3 instead of col2 :

df = pd.read_html('https://stackoverflow.com/q/69419264/14277722')[0]

df = df.groupby(['col1','col3'])['col2'].apply(list).reset_index()
df['count'] = df['col2'].apply(len)

You can then remove rows where col2 is a subset of another row with the following :

arr = pd.get_dummies(df['col2'].explode()).max(level=0).to_numpy()
subsets = np.matmul(arr, arr.T)
np.fill_diagonal(subsets, 0)
mask = ~np.equal(subsets, np.sum(arr, 1)).any(0)

df = df[mask]
   col1 col3             col2  count
0     A   12  [ID1, ID2, ID4]      3
3     A   18            [ID3]      1     

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM