Calculating pairwise overlap for multiple boolean columns in pandas dataframe

I have a pandas dataframe with multiple boolean columns. I would like to find the pairwise overlap between all of these columns. The overlap should be the proportion of rows in which both columns are 1, counted only among rows where at least one of the two columns is 1. In other words, it should be like a Jaccard score, but rows where both elements are zero must be excluded.

Dataframe example:

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list('ABCDE'))
print(df.head())

   A  B  C  D  E
0  1  1  1  1  0
1  1  0  1  1  0
2  1  1  1  1  0
3  0  0  1  1  1
4  1  1  0  1  0

Ideally I would like a solution like this one (from the similar question "How to compute jaccard similarity from a pandas dataframe"):

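from sklearn.metrics.pairwise import pairwise_distances
# pairwise_distances returns Jaccard *distances*, so subtract from 1 to get similarities.
# The data is cast to bool because the "jaccard" metric expects boolean input.
jac_sim = 1 - pairwise_distances(df.T.to_numpy(dtype=bool), metric="jaccard")
jac_sim = pd.DataFrame(jac_sim, index=df.columns, columns=df.columns)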

The only difference is that rows where both columns are False should be excluded.
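To make the target quantity concrete, here is a small sketch using only columns A and B from the five rows printed above. The values are typed in by hand, and the names `head`, `both_true` and `either_true` are introduced here purely for illustration.

```python
import pandas as pd

# Columns A and B of the five rows shown by df.head() in the question,
# typed in by hand (not regenerated from the random seed).
head = pd.DataFrame(
    {
        "A": [1, 1, 1, 0, 1],
        "B": [1, 0, 1, 0, 1],
    }
)

a = head["A"].astype(bool)
b = head["B"].astype(bool)

both_true = int((a & b).sum())    # rows 0, 2 and 4
either_true = int((a | b).sum())  # rows 0, 1, 2 and 4; row 3 is False/False and drops out

overlap = both_true / either_true
print(both_true, either_true, overlap)  # 3 4 0.75
```

This ratio has the same form as the Jaccard index for binary vectors, |A ∩ B| / |A ∪ B|, in which rows where both entries are False appear in neither count.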

Does something like this help?

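# Use a separate Series instead of adding an 'AB' column, so df keeps only the original columns
ab = df['A'] + df['B']  # 2 = both are 1 (overlap), 1 = exactly one is 1, 0 = both are 0
vcs = ab.value_counts()
# .get(..., 0) avoids a KeyError when a value never occurs; rows that sum to 0 are ignored
prop = vcs.get(2, 0) / (vcs.get(1, 0) + vcs.get(2, 0))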

print(prop)
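The same idea can be extended from one pair to every pair of columns. The following is a minimal sketch, assuming the dataframe from the question; the helper name `pair_overlap` and the variables `overlaps` and `result` are hypothetical and not part of the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list("ABCDE"))


def pair_overlap(x: pd.Series, y: pd.Series) -> float:
    """Share of rows where both are 1 among rows where at least one is 1."""
    total = x + y                          # 2 = both 1, 1 = exactly one is 1, 0 = both 0
    both = int((total == 2).sum())
    exactly_one = int((total == 1).sum())
    if both + exactly_one == 0:            # neither column contains a 1
        return float("nan")
    return both / (both + exactly_one)


# One entry per unordered pair of columns, keyed by (first, second).
overlaps = {
    (c1, c2): pair_overlap(df[c1], df[c2])
    for c1, c2 in combinations(df.columns, 2)
}
result = pd.Series(overlaps)
print(result)
```

With five columns this yields ten pairs. The Python-level loop is fine for a handful of columns but may become slow when there are many.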

One option is to call scipy.spatial.distance.cdist with a custom distance function:

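from scipy.spatial.distance import cdist

def f(a, b):
    # a and b are two columns of df, passed in as 1-D integer arrays of 0/1
    both_one = ((a & b) == 1).sum()   # rows where both columns are 1
    different = (a != b).sum()        # rows where exactly one column is 1
    # rows where both are 0 are in neither count, so they are excluded
    return 1 - different / (different + both_one)

# despite the cdist name, f returns a similarity: 1 on the diagonal, higher = more overlap
sim = pd.DataFrame(cdist(df.T, df.T, f), index=df.columns, columns=df.columns)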
#           A         B         C         D         E
# A  1.000000  0.240000  0.380952  0.391892  0.260274
# B  0.240000  1.000000  0.323944  0.428571  0.320000
# C  0.380952  0.323944  1.000000  0.333333  0.328571
# D  0.391892  0.428571  0.333333  1.000000  0.362500
# E  0.260274  0.320000  0.328571  0.362500  1.000000
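For larger dataframes, calling a Python function once per pair of columns can be slow. As an alternative that is not part of the original answer, the same ratio can be sketched with a single matrix product, assuming the dataframe from the question. The names `X`, `both_one`, `either_one`, `sim_matrix` and `overlap` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.binomial(1, 0.5, size=(100, 5)), columns=list("ABCDE"))

X = df.to_numpy().astype(int)      # rows x columns, entries are 0 or 1
both_one = X.T @ X                 # [i, j] = number of rows where columns i and j are both 1
ones = X.sum(axis=0)               # number of 1s in each column
# Inclusion-exclusion: rows where at least one of the two columns is 1.
either_one = ones[:, None] + ones[None, :] - both_one

# Divide only where the denominator is non-zero; pairs with no 1s at all stay NaN.
sim_matrix = np.full(both_one.shape, np.nan)
np.divide(both_one, either_one, out=sim_matrix, where=either_one > 0)

overlap = pd.DataFrame(sim_matrix, index=df.columns, columns=df.columns)
print(overlap.round(6))
```

Rows where both columns are 0 contribute to neither `both_one` nor `either_one`, so they are excluded in the same way as in the function `f` above.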


 