Similarity in R

Question

I have data for patient IDs and the hospitals where these patients were treated. I want to calculate the Jaccard similarity between hospitals. Below is the sample data.

HospitalID  CustID
1   1
2   1
1   2
4   2
1   3
2   3
3   3

The Jaccard index for (Hospital1, Hospital2) = number of patients treated by both H1 and H2 / number of patients in the union of H1 and H2. For the pair above it is 2/(3+2-2). I need to calculate it for every combination of hospitals, i.e. (H1,H2) (H1,H3) (H1,H4) (H2,H4) (H3,H4).
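The arithmetic for a single pair can be checked with a short sketch. The helper name jaccard and the vectors h1 and h2 are illustrative only and not part of the original post; they hold the patient IDs of Hospital 1 and Hospital 2 from the sample data.

```r
# Hypothetical helper: Jaccard index of two vectors of patient IDs.
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  n_both <- length(intersect(a, b))           # patients treated by both
  n_both / (length(a) + length(b) - n_both)   # intersection / union
}

h1 <- c(1, 2, 3)  # patients treated at Hospital 1
h2 <- c(1, 3)     # patients treated at Hospital 2

score <- jaccard(h1, h2)  # 2 / (3 + 2 - 2)
print(round(score, 2))
```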

In the real dataset, I have data for more than 2000 hospitals and 100K insureds. Many R packages calculate Jaccard distance, but I would have to transpose the data and put insured IDs in columns, which is not feasible with more than 100K insureds. A sample R dataset is shown below:

dt = read.table(header = TRUE, 
text ="HospitalID   CustID
                1   1
                2   1
                1   2
                3   2
                1   3
                2   3
                3   3
                ")

The output should look like this:

Comb1   Comb2   Score
H1  H2  0.67
H1  H3  some_value
H1  H4  some_value
H2  H3  some_value
H2  H4  some_value
H3  H4  some_value

Answer 1

Here is a base R solution that is very direct:

uniHosp <- unique(dt$HospitalID)
myCombs <- combn(uniHosp, 2)

myOut <- data.frame(Comb1 = paste0("H", myCombs[1, ]),
                    Comb2 = paste0("H", myCombs[2, ]),
                    stringsAsFactors = FALSE)

myHosp <- dt$HospitalID
myCust <- dt$CustID

# For each pair of hospitals, compare the sets of patients they treated
myOut$Jaccard <- sapply(seq_len(ncol(myCombs)), function(x) {
    inA <- myCust[myHosp == myCombs[1, x]]
    inB <- myCust[myHosp == myCombs[2, x]]
    length(intersect(inA, inB))/length(union(inA, inB))
})

myOut
#   Comb1 Comb2   Jaccard
# 1    H1    H2 0.6666667
# 2    H1    H3 0.6666667
# 3    H2    H3 0.3333333

There is probably a much faster approach using data.table or dplyr, but the above should get you started in the right direction.
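For the larger dataset described in the question, one possible alternative (not from the original answer) is a sparse patient-by-hospital incidence matrix, so that all pairwise intersection counts come from a single cross-product. The sketch below assumes the Matrix package is available and that a dense hospital-by-hospital matrix fits in memory, which should hold for about 2000 hospitals. The names inc, shared, sizes, and res are illustrative.

```r
library(Matrix)  # assumption: the Matrix package is installed

dt <- read.table(header = TRUE, text = "HospitalID CustID
1 1
2 1
1 2
3 2
1 3
2 3
3 3")

# Keep each (hospital, patient) pair once so counts are not inflated.
dt <- unique(dt[, c("HospitalID", "CustID")])

hosp <- factor(dt$HospitalID)
cust <- factor(dt$CustID)

# Sparse incidence matrix: rows = patients, columns = hospitals.
inc <- sparseMatrix(i = as.integer(cust), j = as.integer(hosp), x = 1,
                    dims = c(nlevels(cust), nlevels(hosp)))

# Entry [a, b] = number of patients shared by hospitals a and b.
shared <- as.matrix(crossprod(inc))
sizes  <- diag(shared)  # number of patients per hospital

# All hospital pairs, indexed by factor level.
pairs <- combn(nlevels(hosp), 2)
a <- pairs[1, ]
b <- pairs[2, ]
n_shared <- shared[cbind(a, b)]

res <- data.frame(Comb1 = paste0("H", levels(hosp)[a]),
                  Comb2 = paste0("H", levels(hosp)[b]),
                  Score = n_shared / (sizes[a] + sizes[b] - n_shared),
                  stringsAsFactors = FALSE)
print(res)
```

On the sample data this should reproduce the scores from the base R loop; the difference is that the per-pair set operations are replaced by one sparse matrix product.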

Answer 1: 2, accepted, 2018-06-05 19:57:02
