Similarity in R
I have data on patient IDs and the hospitals where those patients were treated, and I want to calculate the Jaccard similarity between hospitals. Below is the sample data.
HospitalID CustID
1 1
2 1
1 2
4 2
1 3
2 3
3 3
The Jaccard index for (Hospital1, Hospital2) = number of patients treated by both H1 and H2 / number of patients in the union of H1 and H2. For this pair it is 2/(3+2-2). I need to calculate it for every combination of hospitals, i.e. (H1,H2) (H1,H3) (H1,H4) (H2,H4) (H3,H4).
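As a quick check of the formula, here is a minimal sketch for the single pair (H1, H2), with the patient sets typed in by hand from the sample data. The variable names `h1`, `h2`, `shared`, `total` and `jaccard` are illustrative only.

```r
# Patients treated at each hospital, copied from the sample data
h1 <- c(1, 2, 3)
h2 <- c(1, 3)

shared <- length(intersect(h1, h2))  # patients seen by both hospitals
total  <- length(union(h1, h2))      # distinct patients across both hospitals

jaccard <- shared / total
round(jaccard, 2)
```

Here `total` equals 3 + 2 - 2, the size of each set added together minus the overlap, which is the denominator written out above.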
In the real dataset, I have data for more than 2000 hospitals and 100K insureds. There are many R packages that calculate Jaccard distance, but I would have to transpose the data and put the insured IDs in columns, which is not feasible with more than 100K insureds. A sample R dataset is shown below:
dt = read.table(header = TRUE,
text ="HospitalID CustID
1 1
2 1
1 2
3 2
1 3
2 3
3 3
")
The output should look like this:
Comb1 Comb2 Score
H1 H2 0.67
H1 H3 some_value
H1 H4 some_value
H2 H3 some_value
H2 H4 some_value
H3 H4 some_value
Here is a base R solution that is very direct:
# Every unordered pair of hospitals, one pair per column
uniHosp <- unique(dt$HospitalID)
myCombs <- combn(uniHosp, 2)
myOut <- data.frame(Comb1 = paste0("H", myCombs[1, ]),
                    Comb2 = paste0("H", myCombs[2, ]),
                    stringsAsFactors = FALSE)
myHosp <- dt$HospitalID
myCust <- dt$CustID
# Jaccard index per pair: shared patients / distinct patients across both hospitals
myOut$Jaccard <- sapply(seq_len(ncol(myCombs)), function(x) {
    inA <- myCust[myHosp == myCombs[1, x]]
    inB <- myCust[myHosp == myCombs[2, x]]
    length(intersect(inA, inB)) / length(union(inA, inB))
})
myOut
Comb1 Comb2 Jaccard
1 H1 H2 0.6666667
2 H1 H3 0.6666667
3 H2 H3 0.3333333
There is probably a much faster approach using data.table or dplyr, but the above should get you started in the right direction.
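For larger inputs, one possible alternative (a swapped-in technique, not the data.table or dplyr route mentioned above) is to build a sparse patient-by-hospital incidence matrix and get all pairwise intersection counts from a single cross-product. This is only a sketch under stated assumptions: it relies on the Matrix package, it assumes the hospital-by-hospital count matrix fits in memory as a dense matrix, and names such as `m`, `inter` and `res` are illustrative.

```r
library(Matrix)  # assumption: the Matrix package is installed

dt <- read.table(header = TRUE, text = "HospitalID CustID
1 1
2 1
1 2
3 2
1 3
2 3
3 3
")

# Keep each (hospital, patient) membership once so counts stay 0/1
dt <- unique(dt)

hosp <- sort(unique(dt$HospitalID))
cust <- unique(dt$CustID)

# Sparse incidence matrix: rows are patients, columns are hospitals
m <- sparseMatrix(i = match(dt$CustID, cust),
                  j = match(dt$HospitalID, hosp),
                  x = 1,
                  dims = c(length(cust), length(hosp)))

# t(m) %*% m: entry [a, b] is the number of patients shared by hospitals a and b
inter <- as.matrix(crossprod(m))
n <- diag(inter)  # number of patients per hospital

# Index every unordered pair of hospitals
pairs <- combn(length(hosp), 2)
a <- pairs[1, ]
b <- pairs[2, ]
shared <- inter[cbind(a, b)]

res <- data.frame(Comb1 = paste0("H", hosp[a]),
                  Comb2 = paste0("H", hosp[b]),
                  Jaccard = shared / (n[a] + n[b] - shared),
                  stringsAsFactors = FALSE)
res
```

The union size is obtained as |A| + |B| - |A ∩ B|, so no per-pair set operations are needed. On the sample data this gives the same three scores as the base R loop.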