
Computing normalized Euclidean distance in R

The data frame I have is as follows:

binning_data[1:4,]
  person_id  V1  V2  V3  V4    V5  V6  V7  V8    V9 V10 V11 V12 V13 V14 V15 V16
1       312  74  80  NA  87  90.0  85  88  98  96.5  99  94  95  90  90  93 106
2       316  NA  NA 116 106 105.0 110 102 105 105.0 102  98 101  98  92  89  91
3       318  71  61  61  61  60.5  68  62  67  64.0  60  59  60  62  59  63  63
4       319  64  NA  80  80  83.0  84  87  83  85.0  88  87  95  74  70  63  83

I would like to compute the Euclidean distance between a given 'index_person_id' (say 312) and every other person_id, omitting all NAs.

For example, the normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1, V2, V3) because at least one of the two rows has an NA in each of them. It should compute the Euclidean distance over the 4th to 16th bins only and divide by 13 (the number of non-empty bins).
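
To make the intended calculation concrete, here is a minimal sketch for the pair 312 and 316 alone. The variable names (p312, p316, shared and so on) are made up for illustration. Note that "Euclidean distance divided by 13" and what the sapply code further down computes (the mean of the squared differences) are not the same quantity, so both are shown.

```r
# Bins V1..V16 for persons 312 and 316, copied from the sample rows above
p312 <- c(74, 80, NA, 87, 90, 85, 88, 98, 96.5, 99, 94, 95, 90, 90, 93, 106)
p316 <- c(NA, NA, 116, 106, 105, 110, 102, 105, 105, 102, 98, 101, 98, 92, 89, 91)

# Keep only the bins where both persons have a value
shared   <- !is.na(p312) & !is.na(p316)
n_shared <- sum(shared)                    # 13 bins (V4..V16)

sq_diff <- (p312[shared] - p316[shared])^2

# Literal reading of the text: Euclidean distance divided by the bin count
euclid_over_n <- sqrt(sum(sq_diff)) / n_shared

# What the sapply code computes: mean of the squared differences
mean_sq_diff <- mean(sq_diff)              # about 146.0192
```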

The dimensions of binning_data are 10000 x 17.

The output file should be of size 10000 x 2, with the first column being the person_id and the second column being the 'normalized Euclidean distance'.

I am currently using sapply for this purpose:

index_person <- binning_data[which(binning_data$person_id == index_person_id), ]
non_empty_index_person <- which(!is.na(index_person[2:ncol(index_person)]))

distance[, 2] <- sapply(seq_along(binning_data$person_id), function(j) {
  compare_person <- binning_data[j, ]
  non_empty_compare_person <- which(!is.na(compare_person[2:ncol(compare_person)]))
  # Bins that are non-NA for both persons
  non_empty <- intersect(non_empty_index_person, non_empty_compare_person)
  distance_temp <- (index_person[non_empty + 1] - compare_person[non_empty + 1])^2
  # unlist() turns the one-row data frame into a numeric vector for mean()
  mean(unlist(distance_temp))
})

This seems to take a considerable amount of time. Is there a better way to do this?

If I run your code I get:

 0.0000 146.0192 890.9000 200.8750

If you convert your data frame into a matrix and transpose it, each person becomes a column. You can then subtract the index person's column from every column, square the result, and take the mean with na.rm=TRUE to get the distances you want. colMeans does this over all columns at once. Here it is for row II of your sample data:

> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
       1        2        3        4 
  0.0000 146.0192 890.9000 200.8750 
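
For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the four sample rows from the question and runs the same calculation. The helper names vals and d are only for illustration. The subtraction works because m[, II] has the same length as the number of rows of m, so R's recycling lines it up with each column in turn.

```r
# Rebuild the four sample rows from the question (one person per row)
vals <- matrix(c(
  74, 80,  NA,  87,  90.0,  85,  88,  98,  96.5,  99, 94,  95, 90, 90, 93, 106,
  NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89,  91,
  71, 61,  61,  61,  60.5,  68,  62,  67,  64.0,  60, 59,  60, 62, 59, 63,  63,
  64, NA,  80,  80,  83.0,  84,  87,  83,  85.0,  88, 87,  95, 74, 70, 63,  83
), nrow = 4, byrow = TRUE, dimnames = list(NULL, paste0("V", 1:16)))
binning_data <- data.frame(person_id = c(312, 316, 318, 319), vals)

II <- 1                                   # row of the index person (312)
m  <- t(as.matrix(binning_data[, -1]))    # 16 bins x 4 persons

# m[, II] has length nrow(m), so recycling subtracts it from every column;
# na.rm = TRUE drops bins where either person has an NA
d <- colMeans((m - m[, II])^2, na.rm = TRUE)
print(d)
```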

Your 10000x2 matrix is then (where here 10000==4):

> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
  II         
1  1   0.0000
2  1 146.0192
3  1 890.9000
4  1 200.8750
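
The question asks for person_id in the first column rather than the row index. A possible variation, sketched here on the assumption that person_id values are unique, looks the row up from index_person_id and returns a data frame. The result name distance and its column names are illustrative choices, not anything fixed by the question.

```r
# Sample rows from the question (one person per row)
vals <- matrix(c(
  74, 80,  NA,  87,  90.0,  85,  88,  98,  96.5,  99, 94,  95, 90, 90, 93, 106,
  NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89,  91,
  71, 61,  61,  61,  60.5,  68,  62,  67,  64.0,  60, 59,  60, 62, 59, 63,  63,
  64, NA,  80,  80,  83.0,  84,  87,  83,  85.0,  88, 87,  95, 74, 70, 63,  83
), nrow = 4, byrow = TRUE, dimnames = list(NULL, paste0("V", 1:16)))
binning_data <- data.frame(person_id = c(312, 316, 318, 319), vals)

index_person_id <- 312
# Assumption: person_id is unique, so this yields exactly one row index
II <- which(binning_data$person_id == index_person_id)

m <- t(as.matrix(binning_data[, -1]))

# One row per person: the id, then the mean squared difference to the index person
distance <- data.frame(
  person_id = binning_data$person_id,
  distance  = colMeans((m - m[, II])^2, na.rm = TRUE)
)
print(distance)
```

If two persons share no non-NA bin at all, colMeans returns NaN for that pair, which may need handling on the real data.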

If you want to compute this for a given list of indexes, loop over it, perhaps with lapply, and use rbind to put it all back together, this time as a data frame for a change:

II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
   i         d
1  1    0.0000
2  1  146.0192
3  1  890.9000
4  1  200.8750
11 2  146.0192
21 2    0.0000
31 2 1595.0179
41 2  456.7143
12 1    0.0000
22 1  146.0192
32 1  890.9000
42 1  200.8750
13 4  200.8750
23 4  456.7143
33 4  420.8833
43 4    0.0000
14 4  200.8750
24 4  456.7143
34 4  420.8833
44 4    0.0000

That's a data frame with 4 * length(II) rows.
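
If every person needs to be compared with every other person, the same idea could be looped over all columns with sapply to give a square matrix instead of a long data frame. This is only a sketch on the sample rows; the name D is made up, and with 10000 persons the result would be a 10000 x 10000 numeric matrix (roughly 800 MB), so it may not be practical at full size.

```r
# Sample rows from the question (one person per row)
vals <- matrix(c(
  74, 80,  NA,  87,  90.0,  85,  88,  98,  96.5,  99, 94,  95, 90, 90, 93, 106,
  NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89,  91,
  71, 61,  61,  61,  60.5,  68,  62,  67,  64.0,  60, 59,  60, 62, 59, 63,  63,
  64, NA,  80,  80,  83.0,  84,  87,  83,  85.0,  88, 87,  95, 74, 70, 63,  83
), nrow = 4, byrow = TRUE, dimnames = list(NULL, paste0("V", 1:16)))
binning_data <- data.frame(person_id = c(312, 316, 318, 319), vals)

m <- t(as.matrix(binning_data[, -1]))

# Column i of D holds the distances from person i to every person
D <- sapply(seq_len(ncol(m)), function(i) {
  colMeans((m - m[, i])^2, na.rm = TRUE)
})

# Label rows and columns by person_id for easier lookup
ids <- as.character(binning_data$person_id)
dimnames(D) <- list(ids, ids)

D["316", "318"]   # about 1595.0179, matching the lapply output above
```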
