Computing normalized Euclidean distance in R

Question

The data frame I have is as follows:

binning_data[1:4,]
  person_id  V1  V2  V3  V4    V5  V6  V7  V8    V9 V10 V11 V12 V13 V14 V15 V16
1       312  74  80  NA  87  90.0  85  88  98  96.5  99  94  95  90  90  93 106
2       316  NA  NA 116 106 105.0 110 102 105 105.0 102  98 101  98  92  89  91
3       318  71  61  61  61  60.5  68  62  67  64.0  60  59  60  62  59  63  63
4       319  64  NA  80  80  83.0  84  87  83  85.0  88  87  95  74  70  63  83

I would like to compute the Euclidean distance between a given 'index_person_id' (say 312) and every other person_id, omitting all NAs.

For example: the normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1, V2, V3) because at least one of the two rows has an NA in each of them. It should compute the Euclidean distance over the 4th through 16th bins only and divide by 13 (the number of non-empty bins).
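As a worked check of this example, here is a minimal sketch for the pair 312 and 316, with the two rows typed in by hand from the sample data. The names p312, p316, shared, and n_shared are illustrative only. Note that it averages the squared differences over the shared bins, which is what the sapply code further down computes; no square root is taken.

```r
# Rows for person_id 312 and 316 (V1..V16), copied from the sample data
p312 <- c(74, 80, NA, 87, 90.0, 85, 88, 98, 96.5, 99, 94, 95, 90, 90, 93, 106)
p316 <- c(NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89, 91)

# Bins where both rows have a value
shared <- !is.na(p312) & !is.na(p316)
n_shared <- sum(shared)

# Mean squared difference over the shared bins only
d <- sum((p312[shared] - p316[shared])^2) / n_shared
print(n_shared)
print(round(d, 4))
```

For these two rows there are 13 shared bins and the result is about 146.0192.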

The dimension of binning_data is 10000*17.

The output file should be of size 10000*2 with the first column being the person_id and the second column being the 'normalized Euclidean distance'.

I am currently using sapply for this purpose:

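# Row for the reference person and the positions of its non-NA bins
index_person <- binning_data[binning_data$person_id == index_person_id, ]
non_empty_index_person <- which(!is.na(index_person[2:ncol(index_person)]))

# Column 1: person_id; column 2: distance to the reference person
distance <- data.frame(person_id = binning_data$person_id, distance = NA_real_)
distance[, 2] <- sapply(seq_along(binning_data$person_id), function(j) {
  compare_person <- binning_data[j, ]
  non_empty_compare_person <- which(!is.na(compare_person[2:ncol(compare_person)]))
  # Keep only the bins that are non-NA in both rows
  non_empty <- intersect(non_empty_index_person, non_empty_compare_person)
  distance_temp <- (index_person[non_empty + 1] - compare_person[non_empty + 1])^2
  # unlist() turns the one-row data frame into a numeric vector before averaging
  mean(unlist(distance_temp))
})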

This seems to take a considerable amount of time. Is there a better way to do this?

Answer 1

If I run your code I get:

 0.0000 146.0192 890.9000 200.8750

If you convert your data frame into a matrix and transpose it, you can subtract the index column from every column, square the differences, and then use na.rm=TRUE on the mean to get the distances you want. This can be done over all columns at once using colMeans. Here it is for row II of your sample data:

> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
       1        2        3        4 
  0.0000 146.0192 890.9000 200.8750

Your 10000x2 matrix is then (where here 10000==4):

> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
  II         
1  1   0.0000
2  1 146.0192
3  1 890.9000
4  1 200.8750
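
The cbind call above puts the row index II in the first column. If the first column should instead hold person_id, as the question asks, the same colMeans step could be wrapped roughly as follows. This is a sketch under assumptions: the function name normalized_distances and the hand-typed four-row sample are illustrative, person_id is assumed to be unique, and the mean-of-squared-differences definition from the question's code is kept.

```r
# Four-row sample typed in from the question (one row per person, V1..V16)
vals <- rbind(
  c(74, 80, NA, 87, 90.0, 85, 88, 98, 96.5, 99, 94, 95, 90, 90, 93, 106),
  c(NA, NA, 116, 106, 105.0, 110, 102, 105, 105.0, 102, 98, 101, 98, 92, 89, 91),
  c(71, 61, 61, 61, 60.5, 68, 62, 67, 64.0, 60, 59, 60, 62, 59, 63, 63),
  c(64, NA, 80, 80, 83.0, 84, 87, 83, 85.0, 88, 87, 95, 74, 70, 63, 83)
)
colnames(vals) <- paste0("V", 1:16)
binning_data <- data.frame(person_id = c(312, 316, 318, 319), vals)

# Hypothetical helper: distance from one person to every person
normalized_distances <- function(df, index_person_id) {
  m <- t(as.matrix(df[, -1]))                   # bins in rows, one column per person
  ref <- m[, df$person_id == index_person_id]   # assumes person_id is unique
  # ref is recycled down each column; NAs in either row drop out of the mean
  data.frame(person_id = df$person_id,
             distance = unname(colMeans((m - ref)^2, na.rm = TRUE)))
}

out <- normalized_distances(binning_data, 312)
print(out)
```

A pair of rows with no bins in common would come back as NaN, so that case may need separate handling on the full data.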

If you want to compute this for a given list of indexes, loop over it, for example with lapply, and use rbind to put the pieces back together, this time as a data frame:

II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
   i         d
1  1    0.0000
2  1  146.0192
3  1  890.9000
4  1  200.8750
11 2  146.0192
21 2    0.0000
31 2 1595.0179
41 2  456.7143
12 1    0.0000
22 1  146.0192
32 1  890.9000
42 1  200.8750
13 4  200.8750
23 4  456.7143
33 4  420.8833
43 4    0.0000
14 4  200.8750
24 4  456.7143
34 4  420.8833
44 4    0.0000

That's a data frame with 4 x length(II) rows.
