Speeding up the calculation of the sum of pointwise differences in R

Question

Suppose I have two datasets. The first one is:

t1 <- sample(1:10, 10, replace = TRUE)
t2 <- sample(1:10, 10, replace = TRUE)
t3 <- sample(1:10, 10, replace = TRUE)
t4 <- sample(11:20, 10, replace = TRUE)
t5 <- sample(11:20, 10, replace = TRUE)
xtrain <- rbind(t1, t2, t3, t4, t5)
xtrain
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
t1    7    3    9   10    4    9    2    1    6     9
t2    5    1    1    6    5    3   10    2    6     3
t3    8    6    9    7    9    2    3    5    1     8
t4   16   18   14   17   19   20   15   15   20    19
t5   13   14   18   13   11   19   13   17   16    14

The second one is:

t6 <- sample(1:10, 10, replace = TRUE)
t7 <- sample(11:20, 10, replace = TRUE)
xtest <- rbind(t6, t7)
xtest
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
t6    1    5    8    2   10    2    3    4    8     5
t7   14   18   15   12   17   20   17   13   16    17

What I'd like to do is calculate the sum of squared differences (the squared Euclidean distance) between each row of xtest and each row of xtrain. For example:

sum((7-1)^2 + (3-5)^2 + (9-8)^2 + ... + (9-5)^2)
sum((5-1)^2 + (1-5)^2 + (1-8)^2 + ... + (3-5)^2)
...
sum((13-14)^2 + (14-18)^2 + (18-15)^2 + ... + (14-17)^2)
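As a concrete check of the first expression, here is a minimal sketch that spells out every term. The two vectors are copied by hand from the t1 and t6 rows printed above; they are not regenerated with sample(), so the result only applies to those printed values.

```r
# Rows t1 (from xtrain) and t6 (from xtest), copied from the printed matrices.
t1 <- c(7, 3, 9, 10, 4, 9, 2, 1, 6, 9)
t6 <- c(1, 5, 8, 2, 10, 2, 3, 4, 8, 5)

# Element-wise differences are squared first, then summed.
sum((t1 - t6)^2)
# [1] 220
```

This value is what loc[1, 1] should hold for the printed data.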

I currently use two nested for-loops (see below), which I don't think will scale to large data sets:

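# Sum of squared element-wise differences between two vectors
sumPD <- function(vector1, vector2) {
  sum((vector1 - vector2)^2)
}
# One row per xtrain row, one column per xtest row
loc <- matrix(NA, nrow = nrow(xtrain), ncol = nrow(xtest))
for (j in seq_len(nrow(xtest))) {
  for (i in seq_len(nrow(xtrain))) {
    loc[i, j] <- sumPD(xtrain[i, ], xtest[j, ])
  }
}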

I'd like to ask for suggestions on how to make this code more efficient. Thank you in advance! I hope to have a good discussion!

Answer 1

The rdist package has functions for quickly calculating these kinds of pairwise distances:

rdist::cdist(xtrain, xtest)^2

Output:

     [,1] [,2]
[1,]   65 1029
[2,]   94 1324
[3,]  165 1103
[4,] 1189  213
[5,] 1271  191
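If installing an extra package is not an option, the same matrix can probably be obtained in base R from the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b. This is a swapped-in technique, not what rdist does internally as far as this answer states. The seed and the random matrices below are assumptions made only so the sketch is self-contained.

```r
set.seed(1)  # hypothetical seed, only for reproducibility of this sketch
xtrain <- matrix(sample(1:20, 50, replace = TRUE), nrow = 5)
xtest  <- matrix(sample(1:20, 20, replace = TRUE), nrow = 2)

# Squared row norms are added pairwise, then twice the matrix of
# row-by-row dot products is subtracted.
sq_dists <- outer(rowSums(xtrain^2), rowSums(xtest^2), "+") -
  2 * tcrossprod(xtrain, xtest)

# Spot check against the direct definition for one pair of rows
sq_dists[1, 1] == sum((xtrain[1, ] - xtest[1, ])^2)
```

With integer inputs like these the comparison is exact; with real-valued data the subtraction can lose a little precision, so all.equal() is the safer comparison there.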

Answer 2

One option would be outer:

f1 <- Vectorize(function(i, j) sumPD(xtrain[i,], xtest[j,]))
loc2 <- outer(seq_len(nrow(xtrain)), seq_len(nrow(xtest)), f1)
identical(loc, loc2)
#[1] TRUE
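The Vectorize() wrapper is needed because outer() does not call its function once per (i, j) pair. It expands both index vectors to full length and calls the function a single time. The counter function f below is hypothetical and exists only to make that visible.

```r
calls <- 0          # how many times outer() invokes f
lens <- integer(0)  # length of the first argument on each call

f <- function(i, j) {
  calls <<- calls + 1
  lens <<- c(lens, length(i))
  i + j
}

res <- outer(1:5, 1:2, f)
calls  # 1
lens   # 10
```

Since Vectorize() is built on mapply(), sumPD is still evaluated once per pair of rows, so this approach is mainly a tidier way to write the double loop rather than a faster one.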

Answer 3

You could transpose your matrix, use vector difference and a single loop:

ftrain <- t(xtrain)
ftest <- t(xtest)

sapply(seq_len(ncol(ftest)), function(i) {
  colSums((ftrain - ftest[, i])^2)
})

   [,1] [,2]
t1  103 1182
t2  125 1262
t3  130 1121
t4 1478  159
t5 1329  142

colSums is quite efficient, but have a look there if you want more speed.
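The transpose matters because R recycles a vector down the columns of a matrix. A small sketch with made-up values (m and v are illustrative, not taken from the question) shows both the aligned and the misaligned case:

```r
m <- matrix(1:6, nrow = 3)  # columns are (1, 2, 3) and (4, 5, 6)
v <- c(1, 2, 3)

# v lines up with each column, so every column gets its own distance to v.
colSums((m - v)^2)
# [1]  0 27

# With observations in rows, v is still recycled down the columns, so the
# elements are misaligned and the result is silently different.
rowSums((t(m) - v)^2)
# [1]  1 29
```

No warning is raised in the second case because the matrix length is a multiple of the vector length.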

Answer 4

Here are two simple ways.

Using dist, which calculates more distances than needed:

dists <- as.matrix(dist(rbind(xtrain, xtest))^2)
dists <- dists[rownames(xtrain), rownames(xtest)]
dists
     t6   t7
t1  140 1179
t2  134  693
t3  119  974
t4 1028   91
t5 1085   44

Using a simple custom function that works on a matrix X and a vector y:

euclid <- function(X,y) colSums((X-y)^2)
dists  <- mapply(euclid, list(t(xtrain)), split(xtest, row(xtest)))
dists
   [,1] [,2]
t1  140 1179
t2  134  693
t3  119  974
t4 1028   91
t5 1085   44
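The numbers printed in the different answers do not match each other, presumably because sample() was run without a fixed seed each time. A minimal sketch for checking that the approaches agree on the same data follows; the seed value is an arbitrary assumption.

```r
set.seed(1)  # hypothetical seed so every approach sees the same data
xtrain <- rbind(
  t1 = sample(1:10, 10, replace = TRUE),
  t2 = sample(1:10, 10, replace = TRUE),
  t3 = sample(1:10, 10, replace = TRUE),
  t4 = sample(11:20, 10, replace = TRUE),
  t5 = sample(11:20, 10, replace = TRUE)
)
xtest <- rbind(
  t6 = sample(1:10, 10, replace = TRUE),
  t7 = sample(11:20, 10, replace = TRUE)
)

# Reference result: the double loop from the question
loc <- matrix(NA_real_, nrow(xtrain), nrow(xtest))
for (j in seq_len(nrow(xtest))) {
  for (i in seq_len(nrow(xtrain))) {
    loc[i, j] <- sum((xtrain[i, ] - xtest[j, ])^2)
  }
}

# Transpose plus colSums, one pass per test row
by_colsums <- sapply(seq_len(nrow(xtest)), function(j) {
  colSums((t(xtrain) - xtest[j, ])^2)
})

# dist() on the stacked matrix, then keep the train-by-test block
d <- as.matrix(dist(rbind(xtrain, xtest))^2)
by_dist <- d[rownames(xtrain), rownames(xtest)]

# Compare values only; dimnames differ between the approaches
all.equal(loc, by_colsums, check.attributes = FALSE)
all.equal(loc, by_dist, check.attributes = FALSE)
```

all.equal() is used rather than identical() because dist() takes a square root that is squared again afterwards, which may introduce tiny floating-point differences.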

Answer 1: score 3, accepted, 2019-10-03 23:07:39
Answer 2: score 2, 2019-10-03 22:51:47
Answer 3: score 1, 2019-10-03 23:11:57
Answer 4: score 0, 2019-10-03 23:10:52