Fast distance calculation in R

I'm trying to calculate the

1) Euclidean distance, and

2) Mahalanobis distance

for a set of matrices in R. I've been doing it as such:

v1 <- structure(c(0.508, 0.454, 0, 2.156, 0.468, 0.488, 0.682, 1, 1.832, 
            0.44, 0.928, 0.358, 1, 1.624, 0.484, 0.516, 0.378, 1, 1.512, 
            0.514, 0.492, 0.344, 0, 1.424, 0.508, 0.56, 0.36, 1, 1.384, 0.776, 
            1.888, 0.388, 0, 1.464, 0.952, 0.252, 0.498, 1, 1.484, 0.594, 
            0.256, 0.54, 2, 2.144, 0.402, 0.656, 2.202, 1, 1.696, 0.252), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

v2 <- structure(c(1.864, 1.864, 1.864, 1.864, 1.864, 1.6, 1.6, 1.6, 
            1.6, 1.6, 1.536, 1.536, 1.536, 1.536, 1.536, 1.384, 1.384, 1.384, 
            1.384, 1.384, 6.368, 6.368, 6.368, 6.368, 6.368, 2.792, 2.792, 
            2.792, 2.792, 2.792, 2.352, 2.352, 2.352, 2.352, 2.352, 2.624, 
            2.624, 2.624, 2.624, 2.624, 1.256, 1.256, 1.256, 1.256, 1.256, 
            1.224, 1.224, 1.224, 1.224, 1.224), 
          .Dim = c(5L, 10L), 
          .Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))

L2 <- sqrt(rowSums((v1-v2)^2))  # Euclidean distance for each row

which provides:

[1] 7.132452 7.568359 7.536904 5.448696 7.163580

That's perfect. But I've heard you can also compute the Euclidean/L2 distance using the following form:

[image: Euclidean distance in matrix form, d(x, y) = sqrt((x - y)^T (x - y))]

I'd like to calculate my distance this way because the Mahalanobis distance is simply this form with the inverse covariance matrix inserted between the two factors. See this.

I haven't figured out how to code this in R, however. I've tried:

sqrt(crossprod((t(v1)-t(v2))))

and

sqrt((v1-v2) %*% t(v1-v2))

But they just don't give me what I want. Suggestions?

Note - I'm looking to do this as a single operation, not in a loop of any kind. It has to be very fast because I'm doing it over millions of rows multiple times. Maybe it's not possible. I'm open to changing the format of v1 and v2.

You need to apply the formula to each row individually, so something like:

> sapply(1:nrow(v1), function(i) {
+     q = v1[i, ] - v2[i, ]
+     d = sqrt(t(q) %*% q)
+     d
+ })
[1] 7.132452 7.568359 7.536904 5.448696 7.163580
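
The per-row loop above can also be written as one matrix expression, which shows why the two attempts in the question did not work. The sketch below assumes two small made-up matrices, a and b, standing in for v1 and v2. (a - b) %*% t(a - b), which is what crossprod(t(a) - t(b)) also computes, holds the dot product of every row of the difference with every other row. Only its diagonal contains the squared row-wise distances, and rowSums() gives those directly without building the n x n matrix.

```r
# Hypothetical toy inputs with the same shape as the question's data (5 x 10)
set.seed(42)
a <- matrix(runif(50), nrow = 5)
b <- matrix(runif(50), nrow = 5)
q <- a - b

# Full n x n matrix: entry [i, j] is the dot product of rows i and j of q
full <- q %*% t(q)

# Only the diagonal is the squared distance between row i of a and row i of b
l2_from_diag <- sqrt(diag(full))

# Same numbers without computing any off-diagonal entries
l2_from_rowsums <- sqrt(rowSums(q * q))

# Per-row loop, equivalent to the sapply() call in the answer
l2_from_loop <- sapply(seq_len(nrow(q)), function(i) sqrt(sum(q[i, ]^2)))
```

For millions of rows the diag() variant is impractical because the intermediate matrix has n^2 entries; it is shown only to connect the matrix formula to the rowSums() form.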

If you need something faster you can always try the same thing in C++ (code adapted from here ):

#include <Rcpp.h>

using namespace Rcpp;

double dist2(NumericVector x, NumericVector y){
    double d = sqrt( sum( pow(x - y, 2) ) );
    return d;
}

// [[Rcpp::export]]
NumericVector calc_l2 (NumericMatrix x, NumericMatrix y){
    int out_length = x.nrow();
    NumericVector out(out_length);

    for (int i = 0 ; i < out_length; i++){
        NumericVector v1 = x.row(i);
        NumericVector v2 = y.row(i);
        double d = dist2(v1, v2);
        out(i) = d;
    }
    return (out) ;
}

Running in R:

library(Rcpp)

sourceCpp("calc_L2.cpp")
calc_l2(v1, v2)

The Rcpp code by Marius becomes about 10 times faster if you inline the function call, but even then it is only about as fast as sqrt(rowSums((m1-m2)^2)):

library(Rcpp)
library(microbenchmark)  # provides microbenchmark() used below

sourceCpp("r/calc_L2.cpp") # original by Marius

cppFunction('NumericVector calc_l2_inline(NumericMatrix x,NumericMatrix y){
  int nrow=x.nrow();
  NumericVector out(nrow);
  for(int i=0;i<nrow;i++)out(i)=sqrt(sum(pow(x.row(i)-y.row(i),2)));
  return(out);
}')

ncol=10
nrow=1e5
m1=matrix(runif(ncol*nrow),nrow)
m2=matrix(runif(ncol*nrow),nrow)

microbenchmark(times=100,
  rowSums={sqrt(rowSums((m1-m2)^2))},
  `Rfast::rowsums`={sqrt(Rfast::rowsums((m1-m2)^2))},
  Rcpp_original={calc_l2(m1,m2)},
  Rcpp_inlined={calc_l2_inline(m1,m2)},
  sapply_dotproduct={sapply(1:nrow(m1),function(i){q=m1[i,]-m2[i,];sqrt(q%*%q)})},
  sapply_regular={sapply(1:nrow(m1),function(i)sqrt(sum((m1[i,]-m2[i,])^2)))},
  for_loop={o=numeric(nrow);for(i in 1:nrow)o[i]=sqrt(sum((m1[i,]-m2[i,])^2))},
  mapply={r=row(m1);mapply(function(x,y)sqrt(sum((x-y)^2)),split(m1,r),split(m2,r))}
)

Result:

Unit: milliseconds
              expr        min         lq       mean     median         uq         max neval
           rowSums   4.295901   4.708260   5.761508   5.461944   6.496243   10.247036   100
    Rfast::rowsums   3.004092   3.327411   4.135796   3.451392   5.731450    6.877907   100
     Rcpp_original  37.777999  39.480606  51.307351  43.006943  61.729813  176.979826   100
      Rcpp_inlined   4.232740   4.283238   4.379944   4.332177   4.400327    5.462128   100
 sapply_dotproduct 473.272534 538.187874 615.304276 611.288368 669.466721  875.786952   100
    sapply_regular 197.353688 233.303991 275.858154 260.292042 302.541703  536.336035   100
          for_loop 130.624967 153.188579 195.365026 190.774655 219.141935  526.906898   100
            mapply 603.384269 662.399258 717.631411 695.372090 738.274394 1038.938268   100
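
Since the vectorised rowSums() form is among the fastest entries in the table, the Mahalanobis distance from the question can be written in the same style. This is a sketch under assumptions: x, y and the covariance matrix S are made up here, and in practice S would come from your own data. stats::mahalanobis() returns squared distances, so its square root is used as a cross-check.

```r
# Hypothetical inputs: 200 paired rows with 4 columns each
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4)
y <- matrix(rnorm(200 * 4), ncol = 4)

# Assumed covariance estimate; substitute the matrix appropriate to your data
S <- cov(rbind(x, y))
S_inv <- solve(S)

# Row-wise quadratic form q_i^T S^{-1} q_i, with no explicit loop
q <- x - y
maha <- sqrt(rowSums((q %*% S_inv) * q))

# Cross-check: mahalanobis() gives squared distances of rows of q from `center`
maha_ref <- sqrt(mahalanobis(q, center = rep(0, ncol(q)), cov = S))

# With the identity matrix in place of S_inv this reduces to the Euclidean case
l2 <- sqrt(rowSums((q %*% diag(ncol(q))) * q))
```

Inverting S once outside the row-wise work keeps the cost per row to one small matrix product and one row sum.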

This is a fast way to calculate the distance of a vector v to each row of matrix m :

sqrt(rowSums(m^2)+sum(v^2)-2*m%*%as.matrix(v)[,1])
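
The one-liner relies on expanding the squared norm: |m_i - v|^2 = |m_i|^2 + |v|^2 - 2 * (m_i . v). A minimal sketch with made-up inputs compares it with the direct row-wise computation. The pmax() clamp is an addition not in the original answer: rounding can push a distance that should be zero slightly below zero, and sqrt() would then return NaN.

```r
# Hypothetical inputs: 6 points in 3 dimensions and one query vector
set.seed(7)
m <- matrix(runif(6 * 3), nrow = 6)
v <- runif(3)

# Expanded form; drop() turns the 6 x 1 product into a plain vector
sq <- rowSums(m^2) + sum(v^2) - 2 * drop(m %*% v)
d_expanded <- sqrt(pmax(sq, 0))   # clamp tiny negative rounding errors

# Direct form: subtract v from every row, then take row norms
d_direct <- sqrt(rowSums(sweep(m, 2, v)^2))
```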

Or, if you have one matrix with m rows and another with n rows, the following is a fast way to calculate the m-by-n matrix of distances between every pair of rows. Using tcrossprod(m1,m2) instead of m1%*%t(m2) made the code about 1% faster:

sqrt(outer(rowSums(m1^2),rowSums(m2^2),"+")-2*tcrossprod(m1,m2))
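
As a sanity check, the pairwise formula can be compared with base R's dist() on small made-up matrices. The sizes below are arbitrary, and the pmax() guard against slightly negative squared distances is an assumption added here, not part of the answer.

```r
# Hypothetical inputs: 4 and 3 points in 5 dimensions
set.seed(3)
m1 <- matrix(runif(4 * 5), nrow = 4)
m2 <- matrix(runif(3 * 5), nrow = 3)

# Squared distances via |a|^2 + |b|^2 - 2 * (a . b) for every pair of rows
sq <- outer(rowSums(m1^2), rowSums(m2^2), "+") - 2 * tcrossprod(m1, m2)
d_fast <- sqrt(pmax(sq, 0))   # pmax() keeps the matrix shape

# Reference: dist() on the stacked rows, keeping only the m1-versus-m2 block
d_ref <- unname(as.matrix(dist(rbind(m1, m2)))[1:4, 5:7])
```

dist() computes all pairs within the stacked matrix, including m1 against m1, so it does more work than needed and is used here only for verification.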
