I'm trying to calculate the
1) Euclidean distance, and
2) Mahalanobis distance
for a set of matrices in R. I've been doing it as such:
v1 <- structure(c(0.508, 0.454, 0, 2.156, 0.468, 0.488, 0.682, 1, 1.832,
0.44, 0.928, 0.358, 1, 1.624, 0.484, 0.516, 0.378, 1, 1.512,
0.514, 0.492, 0.344, 0, 1.424, 0.508, 0.56, 0.36, 1, 1.384, 0.776,
1.888, 0.388, 0, 1.464, 0.952, 0.252, 0.498, 1, 1.484, 0.594,
0.256, 0.54, 2, 2.144, 0.402, 0.656, 2.202, 1, 1.696, 0.252),
.Dim = c(5L, 10L),
.Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))
v2 <- structure(c(1.864, 1.864, 1.864, 1.864, 1.864, 1.6, 1.6, 1.6,
1.6, 1.6, 1.536, 1.536, 1.536, 1.536, 1.536, 1.384, 1.384, 1.384,
1.384, 1.384, 6.368, 6.368, 6.368, 6.368, 6.368, 2.792, 2.792,
2.792, 2.792, 2.792, 2.352, 2.352, 2.352, 2.352, 2.352, 2.624,
2.624, 2.624, 2.624, 2.624, 1.256, 1.256, 1.256, 1.256, 1.256,
1.224, 1.224, 1.224, 1.224, 1.224),
.Dim = c(5L, 10L),
.Dimnames = list(NULL, c("KW_1", "KW_2", "KW_3", "KW_4", "KW_5", "KW_6", "KW_7", "KW_8", "KW_9", "KW_10")))
L2 <- sqrt(rowSums((v1-v2)^2)) # Euclidean distance for each row
which provides:
[1] 7.132452 7.568359 7.536904 5.448696 7.163580
That's perfect. But I've heard you can also compute the Euclidean/L2 distance using the following form:
d(x, y) = sqrt((x - y)^T (x - y))
I'd like to calculate my distance this way because the Mahalanobis distance is simply this form with the inverse covariance matrix inserted between the two factors.
I haven't figured out how to code this in R, however. I've tried:
sqrt(crossprod((t(v1)-t(v2))))
and
sqrt((v1-v2) %*% t(v1-v2))
But they just don't give me what I want. Suggestions?
Note - I'm looking to do this as a single operation, not in a loop of any kind. It has to be very fast because I'm doing it over millions of rows multiple times. Maybe it's not possible. I'm open to changing the format of v1 and v2.
You need to apply the formula to each row individually, so something like:
> sapply(1:nrow(v1), function(i) {
+ q = v1[i, ] - v2[i, ]
+ d = sqrt(t(q) %*% q)
+ d
+ })
[1] 7.132452 7.568359 7.536904 5.448696 7.163580
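The per-row quadratic form above can also be written without an explicit loop. The sketch below uses small random matrices a and b as stand-ins for v1 and v2 so that it is self-contained. It shows that the attempted (v1-v2) %*% t(v1-v2) does contain the squared distances, but only on its diagonal, and that rowSums(q * q) gives the same values without building the full matrix. Putting an inverse covariance matrix in the middle gives a row-wise Mahalanobis distance; the covariance S here is an illustrative assumption, not something taken from the question.

```r
set.seed(1)
a <- matrix(runif(50), nrow = 5)   # stand-in for v1 (5 rows, 10 columns)
b <- matrix(runif(50), nrow = 5)   # stand-in for v2
q <- a - b                         # row-wise differences

# Euclidean: the diagonal of q %*% t(q) holds the squared row distances ...
l2_diag <- sqrt(diag(q %*% t(q)))
# ... and rowSums(q * q) computes that diagonal without the off-diagonal work
l2_fast <- sqrt(rowSums(q * q))

# Mahalanobis: same form with the inverse covariance matrix in the middle.
# S is an assumed covariance, estimated here from separate random data.
S <- cov(matrix(rnorm(1000), ncol = 10))
S_inv <- solve(S)
maha_fast <- sqrt(rowSums((q %*% S_inv) * q))

# Cross-check against stats::mahalanobis(), which returns squared distances
maha_ref <- sqrt(mahalanobis(q, center = rep(0, ncol(q)), cov = S))
```

For millions of rows, the rowSums form avoids both the R-level loop and the n-by-n intermediate that q %*% t(q) would allocate.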
If you need something faster you can always try the same thing in C++ (code adapted from here ):
#include <Rcpp.h>
using namespace Rcpp;
double dist2(NumericVector x, NumericVector y){
double d = sqrt( sum( pow(x - y, 2) ) );
return d;
}
// [[Rcpp::export]]
NumericVector calc_l2 (NumericMatrix x, NumericMatrix y){
int out_length = x.nrow();
NumericVector out(out_length);
for (int i = 0 ; i < out_length; i++){
NumericVector v1 = x.row(i);
NumericVector v2 = y.row(i);
double d = dist2(v1, v2);
out(i) = d;
}
return (out) ;
}
Running in R:
library(Rcpp)
sourceCpp("calc_L2.cpp")
calc_l2(v1, v2)
The Rcpp code by Marius becomes about 10 times faster if you inline the function call, but even then it is only about as fast as sqrt(rowSums((m1-m2)^2)):
library(Rcpp)
library(microbenchmark)
sourceCpp("r/calc_L2.cpp") # original by Marius
cppFunction('NumericVector calc_l2_inline(NumericMatrix x,NumericMatrix y){
int nrow=x.nrow();
NumericVector out(nrow);
for(int i=0;i<nrow;i++)out(i)=sqrt(sum(pow(x.row(i)-y.row(i),2)));
return(out);
}')
ncol=10
nrow=1e5
m1=matrix(runif(ncol*nrow),nrow)
m2=matrix(runif(ncol*nrow),nrow)
microbenchmark(times=100,
rowSums={sqrt(rowSums((m1-m2)^2))},
`Rfast::rowsums`={sqrt(Rfast::rowsums((m1-m2)^2))},
Rcpp_original={calc_l2(m1,m2)},
Rcpp_inlined={calc_l2_inline(m1,m2)},
sapply_dotproduct={sapply(1:nrow(m1),function(i){q=m1[i,]-m2[i,];sqrt(q%*%q)})},
sapply_regular={sapply(1:nrow(m1),function(i)sqrt(sum((m1[i,]-m2[i,])^2)))},
for_loop={o=numeric(nrow);for(i in 1:nrow)o[i]=sqrt(sum((m1[i,]-m2[i,])^2))},
mapply={r=row(m1);mapply(function(x,y)sqrt(sum((x-y)^2)),split(m1,r),split(m2,r))}
)
Result:
Unit: milliseconds
expr min lq mean median uq max neval
rowSums 4.295901 4.708260 5.761508 5.461944 6.496243 10.247036 100
Rfast::rowsums 3.004092 3.327411 4.135796 3.451392 5.731450 6.877907 100
Rcpp_original 37.777999 39.480606 51.307351 43.006943 61.729813 176.979826 100
Rcpp_inlined 4.232740 4.283238 4.379944 4.332177 4.400327 5.462128 100
sapply_dotproduct 473.272534 538.187874 615.304276 611.288368 669.466721 875.786952 100
sapply_regular 197.353688 233.303991 275.858154 260.292042 302.541703 536.336035 100
for_loop 130.624967 153.188579 195.365026 190.774655 219.141935 526.906898 100
mapply 603.384269 662.399258 717.631411 695.372090 738.274394 1038.938268 100
This is a fast way to calculate the distance of a vector v to each row of a matrix m:
sqrt(rowSums(m^2)+sum(v^2)-2*m%*%as.matrix(v)[,1])
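The expression works because the squared distance expands as |m_i - v|^2 = |m_i|^2 + |v|^2 - 2 (m_i . v), so a single matrix-vector product replaces the row-by-row subtraction. A small check on assumed random inputs (the names fast and direct are illustrative only):

```r
set.seed(42)
m <- matrix(runif(40), nrow = 8)  # 8 rows, 5 columns
v <- runif(5)                     # one entry per column of m

# Expanded form from the text; drop() turns the 8 x 1 result into a plain vector
fast <- drop(sqrt(rowSums(m^2) + sum(v^2) - 2 * m %*% as.matrix(v)[, 1]))

# Direct form: subtract v from every row, then take the row norms
direct <- sqrt(rowSums(sweep(m, 2, v)^2))

all.equal(fast, direct)
```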
Or, if you have one matrix with m rows and another with n rows, the following is a fast way to calculate the m-by-n distance matrix between every combination of rows (using tcrossprod(m1,m2) instead of m1%*%t(m2) made the code about 1% faster):
sqrt(outer(rowSums(m1^2),rowSums(m2^2),"+")-2*tcrossprod(m1,m2))
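As a rough sanity check, the expanded form can be compared with the corresponding block of a full dist() result. The inputs below are assumed random matrices, and the pmax() guard is an addition of this sketch: because the expansion subtracts nearly equal numbers, rounding can leave a tiny negative value under the square root for (near-)identical rows.

```r
set.seed(7)
m1 <- matrix(runif(30), nrow = 6)  # 6 rows, 5 columns
m2 <- matrix(runif(20), nrow = 4)  # 4 rows, 5 columns

# Expanded form; pmax() clamps tiny negative values caused by rounding
sq <- outer(rowSums(m1^2), rowSums(m2^2), "+") - 2 * tcrossprod(m1, m2)
fast <- sqrt(pmax(sq, 0))

# Reference: all pairwise distances via dist(), keeping only the m1-vs-m2 block
ref <- as.matrix(dist(rbind(m1, m2)))[1:6, 7:10]

all.equal(fast, ref, check.attributes = FALSE)
```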