
Weighted Euclidean Distance in R

I'd like to create a distance matrix of weighted Euclidean distances from a data frame. The weights will be defined in a vector, one weight per column. Here's an example:

library("cluster")

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df <- data.frame(a,b,c)

weighting <- c(1, 2, 3)

dm <- as.matrix(daisy(df, metric = "euclidean", weights = weighting))

I've searched everywhere and can't find a package or solution for this in R. The 'daisy' function in the 'cluster' package claims to support weighting, but the weights don't seem to be applied: it just returns ordinary Euclidean distances.

Any ideas, Stack Overflow?

We can use @WalterTross' technique of scaling by multiplying each column by the square root of its respective weight first:

newdf <- sweep(df, 2, weighting, function(x,y) x * sqrt(y))
as.matrix(daisy(newdf, metric="euclidean"))
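
The square root can be motivated as follows. This is a sketch that assumes the intended weighted distance puts each non-negative weight w_k on the squared difference of column k (the question does not spell out the definition). Under that assumption, scaling column k by the square root of w_k and then taking the ordinary Euclidean distance gives the same value:

```latex
d_w(x, y) = \sqrt{\sum_{k=1}^{p} w_k \,(x_k - y_k)^2}
          = \sqrt{\sum_{k=1}^{p} \left(\sqrt{w_k}\,x_k - \sqrt{w_k}\,y_k\right)^2}
```

Here p is the number of columns, and x and y are two rows of the data frame.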

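If you would like more control over, and a better understanding of, how the Euclidean distance is computed, we can write a custom function. Note that it uses a different weighting convention: each coordinate difference is multiplied by its weight before squaring, so the weights enter the sum squared: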

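# All (i, j) pairs of row indices
xpand <- function(d) do.call("expand.grid", rep(list(seq_len(nrow(d))), 2))
# Euclidean norm of a vector
euc_norm <- function(x) sqrt(sum(x^2))
# Pairwise distances between rows; weights multiply the coordinate differences
euc_dist <- function(mat, weights=1) {
  iter <- xpand(mat)
  vec <- mapply(function(i,j) euc_norm(weights*(mat[i,] - mat[j,])),
                iter[,1], iter[,2])
  matrix(vec, nrow(mat), nrow(mat))
}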

We can test the result by checking against the daisy function:

#test1
as.matrix(daisy(df, metric="euclidean"))
#          1        2        3        4        5
# 1 0.000000 1.732051 4.898979 5.196152 6.000000
# 2 1.732051 0.000000 3.316625 3.464102 4.358899
# 3 4.898979 3.316625 0.000000 1.732051 3.464102
# 4 5.196152 3.464102 1.732051 0.000000 1.732051
# 5 6.000000 4.358899 3.464102 1.732051 0.000000

euc_dist(df)
#          [,1]     [,2]     [,3]     [,4]     [,5]
# [1,] 0.000000 1.732051 4.898979 5.196152 6.000000
# [2,] 1.732051 0.000000 3.316625 3.464102 4.358899
# [3,] 4.898979 3.316625 0.000000 1.732051 3.464102
# [4,] 5.196152 3.464102 1.732051 0.000000 1.732051
# [5,] 6.000000 4.358899 3.464102 1.732051 0.000000

I doubt Walter's method for two reasons. First, I've never seen weights applied by their square root; it's usually 1/w. Second, when I apply your weights to my function, I get a different result:

euc_dist(df, weights=weighting) 
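
The difference between the two results comes down to where the weights sit relative to the square. The following is a minimal sketch, not part of the original answer, that puts both conventions side by side on the question's data. It uses base R's `dist` instead of `daisy` so that it runs without extra packages; the names `m`, `d_sqrt`, `d_raw`, and `diff12` are made up for illustration.

```r
# Same example data as in the question.
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(5, 4, 3, 2, 1),
                 c = c(5, 4, 1, 2, 3))
weighting <- c(1, 2, 3)
m <- as.matrix(df)

# Convention 1: weights multiply the squared differences,
# implemented by scaling each column by sqrt(weight).
d_sqrt <- as.matrix(dist(sweep(m, 2, sqrt(weighting), "*")))

# Convention 2: weights multiply the raw differences,
# which is what euc_dist(df, weights = weighting) computes.
d_raw <- as.matrix(dist(sweep(m, 2, weighting, "*")))

# Spot check rows 1 and 2 against the two definitions.
diff12 <- m[1, ] - m[2, ]   # (-1, 1, 1)
stopifnot(isTRUE(all.equal(d_sqrt[1, 2], sqrt(sum(weighting * diff12^2)))))   # sqrt(6)
stopifnot(isTRUE(all.equal(d_raw[1, 2],  sqrt(sum((weighting * diff12)^2))))) # sqrt(14)
```

Under these definitions the second convention with weights w gives the same distances as the first with weights w^2, so neither result is wrong; they answer slightly different questions. Which one to use depends on whether the weights are meant to apply to the squared differences or to the differences themselves.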
