
Calculating all distances between one point and a group of points efficiently in R

First of all, I am new to R (I started yesterday).

I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find the j in the second set at minimum distance.

My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need which.min(dist) to find what I am looking for.

Each point is an array of 64 doubles, so

> dim(data)
[1] 3823   64
> dim(centers)
[1] 10 64

I have tried with

for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}

which is extremely slow (with n = 200, it takes more than 40 s!). The fastest solution that I wrote is

distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}

Even though it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is much faster than the first version (can anyone explain why?). It is still not fast enough for my needs, though, because it will be run many times. The distance code is also very ugly. I tried to replace it with

distance <- function(point, group) {
  return (dist(rbind(point,group))[1:nrow(group)])
}

but this seems to be twice as slow. I also tried calling dist on each pair, but that is slower too.

I don't know what to do now. It seems like I am doing something very wrong. Any idea how to do this more efficiently?

PS: I need this to implement k-means by hand (I have to; it is part of an assignment). I believe I will only need Euclidean distance, but I am not sure yet, so I would prefer code where the distance computation can be replaced easily. stats::kmeans does the whole computation in less than one second.

Rather than iterating across data points, you can condense that into a matrix operation, meaning you only have to iterate across K.

# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
})
)

Runs in:

utilisateur     système      écoulé 
      0.100       0.008       0.108 

on my laptop.
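To go from a squared-distance matrix like this to cluster assignments, one possible next step is to take, for each row, the column index of the smallest entry. This is only a sketch: the names `sq_dists` and `nearest` are made up for illustration, and the square root is skipped because it does not change which center is closest.

```r
# Same layout as above: one point per column (d x n), one center per column (d x K).
set.seed(1)
n <- 3823; K <- 10; d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)

# n x K matrix of squared Euclidean distances.
# x - center recycles the length-d vector down each column of x.
sq_dists <- apply(centers, 2, function(center) colSums((x - center)^2))

# Index of the nearest center for every point (hypothetical name).
nearest <- apply(sq_dists, 1, which.min)
```

If a different metric is needed, only the expression inside the anonymous function has to change, as long as it still returns one value per column of x.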

rdist() is an R function from the {fields} package that quickly computes the distances between two sets of points given as matrices.

https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html

Usage:

library(fields)
#generating fake data
n <- 5
m <- 10
d <- 3

x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)

rdist(y, x)  # rows index y (m = 10), columns index x (n = 5), matching the output below
          [,1]     [,2]      [,3]     [,4]     [,5]
 [1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
 [2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
 [3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
 [4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
 [5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
 [6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
 [7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
 [8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
 [9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441

You may want to have a look at the apply functions.

For instance, this code

for (j in 1:K)
    {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
    }

can easily be replaced by something like

dt <- data[i,]
# square each difference before summing, then take the root
d <- apply(centers, 1, function(x) sqrt(sum((x - dt)^2)))

You can definitely optimise it further, but I hope you get the point.

dist is fast because it is vectorized and calls internal C functions.
The code in your loop could be vectorized in many ways.

For example, to compute the distances between data and centers you could use outer:

diff_ij <- function(i,j) sqrt(rowSums((data[i,]-centers[j,])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)

This gives you an n x K matrix of distances, and it should be much faster than the loop.

Then you could use max.col to find the maximum in each row (see its help page; there are some nuances when a row has several maxima). X must be negated because we are searching for the minimum.

CL <- max.col(-X)
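The nuance with ties is that max.col breaks them at random by default, so repeated runs can give different indices when two centers are equally close. A small sketch with a made-up toy matrix shows how ties.method = "first" makes the result deterministic and consistent with which.min:

```r
# Toy distance matrix: rows are points, columns are centers (values made up).
X <- rbind(c(2, 1, 1),   # tie between centers 2 and 3
           c(0, 5, 3),
           c(4, 4, 1))

# ties.method = "first" picks the lowest column index among ties,
# which is the same rule which.min() uses.
CL <- max.col(-X, ties.method = "first")

# Row-by-row reference result for comparison.
CL_check <- apply(X, 1, which.min)
```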

To be efficient in R you should vectorize as much as possible. In many cases loops can be replaced by a vectorized equivalent. Check the help for rowSums (which also describes rowMeans, colSums and colMeans), pmax and cumsum. You could also search SO for some examples, e.g. https://stackoverflow.com/search?q=[r]+avoid+loop (copy and paste this link; I don't know how to make it clickable).
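As one illustration of pushing the work into vectorized calls, the squared Euclidean distance can be expanded as |x - c|^2 = |x|^2 + |c|^2 - 2 x.c, which turns the whole computation into a single matrix product. This is a different technique from the outer approach above and works only for Euclidean distance; it is a sketch in which data and centers are random stand-ins with one point per row.

```r
set.seed(42)
n <- 3823; K <- 10
data    <- matrix(rnorm(n * 64), nrow = n)  # one point per row
centers <- matrix(rnorm(K * 64), nrow = K)  # one center per row

# n x K matrix of squared distances from the expansion
# |x|^2 + |c|^2 - 2 * x.c; tcrossprod(data, centers) is data %*% t(centers).
sq <- outer(rowSums(data^2), rowSums(centers^2), "+") -
  2 * tcrossprod(data, centers)

# Rounding error can produce tiny negative values, so clamp before sqrt.
X <- sqrt(pmax(sq, 0))

# Nearest center per point, with deterministic tie-breaking.
CL <- max.col(-X, ties.method = "first")
```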

My solution:

# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function (row) sqrt(sum((point - row) ^ 2)))
}

You can try it like this:

x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
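Since the question asks for code where the distance can be swapped, one way to build on a function like this is to pass the distance function as an argument. The following is a sketch under that assumption; nearest_center and manhattan are hypothetical names, not part of the original posts.

```r
# Euclidean distance from every row of `data` to `point`.
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}

# Hypothetical alternative metric with the same interface.
manhattan <- function(data, point) {
  apply(data, 1, function(row) sum(abs(point - row)))
}

# For each row of `data`, the index of the closest row of `centers`.
# `dist_fun(data, point)` must return one distance per row of `data`.
nearest_center <- function(data, centers, dist_fun = euc.dist) {
  # Loop over the K centers only; each call handles all n points.
  d <- sapply(seq_len(nrow(centers)), function(j) dist_fun(data, centers[j, ]))
  d <- matrix(d, nrow = nrow(data))  # keep the n x K shape even when n is 1
  apply(d, 1, which.min)
}

set.seed(1)
pts <- matrix(rnorm(200 * 5), ncol = 5)
ctr <- pts[c(3, 50, 120), ]          # use three of the points as centers
cl_euc <- nearest_center(pts, ctr)
cl_man <- nearest_center(pts, ctr, dist_fun = manhattan)
```

The apply call inside each distance function is still an R-level loop over rows, so this favours flexibility over speed.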
