Calculating all distances between one point and a group of points efficiently in R
First of all, I am new to R (I started yesterday).
I have two groups of points, data and centers, the first of size n and the second of size K (for instance, n = 3823 and K = 10), and for each i in the first set, I need to find the j in the second set with the minimum distance.
My idea is simple: for each i, let dist[j] be the distance between i and j; then I only need to use which.min(dist) to find what I am looking for.
Each point is an array of 64 doubles, so
> dim(data)
[1] 3823 64
> dim(centers)
[1] 10 64
I have tried with
for (i in 1:n) {
  for (j in 1:K) {
    d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
  }
  S[i] <- which.min(d)
}
which is extremely slow (with n = 200, it takes more than 40s!). The fastest solution that I wrote is
distance <- function(point, group) {
  return(dist(t(array(c(point, t(group)), dim=c(ncol(group), 1+nrow(group)))))[1:nrow(group)])
}

for (i in 1:n) {
  d <- distance(data[i,], centers)
  which.min(d)
}
Even if it does a lot of computation that I don't use (because dist(m) computes the distances between all rows of m), it is way faster than the other one (can anyone explain why?), but it is not fast enough for what I need, because it will not be used only once. Also, the distance code is very ugly. I tried to replace it with
distance <- function(point, group) {
  return(dist(rbind(point, group))[1:nrow(group)])
}
but this seems to be twice as slow. I also tried to use dist for each pair, but that is also slower.
I don't know what to do now. It seems like I am doing something very wrong. Any idea on how to do this more efficiently?
PS: I need this to implement k-means by hand (and I do need to do it; it is part of an assignment). I believe I will only need Euclidean distance, but I am not yet sure, so I would prefer to have some code where the distance computation can be replaced easily.
stats::kmeans does all the computation in less than one second.
Rather than iterating across data points, you can condense that to a matrix operation, meaning you only have to iterate across K.
# Generate some fake data.
n <- 3823
K <- 10
d <- 64
x <- matrix(rnorm(n * d), ncol = n)
centers <- matrix(rnorm(K * d), ncol = K)
system.time(
  dists <- apply(centers, 2, function(center) {
    colSums((x - center)^2)
  })
)
Runs in:
utilisateur système écoulé
0.100 0.008 0.108
on my laptop.
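The resulting dists matrix is n x K (one row of squared distances per data point), so the which.min step from the question also vectorizes. A minimal sketch, with the simulated data regenerated here so the snippet runs on its own:

```r
set.seed(1)
n <- 3823; K <- 10; d <- 64
x <- matrix(rnorm(n * d), ncol = n)          # one point per column, as above
centers <- matrix(rnorm(K * d), ncol = K)

# n x K matrix of squared distances, one column per center
dists <- apply(centers, 2, function(center) colSums((x - center)^2))

# nearest center for each point; negate because max.col finds row maxima
clusters <- max.col(-dists)
```

Squared distances are enough here, since sqrt is monotone and does not change which center is closest.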
rdist() is an R function from the {fields} package that can quickly calculate the distances between two sets of points given in matrix form.
https://www.image.ucar.edu/~nychka/Fields/Help/rdist.html
Usage:
library(fields)
#generating fake data
n <- 5
m <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = d)
y <- matrix(rnorm(m * d), ncol = d)
rdist(x, y)
[,1] [,2] [,3] [,4] [,5]
[1,] 1.512383 3.053084 3.1420322 4.942360 3.345619
[2,] 3.531150 4.593120 1.9895867 4.212358 2.868283
[3,] 1.925701 2.217248 2.4232672 4.529040 2.243467
[4,] 2.751179 2.260113 2.2469334 3.674180 1.701388
[5,] 3.303224 3.888610 0.5091929 4.563767 1.661411
[6,] 3.188290 3.304657 3.6668867 3.599771 3.453358
[7,] 2.891969 2.823296 1.6926825 4.845681 1.544732
[8,] 2.987394 1.553104 2.8849988 4.683407 2.000689
[9,] 3.199353 2.822421 1.5221291 4.414465 1.078257
[10,] 2.492993 2.994359 3.3573190 6.498129 3.337441
You may want to have a look into the apply functions.
For instance, this code
for (j in 1:K) {
  d[j] <- sqrt(sum((centers[j,] - data[i,])^2))
}
can easily be substituted by something like
dt <- data[i,]
d <- apply(centers, 1, function(x) { sqrt(sum((x - dt)^2)) })
You can definitely optimise it more, but you get the point, I hope.
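Dropped into the question's outer loop, the substitution looks like this (a sketch with simulated data in place of the question's; note the parentheses around x - dt, which the squared difference needs):

```r
set.seed(1)
data <- matrix(rnorm(200 * 64), ncol = 64)     # n = 200 points, 64 doubles each
centers <- matrix(rnorm(10 * 64), ncol = 64)   # K = 10 centers

S <- integer(nrow(data))
for (i in seq_len(nrow(data))) {
  dt <- data[i, ]
  # distance from point i to every center, one apply call per point
  d <- apply(centers, 1, function(x) sqrt(sum((x - dt)^2)))
  S[i] <- which.min(d)
}
```

This removes the inner loop over j but still loops over the points, so it is a readability improvement more than a full vectorization.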
dist works fast because it's vectorized and calls internal C functions.
Your code in the loop could be vectorized in many ways.
For example, to compute the distances between data and centers you could use outer:
diff_ij <- function(i, j) sqrt(rowSums((data[i, ] - centers[j, ])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
This gives you an n x K matrix of distances, and it should be way faster than the loop.
Then you could use max.col to find the maximum in each row (see the help page; there are some nuances when there are many maxima). X must be negated because we search for the minimum.
CL <- max.col(-X)
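Assembled into a self-contained script (with simulated data standing in for the question's), and sanity-checked against the question's double loop on the first few points:

```r
set.seed(1)
n <- 3823; K <- 10
data <- matrix(rnorm(n * 64), ncol = 64)
centers <- matrix(rnorm(K * 64), ncol = 64)

# outer calls diff_ij once with full index vectors, so the whole
# n x K distance matrix is built from a single vectorized expression
diff_ij <- function(i, j) sqrt(rowSums((data[i, ] - centers[j, ])^2))
X <- outer(seq_len(n), seq_len(K), diff_ij)
CL <- max.col(-X)   # nearest center per point

# check the first 5 points against the question's loop
for (i in 1:5) {
  d <- sapply(1:K, function(j) sqrt(sum((centers[j, ] - data[i, ])^2)))
  stopifnot(CL[i] == which.min(d))
}
```

One caveat: outer materializes (n * K) x 64 intermediate matrices inside diff_ij, so memory grows with n * K.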
To be efficient in R you should vectorize as much as possible. Loops can in many cases be replaced by a vectorized substitute. Check the help for rowSums (which also describes rowMeans, colSums, colMeans), pmax and cumsum. You could search SO, e.g. https://stackoverflow.com/search?q=[r]+avoid+loop (copy & paste this link; I don't know how to make it clickable), for some examples.
My solution:
# data is a matrix where each row is a point
# point is a vector of values
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}
You can try it, like:
x <- matrix(rnorm(25), ncol=5)
euc.dist(x, x[1,])
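For the nearest-center search in the question, euc.dist can be applied once per center; the sapply over centers and the per-row which.min below are assumptions beyond the answer itself, sketched with simulated data:

```r
euc.dist <- function(data, point) {
  apply(data, 1, function(row) sqrt(sum((point - row)^2)))
}

set.seed(1)
data <- matrix(rnorm(3823 * 64), ncol = 64)
centers <- matrix(rnorm(10 * 64), ncol = 64)

# one column of distances per center, then the per-point minimum
D <- sapply(seq_len(nrow(centers)), function(j) euc.dist(data, centers[j, ]))
S <- apply(D, 1, which.min)   # nearest center for each point
```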