Choosing the n most distant points in R

Question

Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?

An inefficient method, which probably wouldn't do too well with a big dataset, is to draw random subsets and keep the one with the largest mean pairwise distance (here, 20 points out of 1000):

xy <- cbind(rnorm(1000),rnorm(1000))

n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000){
    subset <- xy[sample(1:nrow(xy),n),]
    avg <- mean(dist(subset))
    if (avg > bestavg) {
        bestavg <- avg
        bestSet <- subset
    }
}

Answer 1

This code, based on Pascal's code, drops the point that has the smallest row sum in the distance matrix, i.e. the point with the least total distance to all the others.

m2 <- function(xy, n){

    subset <- xy

    alldist <- as.matrix(dist(subset))

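    while (nrow(subset) > n) {
        # total distance from each remaining point to all the others
        cdists <- rowSums(alldist)
        # drop the point with the smallest total (first one on ties)
        closest <- which.min(cdists)
        subset <- subset[-closest,]
        alldist <- alldist[-closest,-closest]
    }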
    return(subset)
}

Run on a Gaussian cloud, where m1 is @pascal's function:

> set.seed(310366)
> xy <- cbind(rnorm(1000),rnorm(1000))
> m1s = m1(xy,20)
> m2s = m2(xy,20)

See who did best by looking at the sum of the interpoint distances:

> sum(dist(m1s))
[1] 646.0357
> sum(dist(m2s))
[1] 811.7975

Method 2 wins! And compare with a random sample of 20 points:

> sum(dist(xy[sample(1000,20),]))
[1] 349.3905

which does pretty poorly, as expected.
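A single random draw is a noisy baseline. As a rough sketch, the draw can be repeated to see the typical range; `xy_demo`, the seed, and the 200 repetitions are illustrative choices, not part of the original answer:

```r
# Illustrative only: repeat the random draw to see how much the baseline varies.
set.seed(42)                                 # arbitrary seed, not from the original
xy_demo <- cbind(rnorm(1000), rnorm(1000))   # stand-in for the xy used above

# Sum of interpoint distances for 200 independent random samples of 20 points
random_sums <- replicate(
  200,
  sum(dist(xy_demo[sample(nrow(xy_demo), 20), ]))
)

summary(random_sums)
```

If the sums reported for methods 1 and 2 sit well above the upper end of this summary, the improvement is unlikely to be down to a lucky or unlucky draw.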

So what's going on? Let's plot:

> plot(xy,asp=1)
> points(m2s,col="blue",pch=19)
> points(m1s,col="red",pch=19,cex=0.8)

[Image: scatter plot of xy with the method 2 subset in blue and the method 1 subset in red]

Method 1 generates the red points, which are evenly spaced out over the space. Method 2 creates the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
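The one-dimensional case can be tried directly. The following is a minimal sketch, not code from either answer: the helper names `drop_closest_pair` and `drop_min_rowsum` are hypothetical, and they are assumed to mirror methods 1 and 2 respectively.

```r
# Hypothetical helpers mirroring the two methods, applied to points on a line.

# Method 2 style: repeatedly drop the point with the smallest total distance.
drop_min_rowsum <- function(x, n) {
  d <- as.matrix(dist(x))
  keep <- seq_along(x)
  while (length(keep) > n) {
    i <- which.min(rowSums(d))      # most "central" remaining point
    keep <- keep[-i]
    d <- d[-i, -i, drop = FALSE]
  }
  x[keep]
}

# Method 1 style: repeatedly drop one point of the closest remaining pair.
drop_closest_pair <- function(x, n) {
  d <- as.matrix(dist(x))
  diag(d) <- NA
  d[upper.tri(d)] <- NA             # keep each pair once, in the lower triangle
  keep <- seq_along(x)
  while (length(keep) > n) {
    i <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)[1, 1]
    keep <- keep[-i]
    d <- d[-i, -i, drop = FALSE]
  }
  x[keep]
}

set.seed(1)                         # arbitrary seed
x <- sort(runif(50))
a <- drop_closest_pair(x, 5)
b <- drop_min_rowsum(x, 5)
print(a)
print(b)
```

On a line, the total distance from a point to all the others is smallest near the median, so the row-sum rule keeps removing points from the middle and never removes the two extremes. The closest-pair rule instead thins out crowded stretches, which tends to leave the survivors spread along the whole interval.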

Using a bimodal pattern of initial points also illustrates this:

[Image: the same comparison of methods 1 and 2 on a bimodal point pattern]

and again method 2 produces a much larger total distance than method 1, but both do better than random sampling:

> sum(dist(m1s2))
[1] 958.3518
> sum(dist(m2s2))
[1] 1206.439
> sum(dist(xy2[sample(1000,20),]))
[1] 574.34

Answer 2

Following @Spacedman's suggestion, I have written a function that drops a point from the closest pair until the desired number of points remains. It seems to work well; however, it slows down pretty quickly as you add points.

xy <- cbind(rnorm(1000),rnorm(1000))

n <- 20

subset <- xy

alldist <- as.matrix(dist(subset))
diag(alldist) <- NA
alldist[upper.tri(alldist)] <- NA

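while (nrow(subset) > n) {
    # locate the closest remaining pair (only the lower triangle is non-NA)
    closest <- which(alldist == min(alldist,na.rm=TRUE),arr.ind=TRUE)
    # drop one point of that pair (the row index of the first match)
    subset <- subset[-closest[1,1],]
    alldist <- alldist[-closest[1,1],-closest[1,1]]
}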
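Answer 1 calls this procedure as m1(xy, n), but no such function is defined in either answer. A minimal sketch of what that wrapper might look like, assuming it simply packages the loop above; `demo_xy` and `demo_sel` are illustrative names:

```r
# Assumed reconstruction of m1: the closest-pair loop wrapped in a function.
m1 <- function(xy, n) {
  subset <- xy
  alldist <- as.matrix(dist(subset))
  diag(alldist) <- NA
  alldist[upper.tri(alldist)] <- NA    # keep each pair once

  while (nrow(subset) > n) {
    closest <- which(alldist == min(alldist, na.rm = TRUE), arr.ind = TRUE)
    drop_idx <- closest[1, 1]          # one point of the closest pair
    subset <- subset[-drop_idx, , drop = FALSE]
    alldist <- alldist[-drop_idx, -drop_idx, drop = FALSE]
  }
  subset
}

set.seed(1)                            # arbitrary seed
demo_xy <- cbind(rnorm(200), rnorm(200))
demo_sel <- m1(demo_xy, 20)
dim(demo_sel)
```

The `drop = FALSE` arguments are an addition here so that the result stays a matrix even when only one row is left.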

Answer 1: 9, accepted, 2014-03-04 10:33:19

Answer 2: 0, 2014-03-03 18:22:35
