
Nearest neighbor search in 2D using a grid partitioning

Original: 2013-04-04 19:39:29 0 4 algorithm/ geometry/ nearest-neighbor

I have a fairly large set of 2D points (~20000), and for each point in the xy plane I want to determine which point from the set is closest. (Actually, the points are of different types, and I just want to know which type is closest. And the xy plane is a bitmap, say 640x480.)

From this answer to the question "All k nearest neighbors in 2D, C++" I got the idea to make a grid. I created n*m C++ vectors and put each point in the vector of the bin it falls into. The idea is that you only have to check the distance of the points in the bin, instead of all points. If there is no point in the bin, you continue with the adjacent bins in a spiralling manner.

Unfortunately, I only read Oli Charlesworth's comment afterwards:

Not just adjacent, unfortunately (consider that points in the cell two to the east may be closer than points in the cell directly north-east, for instance; this problem gets much worse in higher dimensions). Also, what if the neighbouring cells happen to have less than 10 points in them? In practice, you will need to "spiral out".

Fortunately, I already had the spiraling code figured out (a nice C++ version here, and there are other versions in the same question). But I'm still left with the problem:

If I find a hit in a cell, there could be a closer hit in an adjacent cell (yellow is my probe, red is the wrong choice, green the actual closest point):
If I find a hit in an adjacent cell, there could be a closer hit in a cell 2 steps away, as Oli Charlesworth remarked:
But even worse, if I find a hit in a cell two steps away, there could still be a closer hit in a cell three steps away! That means I'd have to consider all cells with dx,dy = -3...3, or 49 cells!

Now, in practice this won't happen often, because I can choose my bin size so the cells are filled enough. Still, I'd like to have a correct result, without iterating over all points.

So how do I find out when to stop "spiralling" or searching? I heard there is an approach with multiple overlapping grids, but I didn't quite understand it. Is it possible to salvage this grid technique?

4 answers

Since the dimensions of your bitmap are not large and you want to calculate the closest point for every (x,y), you can use dynamic programming.

Let V[i][j] be the distance from (i,j) to the closest point in the set, but considering only the points in the set that are in the "rectangle" [(1, 1), (i, j)].

Then V[i][j] = 0 if there is a point at (i, j); otherwise V[i][j] = min(V[i'][j'] + dist((i, j), (i', j'))), where (i', j') ranges over the three neighbours of (i,j):

i.e.

(i - 1, j)
(i, j - 1)
(i - 1, j - 1)

This gives you the minimum distance, but only for the "upper left" rectangle. We do the same for the "upper right", "lower left", and "lower right" orientations, and then take the minimum.

The complexity is O(size of the plane), which is optimal.
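The four-orientation recurrence above could be sketched roughly as follows. This is only an illustration under assumptions: the function name `distanceMap` and the flat row-major `isPoint` mask are hypothetical, and the three-neighbour steps are given lengths 1, 1 and √2. Because the recurrence adds up step lengths along a path, the result is a chamfer-style distance that can overestimate the true Euclidean one (for an offset of (2, 1) it yields 1 + √2 ≈ 2.41 instead of √5 ≈ 2.24), so treat it as an approximation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Returns dist[y * w + x]: approximate distance from pixel (x, y) to the
// nearest pixel marked in isPoint (infinity if no pixel is marked).
std::vector<double> distanceMap(const std::vector<bool>& isPoint, int w, int h) {
    assert(w > 0 && h > 0 && isPoint.size() == static_cast<std::size_t>(w) * h);
    const double inf = std::numeric_limits<double>::infinity();
    const double diag = std::sqrt(2.0);
    std::vector<double> result(isPoint.size(), inf);

    // One sweep per orientation; (sx, sy) = (+1, +1) starts in the upper left.
    for (int sy : {1, -1}) {
        for (int sx : {1, -1}) {
            std::vector<double> v(isPoint.size(), inf);
            for (int j = 0; j < h; ++j) {
                const int y = (sy > 0) ? j : h - 1 - j;
                for (int i = 0; i < w; ++i) {
                    const int x = (sx > 0) ? i : w - 1 - i;
                    double best = isPoint[y * w + x] ? 0.0 : inf;
                    // The three neighbours this sweep has already visited.
                    const int px = x - sx, py = y - sy;
                    const bool okx = px >= 0 && px < w;
                    const bool oky = py >= 0 && py < h;
                    if (okx) best = std::min(best, v[y * w + px] + 1.0);
                    if (oky) best = std::min(best, v[py * w + x] + 1.0);
                    if (okx && oky) best = std::min(best, v[py * w + px] + diag);
                    v[y * w + x] = best;
                    // Combine the four orientations by taking the minimum.
                    result[y * w + x] = std::min(result[y * w + x], best);
                }
            }
        }
    }
    return result;
}
```

To recover which type is closest rather than the distance, a parallel label array could be carried along and copied from whichever neighbour wins the minimum.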

For your task, a point quadtree is usually used, especially when the points are not evenly distributed.

To save main memory, you can also use a PM or PMR quadtree, which uses buckets.

You search in your cell and, in the worst case, in all quad cells surrounding that cell.

You can also use a k-d tree.

One solution would be to construct multiple partitionings with different grid sizes.

Assume you create partitions with cell sizes 1, 2, 4, 8, ...

Now, search for a point in the grid with cell size 1 (you are basically searching the 9 squares around the query). If there is a point in the search area and the distance to that point is less than 1, stop. Otherwise move on to the next grid size, where the same test is made against that grid's cell size.

The number of grids you need to construct is about twice as many as when creating just one level of partitioning.
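A minimal sketch of this multi-level search, under stated assumptions: the class name `MultiGrid`, the `extent` parameter, and the use of a hash map per level (instead of dense arrays) are choices made here for illustration, and all points and queries are assumed to lie in [0, extent) x [0, extent). The stop test works because the 3x3 block around the query's cell always contains every point within one cell size of the query.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

struct Point { double x, y; int type; };

class MultiGrid {
public:
    // Builds levels with cell sizes 1, 2, 4, ... until one cell spans `extent`.
    MultiGrid(std::vector<Point> pts, double extent) : pts_(std::move(pts)) {
        assert(extent > 0);
        for (double s = 1.0;; s *= 2.0) {
            Level lv{s, {}};
            for (int i = 0; i < static_cast<int>(pts_.size()); ++i)
                lv.cells[key(cellOf(pts_[i].x, s), cellOf(pts_[i].y, s))].push_back(i);
            levels_.push_back(std::move(lv));
            if (s >= extent) break;
        }
    }

    // Index of the nearest point, or -1 if there are no points.
    int nearest(double qx, double qy) const {
        int best = -1;
        double bestD2 = std::numeric_limits<double>::infinity();
        for (const Level& lv : levels_) {
            const int cx = cellOf(qx, lv.size), cy = cellOf(qy, lv.size);
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    auto it = lv.cells.find(key(cx + dx, cy + dy));
                    if (it == lv.cells.end()) continue;
                    for (int i : it->second) {
                        const double ex = pts_[i].x - qx, ey = pts_[i].y - qy;
                        const double d2 = ex * ex + ey * ey;
                        if (d2 < bestD2) { bestD2 = d2; best = i; }
                    }
                }
            // The 3x3 block holds every point within lv.size of the query,
            // so a hit no farther than that cannot be beaten by an unseen point.
            if (best >= 0 && bestD2 <= lv.size * lv.size) return best;
        }
        return best;  // the coarsest level's block covers the whole area
    }

private:
    struct Level {
        double size;
        std::unordered_map<std::uint64_t, std::vector<int>> cells;
    };
    static int cellOf(double v, double s) { return static_cast<int>(std::floor(v / s)); }
    static std::uint64_t key(int cx, int cy) {
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(cx)) << 32) |
               static_cast<std::uint32_t>(cy);
    }
    std::vector<Point> pts_;
    std::vector<Level> levels_;
};
```

For a 640x480 bitmap one might construct `MultiGrid grid(points, 640)` and call `grid.nearest(x, y)` per pixel; each point is stored once per level, which trades memory for the simple stop test.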

A solution I'm trying:

First make a grid such that you have an average of, say, 1 point per box (more if you want a larger scan).
Select the center box. Continue selecting neighbor boxes in a circular manner until you find at least one neighbor. At this point you can have 1, 9, and so on boxes selected.
Select one more layer of adjacent boxes.
Now you have a fairly small list of points, usually not more than 10, which you can plug into the distance formula to find the nearest neighbor.

Since you have on average 1 point per box, you will mostly be selecting 9 boxes and comparing 9 distances. You can adjust the grid size according to your dataset properties to achieve better results.

Also, if your data has a lot of variance, you can try 2 levels of grid (or even more): if a selection returns more than 50 points in a single query, start the next grid search with a grid 1/10th the size ...
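The ring-by-ring selection in the steps above could look something like the sketch below. It deliberately swaps the "one more layer" rule for a distance bound, which is an assumption added here and not part of the answer: every box in ring r is at least (r - 1) box widths away from the query, so the search can stop once that bound exceeds the best distance found. The names `GridIndex`, `nearest` and `at` are hypothetical, and all points and queries are assumed to lie inside the width x height area.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

struct Point { double x, y; int type; };

class GridIndex {
public:
    GridIndex(std::vector<Point> pts, double width, double height, double cell)
        : pts_(std::move(pts)), cell_(cell),
          cols_(static_cast<int>(std::ceil(width / cell))),
          rows_(static_cast<int>(std::ceil(height / cell))),
          bins_(static_cast<std::size_t>(cols_) * rows_) {
        assert(cell > 0 && cols_ > 0 && rows_ > 0);
        for (int i = 0; i < static_cast<int>(pts_.size()); ++i)
            bins_[cellY(pts_[i].y) * cols_ + cellX(pts_[i].x)].push_back(i);
    }

    // Index of the nearest point, or -1 if the index is empty.
    int nearest(double qx, double qy) const {
        const int cx = cellX(qx), cy = cellY(qy);
        const int maxRing = std::max(cols_, rows_);
        int best = -1;
        double bestD2 = std::numeric_limits<double>::infinity();
        for (int r = 0; r <= maxRing; ++r) {
            // Boxes in ring r are at least (r - 1) * cell_ away from the query.
            if (r > 0 && best >= 0) {
                const double bound = (r - 1) * cell_;
                if (bound * bound > bestD2) break;
            }
            for (int y = cy - r; y <= cy + r; ++y) {
                if (y < 0 || y >= rows_) continue;
                // Top and bottom rows are scanned fully; other rows only at both ends.
                const bool edgeRow = (y == cy - r || y == cy + r);
                for (int x = cx - r; x <= cx + r; x += edgeRow ? 1 : 2 * r) {
                    if (x < 0 || x >= cols_) continue;
                    for (int i : bins_[y * cols_ + x]) {
                        const double ex = pts_[i].x - qx, ey = pts_[i].y - qy;
                        const double d2 = ex * ex + ey * ey;
                        if (d2 < bestD2) { bestD2 = d2; best = i; }
                    }
                }
            }
        }
        return best;
    }

    const Point& at(int i) const { return pts_[i]; }

private:
    // Clamp so that coordinates on the far border still map to a valid box.
    int cellX(double x) const { return std::clamp(static_cast<int>(x / cell_), 0, cols_ - 1); }
    int cellY(double y) const { return std::clamp(static_cast<int>(y / cell_), 0, rows_ - 1); }

    std::vector<Point> pts_;
    double cell_;
    int cols_, rows_;
    std::vector<std::vector<int>> bins_;
};
```

With this bound the number of rings visited adapts to the data: dense regions stop after two or three rings, while a sparse region keeps expanding until the bound rules out anything closer. The type of the winner is then `index.at(index.nearest(x, y)).type`, provided the returned index is not -1.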