
Speeding up algorithm to find the closest items in two data sets of 2D points

Hi, I'm looking to improve this algorithm, which is really slow. It should return a point that belongs to the closest pair of points across the two data sets.

The method I've used is just brute force, simply testing the distance between every pair of points. There must be a better way.

cv::Point FindClosesedValue(std::vector<cv::Point> & point_a, std::vector<cv::Point> & point_b)
{
  double lowest_distance = std::numeric_limits<double>::max();
  cv::Point best_point;
  for (cv::Point & a : point_a)
  {
    for (cv::Point & b : point_b)
    {
      double distance = CvFunctions::DistanceSquared(b, a);
      if (distance < lowest_distance)
      {
        lowest_distance = distance;
        best_point = a;
      }
    }
  }

  return best_point;
}

Please can someone point me in the right direction to speed up this code, hopefully by orders of magnitude? Example code would be amazing.

I've worked on a similar problem where I managed to get a 100x speedup, but it was dependent on the data.

If you can pre-sort one set of points into a grid of tiles, you can use the tile size to narrow down which points you need to test. A given point has a minimum and a maximum possible distance to any point in a specific tile, and you can use these bounds to limit which tiles you check, avoiding tests against points in far-away tiles.

Once the points are divided into tiles, you can look up which tile a new point would fall into and start there. Depending on your data, that tile might be empty of pre-sorted points, so initially you check the starting tile and its surrounding tiles until you find any point. That point gives you an upper bound on the minimum distance. Once you have that bound, you continue checking all points in tiles whose minimum possible distance to your chosen point is less than the bound; any points in tiles farther away cannot be closer than a point you've already found. The bound is of course updated whenever you find a closer point.

The sorting step is O(n), and the lookup step is bounded between O(n) and O(n^2), so the expected time is at most O(n^2) and usually much better, possibly close to linear if you have a suitable distribution of points.

As for tile size, I found choosing the tiles so the number of points in each tile was roughly equal to the number of tiles covering the data set yielded optimal runtimes. You can probably do better with a hierarchy of tiles, but I never got that complicated with my solution.
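As an illustration, a minimal sketch of this tiling approach might look like the following; it is not the original code, and the function name FindClosestViaGrid and the cell_size parameter are hypothetical. It assumes point_b is the pre-sorted set: each query point from point_a is tested ring by ring of cells, stopping once the next ring cannot contain anything closer than the best squared distance found so far.

#include <opencv2/core.hpp>
#include <algorithm>
#include <climits>
#include <cmath>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <unordered_map>
#include <vector>

cv::Point FindClosestViaGrid(const std::vector<cv::Point> & point_a,
                             const std::vector<cv::Point> & point_b,
                             int cell_size)
{
  cv::Point best_point;
  if (point_b.empty())
    return best_point;

  // Floor division so negative coordinates land in the correct cell.
  auto cell = [cell_size](int v) { return static_cast<int>(std::floor(double(v) / cell_size)); };
  auto key = [](int cx, int cy) { return (static_cast<int64_t>(cx) << 32) ^ static_cast<uint32_t>(cy); };

  // Bucket the pre-sorted set into cells (O(n)) and record the occupied cell bounds.
  std::unordered_map<int64_t, std::vector<cv::Point>> grid;
  int min_cx = INT_MAX, max_cx = INT_MIN, min_cy = INT_MAX, max_cy = INT_MIN;
  for (const cv::Point & b : point_b)
  {
    int cx = cell(b.x), cy = cell(b.y);
    grid[key(cx, cy)].push_back(b);
    min_cx = std::min(min_cx, cx); max_cx = std::max(max_cx, cx);
    min_cy = std::min(min_cy, cy); max_cy = std::max(max_cy, cy);
  }

  double lowest_distance = std::numeric_limits<double>::max();
  for (const cv::Point & a : point_a)
  {
    int cx = cell(a.x), cy = cell(a.y);
    // Farthest ring (in Chebyshev cell distance) that can still contain an occupied cell.
    int max_ring = std::max({std::abs(cx - min_cx), std::abs(cx - max_cx),
                             std::abs(cy - min_cy), std::abs(cy - max_cy)});
    for (int r = 0; r <= max_ring; ++r)
    {
      // No point in a cell r rings away can be closer than (r - 1) * cell_size.
      double ring_min = (r > 0) ? double(r - 1) * cell_size : 0.0;
      if (ring_min * ring_min > lowest_distance)
        break;
      for (int dx = -r; dx <= r; ++dx)
      {
        for (int dy = -r; dy <= r; ++dy)
        {
          if (std::max(std::abs(dx), std::abs(dy)) != r)
            continue;  // visit only the outer ring of cells
          auto it = grid.find(key(cx + dx, cy + dy));
          if (it == grid.end())
            continue;
          for (const cv::Point & b : it->second)
          {
            double ddx = a.x - b.x, ddy = a.y - b.y;
            double distance = ddx * ddx + ddy * ddy;
            if (distance < lowest_distance)
            {
              lowest_distance = distance;
              best_point = a;  // as in the question, keep the point from point_a
            }
          }
        }
      }
    }
  }

  return best_point;
}

Following the sizing note above, a reasonable starting cell_size is one that makes the number of points per tile roughly equal to the number of tiles covering the data set.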

This is a nice problem. The usual recursive closest pair of points algorithm clearly isn't going to work, as the two point sets may be clustered in different areas of the space.

You can still solve this in O(n log n) time though. Simply create a kd-tree (k=2) of all the points in one set, and query it with all the points from the other set.
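Since the question already uses cv::Point, one possible way to try this is OpenCV's FLANN wrapper; the sketch below is an assumed wiring, not code from this answer, and the function name FindClosestViaKdTree is made up for the example. Note that cv::flann::Index built with KDTreeIndexParams uses randomized kd-trees, so the search is approximate unless you raise the checks in SearchParams.

#include <opencv2/core.hpp>
#include <opencv2/flann.hpp>
#include <limits>
#include <vector>

cv::Point FindClosestViaKdTree(const std::vector<cv::Point> & point_a,
                               const std::vector<cv::Point> & point_b)
{
  // FLANN expects CV_32F matrices with one point per row.
  cv::Mat features(static_cast<int>(point_b.size()), 2, CV_32F);
  for (int i = 0; i < features.rows; ++i)
  {
    features.at<float>(i, 0) = static_cast<float>(point_b[i].x);
    features.at<float>(i, 1) = static_cast<float>(point_b[i].y);
  }
  cv::flann::Index index(features, cv::flann::KDTreeIndexParams(4));

  cv::Mat queries(static_cast<int>(point_a.size()), 2, CV_32F);
  for (int i = 0; i < queries.rows; ++i)
  {
    queries.at<float>(i, 0) = static_cast<float>(point_a[i].x);
    queries.at<float>(i, 1) = static_cast<float>(point_a[i].y);
  }

  // One nearest neighbour per query; dists holds squared L2 distances.
  cv::Mat indices, dists;
  index.knnSearch(queries, indices, dists, 1, cv::flann::SearchParams(64));

  double lowest_distance = std::numeric_limits<double>::max();
  cv::Point best_point;
  for (int i = 0; i < dists.rows; ++i)
  {
    if (dists.at<float>(i, 0) < lowest_distance)
    {
      lowest_distance = dists.at<float>(i, 0);
      best_point = point_a[i];  // as in the question, return the point from point_a
    }
  }
  return best_point;
}

Building the index takes roughly O(n log n) and each query is roughly O(log n), which is where the overall O(n log n) estimate comes from when the two sets are of similar size.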
