Closest point algorithm | How can I improve it?

Question

I wrote a k-means clustering algorithm and a color quantization algorithm. They produce the expected results, but I want to make them faster. Both implementations have to solve the same problem: given two arrays of points in 3D space, find, for each point of the first array, the closest point in the second array. I do it like this:

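// Fragment of a larger function; requires <cmath>, <cstddef> and <limits>.
size_t closest_cluster_index = 0;
double x_dif, y_dif, z_dif;
// Squared distance from the previous point to its closest cluster.
double old_distance = std::numeric_limits<double>::infinity();
double new_distance;

for (auto point = points.begin(); point != points.end(); point++)
{
    //FIX
    //as suggested by juvian
    //K = 1
    if (point != points.begin())
    {
        // Re-test the cluster that was closest to the previous point.
        auto cluster = &(clusters[closest_cluster_index]);

        x_dif = cluster->x - point->x;
        y_dif = cluster->y - point->y;
        z_dif = cluster->z - point->z;

        new_distance = x_dif * x_dif + y_dif * y_dif + z_dif * z_dif;

        // The triangle-inequality test needs true distances, so both squared
        // distances go through sqrt. ColorU8::differenceRGB is assumed to
        // return the (non-squared) Euclidean distance between two colors.
        if (std::sqrt(new_distance) <= std::sqrt(old_distance) - ColorU8::differenceRGB(*(point - 1), *point))
        {
            old_distance = new_distance;
            //do sth with closest_cluster_index;
            continue;
        }
    }
    //END OF FIX

    old_distance = std::numeric_limits<double>::infinity();

    // Brute-force scan: compare the point against every cluster.
    for (auto cluster = clusters.begin(); cluster != clusters.end(); cluster++)
    {
        x_dif = cluster->x - point->x;
        y_dif = cluster->y - point->y;
        z_dif = cluster->z - point->z;

        // Squared distance is enough for comparisons, so sqrt is skipped here.
        new_distance = x_dif * x_dif + y_dif * y_dif + z_dif * z_dif;

        if (new_distance < old_distance)
        {
            old_distance = new_distance;
            closest_cluster_index = cluster - clusters.begin();
        }
    }
    //do sth with: closest_cluster_index
}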

How can I improve it? (I don't want to make it multithreaded or run it on the GPU.)

Answer 1

There are several data structures for efficient nearest-neighbour queries. In 3D, a k-d tree works really well: each query costs O(log n) on average, where n is the number of clusters, which improves on your current O(n) per point.

With this structure, you add all the points from clusters to the tree, and then, for each point in points, you query the tree for the closest one. For your particular case a static k-d tree is enough, as you don't need to update points.
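The static k-d tree described above could look roughly like the sketch below. It is an illustration under assumptions, not the asker's code: `Point3` and `KdTree3` are hypothetical names, the tree is stored implicitly as a median-split ordering of indices, and distances are compared in squared form.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical point type with the same x, y, z fields as in the question.
struct Point3 { double x, y, z; };

// Static 3D k-d tree. Each index sub-range [lo, hi) is a subtree whose root is
// the median element; the split axis cycles x -> y -> z with depth.
class KdTree3 {
public:
    explicit KdTree3(const std::vector<Point3>& pts) : pts_(pts), idx_(pts.size()) {
        for (std::size_t i = 0; i < idx_.size(); ++i) idx_[i] = i;
        build(0, idx_.size(), 0);
    }

    // Returns the index (into the original vector) of the point closest to q.
    std::size_t nearest(const Point3& q) const {
        assert(!pts_.empty());
        std::size_t best = idx_[0];
        double best_d2 = std::numeric_limits<double>::infinity();
        search(0, idx_.size(), 0, q, best, best_d2);
        return best;
    }

private:
    static double coord(const Point3& p, int axis) {
        return axis == 0 ? p.x : (axis == 1 ? p.y : p.z);
    }
    static double dist2(const Point3& a, const Point3& b) {
        const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }

    // Places the median on the current axis at the middle of [lo, hi),
    // then recurses into both halves with the next axis.
    void build(std::size_t lo, std::size_t hi, int axis) {
        if (hi - lo <= 1) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        std::nth_element(idx_.begin() + lo, idx_.begin() + mid, idx_.begin() + hi,
                         [&](std::size_t a, std::size_t b) {
                             return coord(pts_[a], axis) < coord(pts_[b], axis);
                         });
        build(lo, mid, (axis + 1) % 3);
        build(mid + 1, hi, (axis + 1) % 3);
    }

    void search(std::size_t lo, std::size_t hi, int axis, const Point3& q,
                std::size_t& best, double& best_d2) const {
        if (lo >= hi) return;
        const std::size_t mid = lo + (hi - lo) / 2;
        const Point3& p = pts_[idx_[mid]];
        const double d2 = dist2(p, q);
        if (d2 < best_d2) { best_d2 = d2; best = idx_[mid]; }

        const double delta = coord(q, axis) - coord(p, axis);
        const int next = (axis + 1) % 3;
        // Visit the half that contains q first; visit the other half only if
        // the splitting plane is closer than the best match found so far.
        if (delta < 0) {
            search(lo, mid, next, q, best, best_d2);
            if (delta * delta < best_d2) search(mid + 1, hi, next, q, best, best_d2);
        } else {
            search(mid + 1, hi, next, q, best, best_d2);
            if (delta * delta < best_d2) search(lo, mid, next, q, best, best_d2);
        }
    }

    std::vector<Point3> pts_;
    std::vector<std::size_t> idx_;
};
```

In the question's loop, the inner scan over clusters would become a single `tree.nearest(point)` call. Because the tree is static, in k-means it would have to be rebuilt from the updated clusters at every iteration, so whether it beats the plain scan depends on how many clusters there are.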

Another approach:

We can accept the risk of doing extra computation on some points in exchange for doing less on others. This method should work well under the following assumptions:

The distance between one cluster and another is large
The distance between a point and the adjacent point is small

I think these apply to your case because your clusters are a small number of colors and your points come from a real image, which tends to have similar colors in adjacent pixels.

Keep a max-heap as you go through the points. Instead of storing only the closest cluster, store the k closest clusters in the heap. When you move to the next point, this information can be reused. Let's call the point the heap was built for P, and its kth closest cluster C.

Now, for a new point P2, before comparing against all clusters we check whether the closest cluster to P2 is already in our heap. This is guaranteed when the distance between some cluster in the heap and P2 is <= distance(P, C) - distance(P, P2): every cluster outside the heap is at least distance(P, C) away from P, so by the triangle inequality it is at least distance(P, C) - distance(P, P2) away from P2. When the condition holds, we only need to check the clusters in the heap instead of all clusters. When it does not hold, we compare against all clusters, rebuild the heap, and P becomes P2.

You will need to try different values of k to see whether this actually improves performance. For K = 2, it may be worth avoiding the added complexity of a heap and just using plain variables.
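The reuse scheme above can be sketched as follows. This is one possible reading of it, with stated assumptions: `Vec3` and `CachedNearest` are hypothetical names, a small sorted vector stands in for the max-heap (a simplification that is reasonable for small k), and true Euclidean distances are used because the triangle inequality does not hold for squared distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical point type with the same x, y, z fields as in the question.
struct Vec3 { double x, y, z; };

class CachedNearest {
public:
    // Keeps its own copy of the clusters; k is clamped to [1, clusters.size()].
    CachedNearest(const std::vector<Vec3>& clusters, std::size_t k)
        : clusters_(clusters),
          k_(std::min<std::size_t>(std::max<std::size_t>(k, 1), clusters.size())) {
        assert(!clusters_.empty());
    }

    // Returns the index of the cluster closest to p2.
    std::size_t nearest(const Vec3& p2) {
        if (!cand_.empty()) {
            // distance(P, C) - distance(P, P2) from the answer's condition.
            const double bound = radius_ - dist(anchor_, p2);
            std::size_t best = cand_[0];
            double best_d = dist(clusters_[best], p2);
            for (std::size_t i = 1; i < cand_.size(); ++i) {
                const double d = dist(clusters_[cand_[i]], p2);
                if (d < best_d) { best_d = d; best = cand_[i]; }
            }
            // No cluster outside the cache can be closer than bound,
            // so the cached winner is a global winner.
            if (best_d <= bound) { ++hits_; return best; }
        }
        rebuild(p2);  // full scan; P becomes P2
        return cand_[0];
    }

    // Number of queries answered from the cache, useful when tuning k.
    std::size_t hits() const { return hits_; }

private:
    static double dist(const Vec3& a, const Vec3& b) {
        const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Scans all clusters and caches the k closest ones, nearest first.
    void rebuild(const Vec3& p) {
        std::vector<std::pair<double, std::size_t>> all(clusters_.size());
        for (std::size_t i = 0; i < clusters_.size(); ++i)
            all[i] = std::make_pair(dist(clusters_[i], p), i);
        std::partial_sort(all.begin(), all.begin() + k_, all.end());
        cand_.clear();
        for (std::size_t i = 0; i < k_; ++i) cand_.push_back(all[i].second);
        radius_ = all[k_ - 1].first;  // distance(P, C), C being the kth closest
        anchor_ = p;
    }

    std::vector<Vec3> clusters_;
    std::size_t k_;
    std::vector<std::size_t> cand_;  // indices of the k clusters closest to anchor_
    Vec3 anchor_{0.0, 0.0, 0.0};     // the point P the cache was built for
    double radius_ = 0.0;
    std::size_t hits_ = 0;
};
```

Comparing `hits()` with the total number of queries for a few values of k on a real image is one way to judge whether the cache pays for its extra distance computations.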

Solution 1
6 Accepted 2018-11-12 14:43:42
