简体   繁体   中英

K-mean Clustering R-Tree boost

I'm using R-Tree boost . I added a hundred thousand points in r-tree boost. Now I want to cluster and group my points like this link . It seems like that I should calculate k-mean value from points. How is it possible to calculate k-mean value from r-tree points geometry.

There are various clustering algorithms having different properties and inputs. What needs to be considered before choosing an algorithm is what do you want to achieve. k-means referred by you in the question aims to partition set of points into k clusters. So the input is the desired number of clusters. On the other hand, the algorithm described in the blog you linked, a variant of greedy clustering algorithm aims to partition set of points into circular clusters of some size. The input is the radius of the desired cluster.

There are various algorithms performing k-means clustering used for different data and applications like separating 2 n-dimensional subsets with hyperplane or clustering with Voronoi diagram (Lloyd's algorithm) often called k-means algorithm. There are also density-based clustering algorithms mentioned by @Anony-Mousse in the comments under your question.

In the article, you mentioned it's a hierarchical version of greedy clustering. They have to calculate the clusters for multiple zoom levels and to avoid analyzing all of the points each time they use the centroids of clusters from the previously analyzed level as a source of points for clustering for the next level. However, in this answer, I'll show how to implement this algorithm for one level only. So the input will be a set of points and a size of cluster as a radius. If you need hierarchical version you should calculate centroids of the output clusters and use them as the input of the algorithm for the next level.

Using Boost.Geometry R-tree the algorithm for one level (so not hierarchical) could be implemented like this (in C++11):

#include <boost/geometry.hpp>
#include <boost/geometry/index/rtree.hpp>

#include <boost/range/adaptor/indexed.hpp>
#include <boost/range/adaptor/transformed.hpp>

#include <iostream>
#include <vector>

namespace bg = boost::geometry;
namespace bgi = boost::geometry::index;

typedef bg::model::point<double, 2, bg::cs::cartesian> point_t;
typedef bg::model::box<point_t> box_t;
typedef std::vector<point_t> cluster_t;

// used in the rtree constructor with Boost.Range adaptors
// to generate std::pair<point_t, std::size_t> from point_t on the fly
template <typename First, typename Second>
struct pair_generator
{
    typedef std::pair<First, Second> result_type;
    template<typename T>
    inline result_type operator()(T const& v) const
    {
        return result_type(v.value(), v.index());
    }
};

// used to hold point-related information during clustering
struct point_data
{
    point_data() : used(false) {}
    bool used;
};

// find clusters of points using cluster radius r
void find_clusters(std::vector<point_t> const& points,
                   double r,
                   std::vector<cluster_t> & clusters)
{
    typedef std::pair<point_t, std::size_t> value_t;
    typedef pair_generator<point_t, std::size_t> value_generator;

    if (r < 0.0)
        return; // or return error

    // create rtree holding std::pair<point_t, std::size_t>
    // from container of points of type point_t
    bgi::rtree<value_t, bgi::rstar<4> >
        rtree(points | boost::adaptors::indexed()
                     | boost::adaptors::transformed(value_generator()));

    // create container holding point states
    std::vector<point_data> points_data(rtree.size());

    // for all pairs contained in the rtree
    for(auto const& v : rtree)
    {
        // ignore points that were used before
        if (points_data[v.second].used)
            continue;

        // current point
        point_t const& p = v.first;
        double x = bg::get<0>(p);
        double y = bg::get<1>(p);

        // find all points in circle of radius r around current point
        std::vector<value_t> res;
        rtree.query(
            // return points that are in a box enclosing the circle
            bgi::intersects(box_t{{x-r, y-r},{x+r, y+r}})
            // and were not used before
            // and are indeed in the circle
            && bgi::satisfies([&](value_t const& v){
                   return points_data[v.second].used == false
                       && bg::distance(p, v.first) <= r;
            }),
            std::back_inserter(res));

        // create new cluster
        clusters.push_back(cluster_t());
        // add points to this cluster and mark them as used
        for(auto const& v : res) {
            clusters.back().push_back(v.first);
            points_data[v.second].used = true;
        }
    }
}

int main()
{
    std::vector<point_t> points;

    for (double x = 0.0 ; x < 10.0 ; x += 1.0)
        for (double y = 0.0 ; y < 10.0 ; y += 1.0)
            points.push_back(point_t{x, y});

    std::vector<cluster_t> clusters;

    find_clusters(points, 3.0, clusters);

    for(size_t i = 0 ; i < clusters.size() ; ++i) {
        std::cout << "Cluster " << i << std::endl;
        for (auto const& p : clusters[i]) {
            std::cout << bg::wkt(p) << std::endl;
        }
    }
}

See also their implementation: https://github.com/mapbox/supercluster/blob/master/index.js#L216

Furthermore, take into account the remarks of @Anony-Mousse about the accuracy of distance calculation on the globe. The solution above is for cartesian coordinate system. If you want to use different coordinate system then you have to define point type differently, eg use bg::cs::spherical_equatorial<bg::degree> or bg::cs::geographic<bg::degree> instead of bg::cs::cartesian . You also have to generate the query bounding box differently. But bg::distance() will automatically return correct distance after changing the point type.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM