
apache-flink KMeans operation on UnsortedGrouping

I have a Flink DataSet (read from a file) that contains sensor readings from many different sensors. I use Flink's groupBy() method to organize the data as an UnsortedGrouping per sensor. Next, I would like to run the KMeans algorithm on every UnsortedGrouping in my DataSet in a distributed way.

My question is: how do I implement this functionality efficiently in Flink? Below is my current implementation: I have written my own groupReduce() method that applies the Flink KMeans algorithm to every UnsortedGrouping. This code works, but it seems very slow and uses a large amount of memory.

I think this has to do with the amount of data reorganization I have to do. Several data conversions have to be performed to make the code run at all, because I don't know how to do this more efficiently:

  • UnsortedGrouping to Iterable (start of the groupReduce() method)
  • Iterable to LinkedList (needed to use the fromCollection() method)
  • LinkedList to DataSet (required as input to KMeans)
  • resulting KMeans DataSet back to LinkedList (to be able to iterate for the Collector)

Surely there must be a more efficient and performant way to implement this? Can anybody show me how to implement it in a clean and idiomatic Flink way?

// *************************************************************************
// VARIABLES
// *************************************************************************

static int numberClusters = 10;
static int maxIterations = 10;
static int sensorCount = 117;
static ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// *************************************************************************
// PROGRAM
// *************************************************************************

public static void main(String[] args) throws Exception {

    final long startTime = System.currentTimeMillis();

    String fileName = "C:/tmp/data.nt";
    DataSet<String> text = env.readTextFile(fileName);

    // filter the relevant lines from the text file input and group them by sensor id
    UnsortedGrouping<Tuple2<Integer,Point>> points = text
            .filter(x -> x.contains("Value") && x.contains("valueLiteral"))
            .filter(x -> !x.contains("#string"))
            .map(x -> new Tuple2<Integer, Point>(
                    Integer.parseInt(x.substring(x.indexOf("_") + 1, x.indexOf(">"))) % sensorCount, // sensor id
                    new Point(Double.parseDouble(x.split("\"")[1])))) // sensor reading
            .filter(x -> x.f0 < 10) // keep only sensors 0-9
            .groupBy(0);

    DataSet<Tuple2<Integer, Point>> output = points.reduceGroup(new DistinctReduce());
    output.print();

    // print the execution time
    final long endTime = System.currentTimeMillis();
    System.out.println("Total execution time: " + (endTime - startTime) + "ms");
}

public static class DistinctReduce implements GroupReduceFunction<Tuple2<Integer, Point>, Tuple2<Integer, Point>> {

    private static final long serialVersionUID = 1L;

    @Override
    public void reduce(Iterable<Tuple2<Integer, Point>> in, Collector<Tuple2<Integer, Point>> out) throws Exception {

        AtomicInteger counter = new AtomicInteger(0); // assigns centroid ids from within the lambda below
        List<Point> pointsList = new LinkedList<Point>();

        for (Tuple2<Integer, Point> t : in) {
            pointsList.add(new Point(t.f1.x));
        }
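        // NOTE: building a DataSet here and executing it with print()/collect() below
        // launches a separate nested Flink job for every sensor group, which is a
        // likely cause of the slowness and memory usage described above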
        DataSet<Point> points = env.fromCollection(pointsList);

        DataSet<Centroid> centroids = points
                .distinct()
                .first(numberClusters)
                .map(x -> new Centroid(counter.incrementAndGet(), x));
        // DataSet<String> test = centroids.map(x -> String.format("Centroid %s", x));
        // test.print();

        IterativeDataSet<Centroid> loop = centroids.iterate(maxIterations);
        DataSet<Centroid> newCentroids = points
                // compute closest centroid for each point
                .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
                // count and sum point coordinates for each centroid
                .map(new CountAppender())
                .groupBy(0).reduce(new CentroidAccumulator())
                // compute new centroids from point counts and coordinate sums
                .map(new CentroidAverager());

        // feed new centroids back into next iteration
        DataSet<Centroid> finalCentroids = loop.closeWith(newCentroids);

        // assign points to final clusters
        DataSet<Tuple2<Integer, Point>> clusteredPoints = points
                .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");

        // emit result
        System.out.println("Results from the KMeans algorithm:");
        clusteredPoints.print();

        // emit all clustered points of this group
        List<Tuple2<Integer, Point>> clusteredPointsList = clusteredPoints.collect();
        for(Tuple2<Integer, Point> t : clusteredPointsList) {
            out.collect(t);
        }
    }
}

You have to group the data points and the centroids first. Then you iterate over the centroids and co-group them with the data points. For each point in a group you assign it to the closest centroid. Then you group on the initial group index and the centroid index to reduce all points assigned to the same centroid. That is the result of one iteration.

The code could look like this:

DataSet<Tuple2<Integer, Point>> groupedPoints = ...

DataSet<Tuple2<Integer, Centroid>> groupCentroids = ...

IterativeDataSet<Tuple2<Integer, Centroid>> groupLoop = groupCentroids.iterate(10);

DataSet<Tuple2<Integer, Centroid>> newGroupCentroids = groupLoop
    .coGroup(groupedPoints).where(0).equalTo(0)
    .with(new CoGroupFunction<Tuple2<Integer, Centroid>, Tuple2<Integer, Point>, Tuple4<Integer, Integer, Point, Integer>>() {
    @Override
    public void coGroup(Iterable<Tuple2<Integer, Centroid>> centroidsIterable, Iterable<Tuple2<Integer, Point>> points, Collector<Tuple4<Integer, Integer, Point, Integer>> out) throws Exception {
        // cache the centroids up front: the Iterable can only be traversed once,
        // but every point has to be compared against all centroids
        List<Tuple2<Integer, Centroid>> centroids = new ArrayList<>();
        for (Tuple2<Integer, Centroid> centroidTuple : centroidsIterable) {
            centroids.add(centroidTuple);
        }

        for (Tuple2<Integer, Point> pointTuple : points) {
            double minDistance = Double.MAX_VALUE;
            int minIndex = -1;
            Point point = pointTuple.f1;

            for (Tuple2<Integer, Centroid> centroidTuple : centroids) {
                Centroid centroid = centroidTuple.f1;
                double distance = point.euclideanDistance(centroid);

                if (distance < minDistance) {
                    minDistance = distance;
                    minIndex = centroid.id;
                }
            }

            out.collect(Tuple4.of(minIndex, pointTuple.f0, point, 1));
        }
    }})
    .groupBy(0, 1).reduce(new ReduceFunction<Tuple4<Integer, Integer, Point, Integer>>() {
        @Override
        public Tuple4<Integer, Integer, Point, Integer> reduce(Tuple4<Integer, Integer, Point, Integer> value1, Tuple4<Integer, Integer, Point, Integer> value2) throws Exception {
            return Tuple4.of(value1.f0, value1.f1, value1.f2.add(value2.f2), value1.f3 + value2.f3);
        }
    }).map(new MapFunction<Tuple4<Integer,Integer,Point,Integer>, Tuple2<Integer, Centroid>>() {
        @Override
        public Tuple2<Integer, Centroid> map(Tuple4<Integer, Integer, Point, Integer> value) throws Exception {
            return Tuple2.of(value.f1, new Centroid(value.f0, value.f2.div(value.f3)));
        }
    });

DataSet<Tuple2<Integer, Centroid>> result = groupLoop.closeWith(newGroupCentroids);
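For completeness, here is a minimal sketch of one way to produce the elided groupCentroids input: seed every group with its first numberClusters distinct points, mirroring the first(numberClusters) seeding used in the question. The seeding strategy and the anonymous reducer below are illustrative assumptions, not part of the answer:

// hypothetical seeding (an assumption for illustration): use the first
// numberClusters distinct points of every group as its initial centroids
DataSet<Tuple2<Integer, Centroid>> groupCentroids = groupedPoints
    .distinct()
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple2<Integer, Point>, Tuple2<Integer, Centroid>>() {
        @Override
        public void reduce(Iterable<Tuple2<Integer, Point>> in, Collector<Tuple2<Integer, Centroid>> out) {
            int id = 0;
            for (Tuple2<Integer, Point> t : in) {
                if (id++ >= numberClusters) {
                    break; // keep only the first numberClusters points per group
                }
                out.collect(Tuple2.of(t.f0, new Centroid(id, t.f1)));
            }
        }
    });

The final assignment of points to their per-group clusters can then reuse the same coGroup pattern, with result taking the place of groupLoop.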
