Computing variables before each iteration using the DataSet API in Apache Flink

I am working with the clustering example provided with Flink (KMeans) and trying to extend its functionality. The goal is to reduce the number of distance computations by precomputing the distances between every pair of centroids and storing them in a double[][] array, so that they can be looked up instead of recomputed. This array must be computed at the beginning of each iteration and broadcast to the step that assigns points to clusters.

I have tried the following:

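import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;

// Maps a pair of centroids to (id of the first, id of the second, distance between them).
public static final class computeCentroidInterDistance implements MapFunction<Tuple2<Centroid, Centroid>, Tuple3<Integer, Integer, Double>> {

    @Override
    public Tuple3<Integer, Integer, Double> map(Tuple2<Centroid, Centroid> centroid) throws Exception {
        return new Tuple3<>(centroid.f0.id, centroid.f1.id, centroid.f0.euclideanDistance(centroid.f1));
    }
}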

DataSet<Centroid> centroids = getCentroidDataSet(..);

DataSet<Tuple3<Integer, Integer, Double>> distances = centroids
    .crossWithTiny(centroids)
    .map(new computeCentroidInterDistance());

However, I don't see how the distances DataSet can be used for my use case, because its elements are not returned in any specific order that could be used to look up the distance between two given centroids. Is there a better way of doing this?

DataSets are inherently unordered and sharded, neither of which suits your use case.

What you want to do is first collect all centroids in a single method invocation.

DataSet<double[][]> matrix = centroids.reduceGroup(...)

Because centroids is not grouped, reduceGroup passes all of its elements to a single call, so within it you have access to every centroid and can perform the calculation. The output should be your double[][] matrix.
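As a rough sketch of the calculation that could sit inside such a reduceGroup function, the plain-Java snippet below builds the matrix from a collection of centroids. The Centroid class here is a simplified, hypothetical stand-in for the one in the KMeans example, and the code assumes centroid ids run from 0 to k-1 so that they can double as array indices; with other ids, an id-to-index mapping would be needed first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CentroidDistanceMatrix {

    /** Hypothetical, simplified stand-in for the example's Centroid: an id plus coordinates. */
    static final class Centroid {
        final int id;
        final double[] coords;

        Centroid(int id, double... coords) {
            this.id = id;
            this.coords = coords;
        }

        double euclideanDistance(Centroid other) {
            double sum = 0.0;
            for (int i = 0; i < coords.length; i++) {
                double diff = coords[i] - other.coords[i];
                sum += diff * diff;
            }
            return Math.sqrt(sum);
        }
    }

    /**
     * Builds a symmetric matrix in which matrix[a][b] is the distance between
     * the centroids with ids a and b. Assumption: ids are exactly 0..k-1.
     */
    static double[][] distanceMatrix(Iterable<Centroid> centroids) {
        // Materialize the iterable so every pair can be visited.
        List<Centroid> all = new ArrayList<>();
        for (Centroid c : centroids) {
            all.add(c);
        }
        int k = all.size();
        double[][] matrix = new double[k][k];
        // Each unordered pair is computed once and stored in both positions.
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                Centroid a = all.get(i);
                Centroid b = all.get(j);
                double d = a.euclideanDistance(b);
                matrix[a.id][b.id] = d;
                matrix[b.id][a.id] = d;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] m = distanceMatrix(Arrays.asList(
                new Centroid(0, 0.0, 0.0),
                new Centroid(1, 3.0, 4.0)));
        System.out.println(m[0][1]); // distance between centroid 0 and centroid 1
    }
}
```

In Flink, the same loop would go into the reduce(Iterable, Collector) method of a GroupReduceFunction, which would emit the finished matrix once. Because the result is indexed by centroid id, the order in which the centroids arrive does not matter.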

The matrix can then be distributed with a broadcast.
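Once the matrix reaches the assignment step (in Flink it would typically be read in the open() method of a rich function via getRuntimeContext().getBroadcastVariable(...)), it can be used to skip distance computations. The sketch below is one possible way to do that, not something taken from the question: it assumes the triangle-inequality test, where a candidate centroid j is skipped whenever its distance to the current best centroid is at least twice the point's distance to that best centroid. The class name, method names, and the counter parameter are hypothetical and exist only for illustration.

```java
final class PrunedAssignment {

    /** Euclidean distance between two coordinate arrays of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns the index of the centroid closest to the point.
     * centroidDistances[i][j] must hold the distance between centroids i and j.
     * computed[0] is incremented for every point-to-centroid distance actually computed.
     */
    static int nearest(double[] point, double[][] centroids,
                       double[][] centroidDistances, int[] computed) {
        int best = 0;
        double bestDist = distance(point, centroids[0]);
        computed[0]++;
        for (int j = 1; j < centroids.length; j++) {
            // Triangle inequality: d(point, j) >= d(best, j) - d(point, best).
            // If d(best, j) >= 2 * d(point, best), centroid j cannot be closer than best.
            if (centroidDistances[best][j] >= 2 * bestDist) {
                continue;
            }
            double d = distance(point, centroids[j]);
            computed[0]++;
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }
}
```

The saving depends on how well separated the centroids are: points that lie close to their centroid skip most candidates, while points near a cluster boundary still need several distance computations.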
