
Apache Flink: When using count() on DataSet only this job will be executed

I've got a weird problem: when I call count() on a DataSet before further processing (a BulkIteration), Apache Flink only executes the plan for count() and skips my other operations. I couldn't find anything about this in the logs.

Furthermore, this doesn't happen in my IDE; there, all the operations run. The problem only occurs when I submit the job via the WebUI.

So: is this a general problem? How can I solve it without having to compute the count value myself?

Thanks!

UPDATE:

The code does something similar to this (I know this example isn't well designed for production code, but it demonstrates my problem):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.aggregation.Aggregations;
import org.apache.flink.api.java.tuple.Tuple1;

import java.util.LinkedList;
import java.util.List;
import java.util.Random;

public class CountProblemExample {

    public static void main(String[] args) throws Exception {
        Random rnd = new Random();

        int randomNumber = 100000 + rnd.nextInt(100000);

        List<Double> doubles = new LinkedList<>();
        for (int i = 0; i < randomNumber; i++) {
            doubles.add(rnd.nextDouble());
        }

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Double> doubleDataSet = env.fromCollection(doubles);

        final int count = (int) doubleDataSet.count(); // When submitted via the WebUI, execution stops after this job

        DataSet<Double> avgSet = doubleDataSet
                .map(new MapFunction<Double, Tuple1<Double>>() {
                    @Override
                    public Tuple1<Double> map(Double value) throws Exception {
                        return new Tuple1<>(value);
                    }
                })
                .aggregate(Aggregations.SUM, 0)
                .map(new MapFunction<Tuple1<Double>, Double>() {
                    @Override
                    public Double map(Tuple1<Double> t) throws Exception {
                        double avg = 0;
                        if (count > 0) {
                            avg = t.f0 / count;
                        }

                        return avg;
                    }
                });

        double avg = avgSet
                .collect()
                .get(0);

        System.out.println(avg);
    }

}

You probably forgot to call ExecutionEnvironment.execute(). A DataSet job is not executed before you call that method.

DataSet.count() and DataSet.collect() internally trigger an execution as well.
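Because count() eagerly triggers its own job, one way to avoid the extra execution is to compute the sum and the count together in a single pass, so the whole program runs as one job triggered by collect(). Below is a minimal sketch of that pattern, assuming the Flink 1.x DataSet API; the class name SingleJobAverage and the sample input are illustrative, not from the original post.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class SingleJobAverage {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Illustrative sample input; the original post used random doubles.
        DataSet<Double> values = env.fromElements(1.0, 2.0, 3.0);

        // Pair each value with a count of 1, then reduce to a single (sum, count) tuple.
        Tuple2<Double, Long> sumAndCount = values
                .map(new MapFunction<Double, Tuple2<Double, Long>>() {
                    @Override
                    public Tuple2<Double, Long> map(Double v) {
                        return new Tuple2<>(v, 1L);
                    }
                })
                .reduce(new ReduceFunction<Tuple2<Double, Long>>() {
                    @Override
                    public Tuple2<Double, Long> reduce(Tuple2<Double, Long> a,
                                                       Tuple2<Double, Long> b) {
                        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
                    }
                })
                .collect() // collect() triggers the one and only job execution
                .get(0);

        double avg = sumAndCount.f1 > 0 ? sumAndCount.f0 / sumAndCount.f1 : 0.0;
        System.out.println(avg);
    }
}
```

With this structure there is no intermediate count() job, so the behavior is the same in the IDE and when submitted through the WebUI.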
