
Flink: DataSet.count() is bottleneck - How to count in parallel?

I am learning Map-Reduce using Flink and have a question about how to efficiently count elements in a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

When executing this, my Flink log says

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

So only one CPU is used (I have four, and other operations like reduce use all of them).

I think count() internally collects the DataSet from all four CPUs and counts the elements sequentially, instead of having each CPU count its part and then summing up the partial counts. Is that true?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as the first item and the long value 1 as the second item, and then aggregate it using the SUM function?

For example, the DataSet<MyClass> would be mapped to a DataSet<Tuple2<MyClass, Long>> where the Long would always be 1. So when I sum up all items, the sum of the tuples' second values would be the correct count.

What is the best practice to count items in a DataSet?

Regards, Simon

DataSet#count() is a non-parallel operation and thus can only use a single thread.

You could do a count-by-key to get parallelism and apply a final sum over your per-key counts to get the overall count, which speeds up the computation.
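A minimal sketch of that count-by-key idea (the fan-out of 4 artificial keys and the hash-based key choice are my assumptions, not from the answer): tag each record with one of a few keys, count per key in parallel, then sum the handful of per-key counts after collect:

// assumes the usual Flink DataSet API imports, e.g.:
// import org.apache.flink.api.common.functions.MapFunction;
// import org.apache.flink.api.java.DataSet;
// import org.apache.flink.api.java.tuple.Tuple2;
DataSet<Tuple2<Integer, Long>> perKey = ds
    .map(new MapFunction<MyClass, Tuple2<Integer, Long>>() {
        @Override
        public Tuple2<Integer, Long> map(MyClass value) throws Exception {
            // spread records over 4 artificial keys so counting runs in parallel
            int key = (value.hashCode() & Integer.MAX_VALUE) % 4;
            return new Tuple2<>(key, 1L);
        }
    })
    .groupBy(0)
    .sum(1); // per-key counts, computed in parallel

long total = 0L;
for (Tuple2<Integer, Long> t : perKey.collect()) {
    total += t.f1; // final sum over at most 4 per-key counts
}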

Is this a good solution?

DataSet<Tuple1<Long>> x = ds.map(new MapFunction<MyClass, Tuple1<Long>>() {
    @Override
    public Tuple1<Long> map(MyClass t) throws Exception {
        return new Tuple1<Long>(1L); // map every element to the tuple (1L)
    }
}).groupBy(0).sum(0); // all tuples share the constant key 1L; sum(0) yields the total count

Long c = x.collect().iterator().next().f0; // the single result tuple carries the count
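One caveat with this version: because field 0 is always 1L, all tuples end up in a single group, so the final merge of sum(0) runs in one task (Flink's built-in aggregations are combinable, though, so partial sums should still be computed in parallel before that merge). An alternative sketch, assuming only the standard DataSet API and the ds and MyClass from above, avoids the per-record tuples entirely by counting each parallel partition once with mapPartition and summing the few partial counts after collect:

// assumes: import org.apache.flink.api.common.functions.MapPartitionFunction;
//          import org.apache.flink.util.Collector;
DataSet<Long> partials = ds.mapPartition(new MapPartitionFunction<MyClass, Long>() {
    @Override
    public void mapPartition(Iterable<MyClass> values, Collector<Long> out) throws Exception {
        long cnt = 0L; // one partial count per parallel partition
        for (MyClass ignored : values) {
            cnt++;
        }
        out.collect(cnt);
    }
});

long total = 0L;
for (Long partial : partials.collect()) {
    total += partial; // tiny final sum on the client side
}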
