
Flink: DataSet.count() is bottleneck - How to count in parallel?

I am learning Map-Reduce using Flink and have a question about how to efficiently count elements in a DataSet. What I have so far is this:

DataSet<MyClass> ds = ...;
long num = ds.count();

When executing this, my Flink log says

12/03/2016 19:47:27 DataSink (count())(1/1) switched to RUNNING

So only one CPU is used (I have four, and other operations like reduce use all of them).

I think count() internally collects the DataSet from all four CPUs and counts the elements sequentially, instead of having each CPU count its part and then summing up the partial counts. Is that true?

If yes, how can I take advantage of all my CPUs? Would it be a good idea to first map my DataSet to a 2-tuple that contains the original value as the first item and the long value 1 as the second item, and then aggregate it using the SUM function?

For example, the DataSet<MyClass> would be mapped to a DataSet<Tuple2<MyClass, Long>> where the Long would always be 1. So when I sum up all items, the sum of the tuples' second values would be the correct count.

What is the best practice to count items in a DataSet?

Regards, Simon

DataSet#count() is a non-parallel operation and thus can only use a single thread.

You could do a count-by-key to get parallelism and apply a final sum over your per-key counts to get the overall count, which speeds up the computation.
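A minimal sketch of that count-by-key idea (the fan-out of 4 artificial keys and the hash-based key choice are my assumptions, not from the answer): tag each record with one of a few keys, count per key in parallel, then sum the handful of per-key counts after collect:

// assumes the usual Flink DataSet API imports, e.g.:
// import org.apache.flink.api.common.functions.MapFunction;
// import org.apache.flink.api.java.DataSet;
// import org.apache.flink.api.java.tuple.Tuple2;
DataSet<Tuple2<Integer, Long>> perKey = ds
    .map(new MapFunction<MyClass, Tuple2<Integer, Long>>() {
        @Override
        public Tuple2<Integer, Long> map(MyClass value) throws Exception {
            // spread records over 4 artificial keys so counting runs in parallel
            int key = (value.hashCode() & Integer.MAX_VALUE) % 4;
            return new Tuple2<>(key, 1L);
        }
    })
    .groupBy(0)
    .sum(1); // per-key counts, computed in parallel

long total = 0L;
for (Tuple2<Integer, Long> t : perKey.collect()) {
    total += t.f1; // final sum over at most 4 per-key counts
}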

Is this a good solution?

DataSet<Tuple1<Long>> x = ds.map(new MapFunction<MyClass, Tuple1<Long>>() {
    @Override
    public Tuple1<Long> map(MyClass t) throws Exception {
        return new Tuple1<Long>(1L); // map every element to the tuple (1L)
    }
}).groupBy(0).sum(0); // all tuples share the constant key 1L; sum(0) yields the total count

Long c = x.collect().iterator().next().f0; // the single result tuple carries the count
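One caveat with this version: because field 0 is always 1L, all tuples end up in a single group, so the final merge of sum(0) runs in one task (Flink's built-in aggregations are combinable, though, so partial sums should still be computed in parallel before that merge). An alternative sketch, assuming only the standard DataSet API and the ds and MyClass from above, avoids the per-record tuples entirely by counting each parallel partition once with mapPartition and summing the few partial counts after collect:

// assumes: import org.apache.flink.api.common.functions.MapPartitionFunction;
//          import org.apache.flink.util.Collector;
DataSet<Long> partials = ds.mapPartition(new MapPartitionFunction<MyClass, Long>() {
    @Override
    public void mapPartition(Iterable<MyClass> values, Collector<Long> out) throws Exception {
        long cnt = 0L; // one partial count per parallel partition
        for (MyClass ignored : values) {
            cnt++;
        }
        out.collect(cnt);
    }
});

long total = 0L;
for (Long partial : partials.collect()) {
    total += partial; // tiny final sum on the client side
}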
