Spark: alternatives to groupByKey

Question

We need to run consistency checks on a huge dataset using Spark. The check consists of grouping the data by key (we use groupByKey() for this) and then looping over each group to verify that its records are coherent with one another.

For example, the CSV file below contains the columns to check:

id;dateBegin;dateEnd;event;dateEvent
1;12/02/2015;30/05/2015;active;05/04/2015
1;12/06/2015;30/07/2015;dead;05/07/2015
2;12/02/2016;30/07/2016;dead;05/04/2015

We used JavaRDD<String>.map().groupByKey(), but Spark freezes on large datasets.

Are there other options we could use? Thank you.

Answer 1

According to this documentation: Avoid GroupByKey

reduceByKey() works better on large datasets because Spark can combine the values that share a key on each partition before shuffling the data. groupByKey(), by contrast, shuffles every key-value pair across the network, which moves far more data than necessary.

Look for alternatives to groupByKey, such as:

combineByKey
foldByKey
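
To make the idea concrete, here is a minimal sketch of how a per-key check could be expressed as a small, mergeable summary instead of a full group. It is plain Java with no Spark dependency, so it only simulates partitions locally. The two coherence rules (dateEvent must fall inside [dateBegin, dateEnd], and no "active" event may come after a "dead" one), the class GroupSummarySketch, and all of its method names are assumptions made for illustration; the question does not say what the real check is. In Spark, the add and merge functions would play the roles of the sequence and combine functions of JavaPairRDD.aggregateByKey, a relative of combineByKey and foldByKey.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GroupSummarySketch {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    /** Fixed-size per-key summary; the rules it encodes are hypothetical. */
    static final class Summary {
        final long rows;            // rows seen for this id
        final long outOfPeriod;     // rows whose dateEvent is outside [dateBegin, dateEnd]
        final LocalDate lastActive; // latest "active" dateEvent, or null
        final LocalDate firstDead;  // earliest "dead" dateEvent, or null

        Summary(long rows, long outOfPeriod, LocalDate lastActive, LocalDate firstDead) {
            this.rows = rows;
            this.outOfPeriod = outOfPeriod;
            this.lastActive = lastActive;
            this.firstDead = firstDead;
        }

        static Summary empty() { return new Summary(0, 0, null, null); }

        // Assumed rules: every event inside its period, and no "active" after "dead".
        boolean coherent() {
            return outOfPeriod == 0
                && (lastActive == null || firstDead == null || !lastActive.isAfter(firstDead));
        }
    }

    static LocalDate later(LocalDate a, LocalDate b) {
        if (a == null) return b;
        if (b == null) return a;
        return a.isAfter(b) ? a : b;
    }

    static LocalDate earlier(LocalDate a, LocalDate b) {
        if (a == null) return b;
        if (b == null) return a;
        return a.isBefore(b) ? a : b;
    }

    // Sequence function: fold one CSV line into the running summary of its key.
    static Summary add(Summary s, String line) {
        String[] f = line.split(";");
        LocalDate begin = LocalDate.parse(f[1], FMT);
        LocalDate end = LocalDate.parse(f[2], FMT);
        LocalDate event = LocalDate.parse(f[4], FMT);
        boolean outside = event.isBefore(begin) || event.isAfter(end);
        return new Summary(
            s.rows + 1,
            s.outOfPeriod + (outside ? 1 : 0),
            later(s.lastActive, f[3].equals("active") ? event : null),
            earlier(s.firstDead, f[3].equals("dead") ? event : null));
    }

    // Combine function: merge two partial summaries of the same key.
    static Summary merge(Summary a, Summary b) {
        return new Summary(
            a.rows + b.rows,
            a.outOfPeriod + b.outOfPeriod,
            later(a.lastActive, b.lastActive),
            earlier(a.firstDead, b.firstDead));
    }

    // Local stand-in for what Spark would do. With a JavaPairRDD<String, String>
    // named pairs (hypothetical, keyed by id) the call would look roughly like:
    //   pairs.aggregateByKey(Summary.empty(), GroupSummarySketch::add, GroupSummarySketch::merge)
    // and Summary would also need to implement java.io.Serializable.
    static Map<String, Summary> summarize(List<List<String>> partitions) {
        Map<String, Summary> result = new HashMap<>();
        for (List<String> partition : partitions) {
            // Map-side combine: one partial summary per key per partition.
            Map<String, Summary> partial = new HashMap<>();
            for (String line : partition) {
                String id = line.substring(0, line.indexOf(';'));
                partial.put(id, add(partial.getOrDefault(id, Summary.empty()), line));
            }
            // "Shuffle": only the small partial summaries are merged across partitions.
            partial.forEach((id, s) -> result.merge(id, s, GroupSummarySketch::merge));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(
            List.of("1;12/02/2015;30/05/2015;active;05/04/2015",
                    "2;12/02/2016;30/07/2016;dead;05/04/2015"),
            List.of("1;12/06/2015;30/07/2015;dead;05/07/2015"));
        summarize(partitions).forEach((id, s) ->
            System.out.println("id " + id + " coherent=" + s.coherent()));
    }
}
```

With the sample rows from the question, id 1 passes both assumed rules, while id 2 fails because its dateEvent (05/04/2015) precedes its dateBegin (12/02/2016). The approach only helps when the check can be reduced to a summary like this; a check that genuinely needs every row of a key at once still has to group them.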
