How to replace groupByKey with reduceByKey and still return an Iterable value in Spark Java?
I have a Spark Java program in which a groupByKey followed by a mapValues step returns a pair RDD whose value is an Iterable of all the input RDD values. I have read that using reduceByKey in place of groupByKey with mapValues gives a performance gain, but I don't know how to apply reduceByKey to my problem.
Specifically, I have an input pair RDD whose value has type Tuple5. After the groupByKey and mapValues transformations, I need a key-value pair RDD whose value is an Iterable of the input values.
JavaPairRDD<Long, Tuple5<...>> inputRDD;
...
...
...
JavaPairRDD<Long, Iterable<Tuple5<...>>> groupedRDD = inputRDD
    .groupByKey()
    .mapValues(
        new Function<Iterable<Tuple5<...>>, Iterable<Tuple5<...>>>() {
            @Override
            public Iterable<Tuple5<...>> call(Iterable<Tuple5<...>> v1)
                    throws Exception {
                /*
                  Some steps here..
                */
                return mappedValue;
            }
        });
Is there a way to get the above transformation using reduceByKey?
I've been using Scala on Spark, so this isn't going to be the exact answer you might prefer. The main difference in coding between groupByKey/mapValues and reduceByKey can be seen in a trivial example adapted from this article:
val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithGroup = wordPairsRDD.
  groupByKey.
  mapValues(_.sum)
wordCountsWithGroup.collect
res1: Array[(String, Int)] = Array((two,2), (one,1), (three,3))

val wordCountsWithReduce = wordPairsRDD.
  reduceByKey(_ + _)
wordCountsWithReduce.collect
res2: Array[(String, Int)] = Array((two,2), (one,1), (three,3))
In this example, where x => x.sum (i.e. _.sum) is used in mapValues, it'll be (acc, x) => acc + x (i.e. _ + _) in reduceByKey. The function signatures are vastly different. In mapValues, you're processing the whole collection of grouped values for a key at once, whereas in reduceByKey you're performing a reduction: the function combines two values of the same type into one, and the result must again be of that type.
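Since the question is about Java and about keeping an Iterable as the value, here is a rough sketch of what that signature difference implies. It deliberately uses plain java.util collections as a stand-in for an RDD so it runs without a Spark dependency; the class GroupVsReduce and the helpers groupThenMap, reduceByKey (a local imitation, not Spark's method), collectWithReduce and wordPairs are all hypothetical names made up for this illustration. Because a reduce function has to return the same type it receives, the only way to end up with a list per key is to wrap each value in a one-element list first and then concatenate lists.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Plain-Java stand-in for the two RDD styles; none of this is Spark API.
public class GroupVsReduce {

    // Imitates groupByKey().mapValues(f): gather every value of a key, then hand the whole group to f.
    public static <K, V, R> Map<K, R> groupThenMap(List<Map.Entry<K, V>> pairs, Function<List<V>, R> f) {
        Map<K, List<V>> groups = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<K, R> out = new HashMap<>();
        for (Map.Entry<K, List<V>> g : groups.entrySet()) {
            out.put(g.getKey(), f.apply(g.getValue()));
        }
        return out;
    }

    // Imitates reduceByKey(op): op only ever sees two values of type V and must return a V.
    public static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> op) {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            out.merge(p.getKey(), p.getValue(), op);
        }
        return out;
    }

    // To get a list per key out of a reduction, wrap each value in a singleton list, then concatenate.
    public static <K, V> Map<K, List<V>> collectWithReduce(List<Map.Entry<K, V>> pairs) {
        List<Map.Entry<K, List<V>>> wrapped = new ArrayList<>();
        for (Map.Entry<K, V> p : pairs) {
            List<V> single = new ArrayList<>();
            single.add(p.getValue());
            wrapped.add(new AbstractMap.SimpleEntry<>(p.getKey(), single));
        }
        return reduceByKey(wrapped, (a, b) -> {
            List<V> merged = new ArrayList<>(a);
            merged.addAll(b);
            return merged;
        });
    }

    // Builds (word, 1) pairs, like the map step in the Scala example.
    public static List<Map.Entry<String, Integer>> wordPairs(String... words) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : words) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = wordPairs("one", "two", "two", "three", "three", "three");
        Map<String, Integer> withGroup = groupThenMap(pairs, vs -> vs.stream().mapToInt(Integer::intValue).sum());
        Map<String, Integer> withReduce = reduceByKey(pairs, Integer::sum);
        System.out.println(withGroup.equals(withReduce));           // prints true
        System.out.println(collectWithReduce(pairs).get("three"));  // prints [1, 1, 1]
    }
}
```

On a real JavaPairRDD the same shape would be a mapValues that wraps each value in a list, followed by a reduceByKey that concatenates two lists (aggregateByKey or combineByKey can express it too). As far as I know, this is unlikely to be faster than groupByKey: the gain from reduceByKey comes from shrinking the data per key before the shuffle, and concatenating lists shrinks nothing, since every input value still has to reach the final list. If the "Some steps here" part ultimately boils the Iterable down to something smaller (a sum, a max, a top-N), that smaller result is what is worth expressing as a reduction.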