
Using reduceByKey in Apache Spark (Scala)

I have a list of tuples of type: (user id, name, count).

For example,

val x = sc.parallelize(List(
    ("a", "b", 1),
    ("a", "b", 1),
    ("c", "b", 1),
    ("a", "d", 1))
)

I'm attempting to reduce this collection to a type where each element name is counted.

So the val x above is converted to:

(a,ArrayBuffer((d,1), (b,2)))
(c,ArrayBuffer((b,1)))

Here is the code I am currently using:

val byKey = x.map({case (id,uri,count) => (id,uri)->count})

val grouped = byKey.groupByKey
val count = grouped.map{case ((id,uri),count) => ((id),(uri,count.sum))}
val grouped2: org.apache.spark.rdd.RDD[(String, Seq[(String, Int)])] = count.groupByKey

grouped2.foreach(println)

I'm attempting to use reduceByKey as it performs faster than groupByKey.

How can reduceByKey be used instead of the above code to produce the same mapping?

Following your code:

val byKey = x.map({case (id,uri,count) => (id,uri)->count})

You could do:

val reducedByKey = byKey.reduceByKey(_ + _)

scala> reducedByKey.collect.foreach(println)
((a,d),1)
((a,b),2)
((c,b),1)

PairRDDFunctions[K,V].reduceByKey takes an associative reduce function that is applied to the values of type V of the RDD[(K,V)]. In other words, you need a function f[V](e1: V, e2: V): V. In this particular case, with sum on Ints: (x: Int, y: Int) => x + y, or _ + _ in short underscore notation.
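Those per-key semantics can be modeled with plain Scala collections, no Spark required. Here is a minimal sketch using the question's ((id, uri), count) pairs; reduceByKeyLocal is a hypothetical helper, not a Spark API:

```scala
// Hypothetical local model of reduceByKey's per-key semantics:
// group the pairs by key, then fold each group's values with the
// associative function f[V](e1: V, e2: V): V.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(f) }

val byKey = Seq((("a", "b"), 1), (("a", "b"), 1), (("c", "b"), 1), (("a", "d"), 1))
val reduced = reduceByKeyLocal(byKey)(_ + _)
// reduced == Map(("a","b") -> 2, ("c","b") -> 1, ("a","d") -> 1)
```

Because f must be associative, Spark is free to apply it in any grouping order across partitions and still get the same result.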

For the record: reduceByKey performs better than groupByKey because it attempts to apply the reduce function locally before the shuffle/reduce phase. groupByKey will force a shuffle of all elements before grouping.
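To see why that local (map-side) combining matters, here is a small sketch in plain Scala, with partitions simulated as local sequences (all names here are hypothetical). At most one combined record per key per partition has to cross the simulated shuffle boundary, instead of every raw element:

```scala
// Two simulated partitions of raw (key, 1) records: 5 records total.
val partitions = Seq(
  Seq(("a", 1), ("a", 1), ("b", 1)), // partition 0
  Seq(("a", 1), ("b", 1))            // partition 1
)

// Map-side combine (what reduceByKey does before the shuffle):
// at most one record per key survives within each partition.
val combined = partitions.map(
  _.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }.toSeq
)

// Records crossing the simulated shuffle: 4 instead of the raw 5.
val shuffled = combined.flatten

// Reduce side: merge the pre-combined records per key.
val merged = shuffled.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
// merged == Map("a" -> 3, "b" -> 2)
```

With many duplicate keys per partition, the saving grows: groupByKey would send all raw records across the network before grouping.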

Your original data structure is RDD[(String, String, Int)], and reduceByKey can only be used if the data structure is RDD[(K, V)].

val kv = x.map(e => e._1 -> e._2 -> e._3) // kv is RDD[((String, String), Int)]
val reduced = kv.reduceByKey(_ + _)       // reduced is RDD[((String, String), Int)]
val kv2 = reduced.map(e => e._1._1 -> (e._1._2 -> e._2)) // kv2 is RDD[(String, (String, Int))]
val grouped = kv2.groupByKey()            // grouped is RDD[(String, Iterable[(String, Int)])]
grouped.foreach(println)
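The four steps above can be checked end to end with plain Scala collections as a local stand-in for the RDD operations (groupBy plays the role of both reduceByKey and groupByKey here; this is a sketch, not Spark code):

```scala
val x = List(("a", "b", 1), ("a", "b", 1), ("c", "b", 1), ("a", "d", 1))

// Step 1: build the composite key, as in kv.
val kv = x.map { case (id, uri, n) => (id, uri) -> n }
// Step 2: sum per (id, uri) key, mirroring reduceByKey(_ + _).
val reducedLocal = kv.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
// Step 3: re-key by id alone, as in kv2.
val kv2Local = reducedLocal.toList.map { case ((id, uri), n) => id -> (uri -> n) }
// Step 4: collect the (uri, count) pairs per id, mirroring groupByKey.
val groupedLocal = kv2Local.groupBy(_._1).map { case (id, vs) => id -> vs.map(_._2) }
// groupedLocal("a") contains ("b", 2) and ("d", 1); groupedLocal("c") contains ("b", 1)
```

The shapes match the desired output: each user id maps to its per-name counts.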

The syntax is below:

reduceByKey(func: Function2[V, V, V]): JavaPairRDD[K, V],

which says that, for each key in the RDD, it takes the values (which will necessarily be of the same type), applies the operation provided as the function, and returns a value of the same type as in the parent RDD.
