
reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way the reduceByKey function works in Spark. Suppose we have the following code:

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

The map function is clear: s is the key and it points to each line from data.txt, and 1 is the value.

However, I don't understand how reduceByKey works internally. Does "a" point to the key? Or does "a" point to "s"? Then what does a + b represent? How are they filled?

Let's break it down to discrete methods and types. That usually exposes the intricacies for new devs:

pairs.reduceByKey((a, b) => a + b)

becomes

pairs.reduceByKey((a: Int, b: Int) => a + b)

and renaming the variables makes it a little more explicit

pairs.reduceByKey((accumulatedValue: Int, currentValue: Int) => accumulatedValue + currentValue)

So, we can now see that we are simply taking an accumulated value for the given key and summing it with the next value of that key. NOW, let's break it down further so we can understand the key part. So, let's visualize the method more like this:

pairs.reduce((accumulatedValue: List[(String, Int)], currentValue: (String, Int)) => {
  //Turn the accumulated value into a true key->value mapping
  val accumAsMap = accumulatedValue.toMap   
  //Try to get the key's current value if we've already encountered it
  accumAsMap.get(currentValue._1) match { 
    //If we have encountered it, then add the new value to the existing value and overwrite the old
    case Some(value : Int) => (accumAsMap + (currentValue._1 -> (value + currentValue._2))).toList
    //If we have NOT encountered it, then simply add it to the list
    case None => currentValue :: accumulatedValue 
  }
})
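
If the RDD-level pseudocode above still feels abstract, here is a minimal, runnable sketch of the same per-key accumulation using a plain Scala List and foldLeft (no Spark involved; the sample data is made up purely for illustration):

// A tiny stand-in for pairs: each line becomes (line, 1)
val pairs: List[(String, Int)] = List(("a line", 1), ("another", 1), ("a line", 1))

// The same accumulate-by-key logic, written as a foldLeft over a Map
val counts: Map[String, Int] =
  pairs.foldLeft(Map.empty[String, Int]) { (acc, kv) =>
    val (key, value) = kv
    acc.get(key) match {
      // Key seen before: add the new value to the running total
      case Some(existing) => acc + (key -> (existing + value))
      // Key not seen yet: start its total at this value
      case None           => acc + (key -> value)
    }
  }

println(counts)  // Map(a line -> 2, another -> 1)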

So, you can see that reduceByKey handles the boilerplate of finding the key and tracking it, so that you don't have to worry about managing that part.

Deeper, truer if you want

All that being said, that is a simplified version of what happens, as there are some optimizations done here. This operation is associative, so the Spark engine will perform these reductions locally first (often termed map-side reduce) and then once again after the shuffle. This saves network traffic; instead of sending all the data and performing the operation, the engine can reduce each partition's data as far as it can and then send only that reduction over the wire.
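
To make that map-side step concrete, here is a minimal sketch in plain Scala (no Spark; the two lists simply simulate partitions, and the names are only illustrative): each "partition" is reduced locally, and only the small partial results would then need to be merged, using the same function.

// Simulate two partitions of (word, 1) pairs
val partition1 = List(("aa", 1), ("bb", 1), ("aa", 1))
val partition2 = List(("bb", 1), ("cc", 1))

// Map-side reduce: each partition sums its own keys locally
def localReduce(part: List[(String, Int)]): Map[String, Int] =
  part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

val partial1 = localReduce(partition1)  // Map(aa -> 2, bb -> 1)
val partial2 = localReduce(partition2)  // Map(bb -> 1, cc -> 1)

// Only these small partial results would travel over the network,
// where they are merged with the same function
val merged = (partial1.toSeq ++ partial2.toSeq)
  .groupBy(_._1)
  .map { case (k, vs) => k -> vs.map(_._2).sum }

println(merged)  // e.g. Map(aa -> 2, bb -> 2, cc -> 1); ordering may vary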

One requirement for the reduceByKey function is that it must be associative (and, in Spark, commutative). To build some intuition about how reduceByKey works, let's first see how an associative function helps us in a parallel computation:

[Figure: an associative function in action]

As we can see, we can break an original collection into pieces and, by applying the associative function, accumulate a total. The sequential case is trivial; we are used to it: 1+2+3+4+5+6+7+8+9+10.

Associativity lets us use that same function in sequence and in parallel. reduceByKey uses that property to compute a result out of an RDD, which is a distributed collection consisting of partitions.
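
As a quick plain-Scala sanity check of that claim (a sketch; the grouped chunks below merely simulate partitions), summing sequentially and summing chunk by chunk before combining the partial sums give the same result, because + is associative:

val numbers = (1 to 10).toList

// Sequential: 1+2+3+...+10
val sequential = numbers.reduce(_ + _)                         // 55

// "Parallel" shape: reduce each chunk, then reduce the partial sums
val partials = numbers.grouped(3).map(_.reduce(_ + _)).toList  // List(6, 15, 24, 10)
val combined = partials.reduce(_ + _)                          // 55

assert(sequential == combined)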

Consider the following example:

// collection of the form ("key",1),("key",2),...,("key",20) split among 4 partitions
val rdd = sparkContext.parallelize((1 to 20).map(x => ("key", x)), 4)
val reduced = rdd.reduceByKey(_ + _)
reduced.collect()
> Array[(String, Int)] = Array((key,210))

In Spark, data is distributed into partitions. In the next illustration, the (4) partitions are to the left, enclosed in thin lines. First, we apply the function locally within each partition, sequentially inside the partition, but all 4 partitions run in parallel. Then the results of the local computations are aggregated by applying the same function again, which finally yields the result.

[Figure: each of the 4 partitions is reduced locally, and the partial results are then combined into the final result]
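
If you want to see how the example rdd is actually laid out across its 4 partitions before the reduction, glom collects each partition into an array (a sketch reusing the rdd defined above; the slicing shown in the comment is what parallelize typically does with a range, so treat it as approximate):

rdd.glom().collect()
// roughly:
// Array(Array((key,1), ..., (key,5)),
//       Array((key,6), ..., (key,10)),
//       Array((key,11), ..., (key,15)),
//       Array((key,16), ..., (key,20)))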

reduceByKey is a specialization of aggregateByKey. aggregateByKey takes two functions: one that is applied to each partition (sequentially) and one that is applied among the results of each partition (in parallel). reduceByKey uses the same associative function in both cases: to do a sequential computation on each partition and then combine those partial results into a final result, as illustrated here.
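
For comparison, here is roughly what that relationship looks like in code (a sketch reusing the rdd from the example above): aggregateByKey spells out the zero value, the per-partition function, and the cross-partition function separately, while reduceByKey reuses a single associative function for both steps.

// aggregateByKey: zero value, per-partition (sequential) op, cross-partition (combine) op
val viaAggregate = rdd.aggregateByKey(0)(_ + _, _ + _)

// reduceByKey: the same associative function plays both roles
val viaReduce = rdd.reduceByKey(_ + _)

// Both yield Array((key,210)) for the example rdd above
viaAggregate.collect()
viaReduce.collect()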

In your example of

val counts = pairs.reduceByKey((a,b) => a+b)

a and b are both Int accumulators for the _2 of the tuples in pairs. reduceByKey will take two tuples with the same value of s and use their _2 values as a and b, producing a new Tuple2[String, Int]. This operation is repeated until there is only one tuple for each key s.

Unlike a non-Spark (or, really, non-parallel) reduceByKey, where the first element is always the accumulator and the second a value, reduceByKey operates in a distributed fashion: each node reduces its set of tuples into a collection of uniquely-keyed tuples and then reduces the tuples from multiple nodes until there is a final, uniquely-keyed set of tuples. This means that as the results from the nodes are reduced, a and b represent already-reduced accumulators.

Spark RDD's reduceByKey function merges the values for each key using an associative reduce function.

The reduceByKey function works only on pair RDDs, and it is a transformation, which means it is lazily evaluated. An associative function is passed as a parameter; it is applied to the source RDD and creates a new RDD as a result.

So in your example, the rdd pairs has a set of multiple paired elements like (s1,1), (s2,1) and so on. reduceByKey accepts a function (accumulator, n) => accumulator + n; for each key it starts from the first value it encounters and adds the remaining values to it, returning the result rdd counts, in which each key is paired with its total count.
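
If you want that starting value to be explicit, foldByKey is the variant that takes a zero value; for sums it produces the same totals as reduceByKey (a sketch reusing the pairs rdd from the question):

// foldByKey makes the initial accumulator explicit (0 here);
// reduceByKey instead starts from the first value it sees for each key
val countsViaFold   = pairs.foldByKey(0)(_ + _)
val countsViaReduce = pairs.reduceByKey(_ + _)
// Both yield the same per-key totals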

It is simple if your input RDD data looks like this: (aa,1) (bb,1) (aa,1) (cc,1) (bb,1)

If you apply reduceByKey on the above rdd data, there are a few things to remember: reduceByKey always takes two inputs (x, y) and always works with two values at a time. Since it is reduceByKey, it combines two rows with the same key and combines their values.

val rdd2 = rdd.reduceByKey((x, y) => x + y)
rdd2.foreach(println)

output: (aa,2) (bb,2) (cc,1)
