How to reference Spark broadcast variables outside of scope

Question

All the examples I've seen for Spark broadcast variables define them in the scope of the functions that use them (map(), join(), etc.). I would like to use both a map() function and a mapPartitions() function that reference a broadcast variable, but I would like to modularize them so that I can reuse the same functions for unit testing.

How can I accomplish this?

One thought I had was to curry the function so that I pass a reference to the broadcast variable when calling either map or mapPartitions.

Are there any performance implications of passing around the reference to the broadcast variable that would not arise when the functions are defined inside the original scope?

I had something like this in mind (pseudo-code):

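// firstFile.scala
// ---------------

import org.apache.spark.broadcast.Broadcast

object firstFile {
  // Placeholder row type; substitute the real row type of the RDD.
  type SomeRow = (Int, String)

  // Curried: the first parameter list takes the broadcast handle, not its value.
  def mapper(bcast: Broadcast[Map[Int, Int]])(row: SomeRow): Int =
    bcast.value(row._1)

  def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
    // Resolved once per partition, on the executor.
    val broadcastVariable = bcast.value

    for (i <- iter) yield broadcastVariable(i)
  }
}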


// secondFile.scala
// ----------------

import firstFile.{mapMyPartition, mapper}

val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

rdd
 .map(mapper(bcastVariable))
 .mapPartitions(mapMyPartition(bcastVariable))

Answer 1

Your solution should work fine. In both cases the function passed to map{Partitions} contains a reference to the broadcast variable itself when it is serialized, but not to its value, and bcast.value is only called when the function is evaluated on the worker node.

What needs to be avoided is something like this:

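def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  // Evaluated on the driver as soon as mapper(bcast) is called...
  val value = bcast.value
  // ...so the returned closure captures the whole Map, not the broadcast handle.
  row => value(row._1)
}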
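
If a function-returning form is still wanted, one option is to keep the .value call inside the returned closure, so that only the broadcast handle is captured. The following is a minimal sketch under assumptions: the object name DeferredValue and the (Int, String) definition of SomeRow are made up for illustration and do not come from the question.

```scala
import org.apache.spark.broadcast.Broadcast

object DeferredValue {
  // Hypothetical row type; the question does not define SomeRow.
  type SomeRow = (Int, String)

  // The returned closure captures only the broadcast handle;
  // bcast.value is called when a row is processed, not when mapper(bcast) is called.
  def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int =
    row => bcast.value(row._1)
}
```

It would be used the same way as the curried version, for example rdd.map(DeferredValue.mapper(bcastVariable)). As far as I know, calling bcast.value once per row is not a concern in practice, because the value is kept on the executor after it has first been read.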

Answer 2

You are doing this correctly. You just have to remember to pass the broadcast reference and not the value itself. Using your example, the difference can be shown as follows:

a) efficient way:

// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3)) 

rdd
.map(mapper(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
.mapPartitions(mapMyPartition(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker

b) inefficient way:

// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3)) 

rdd
.map(mapper(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker
.mapPartitions(mapMyPartition(bcastVariable.value)) // the whole Map[Int, Int] is serialized and sent to every worker

Of course, in the second example mapper and mapMyPartition would have slightly different signatures.
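
For comparison, the value-taking signatures from (b) might look like the sketch below. The names ByValue, mapperByValue and mapMyPartitionByValue, and the (Int, String) row type, are assumptions for illustration only. Functions of this shape need no SparkContext, so they are easy to exercise in a plain unit test, but passing them to map() or mapPartitions() with bcastVariable.value ships the whole map inside every task closure.

```scala
object ByValue {
  // Hypothetical row type; the question does not define SomeRow.
  type SomeRow = (Int, String)

  // Takes the lookup table itself instead of a Broadcast handle.
  def mapperByValue(lookup: Map[Int, Int])(row: SomeRow): Int =
    lookup(row._1)

  // Same idea for a whole partition: map every element through the table.
  def mapMyPartitionByValue(lookup: Map[Int, Int])(iter: Iterator[Int]): Iterator[Int] =
    iter.map(lookup)
}
```

A Broadcast-taking function can delegate to these by calling bcast.value inside its own body, which keeps the Spark job on the efficient path shown in (a).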

Answer 1
2 accepted 2016-04-25 19:42:15

Answer 2
2 2016-04-25 23:43:02
