How to Reference Spark Broadcast Variables Outside of Scope
All the examples I've seen for Spark broadcast variables define them in the scope of the functions using them (map(), join(), etc.). I would like to use both a map() function and a mapPartitions() function that reference a broadcast variable, but I would like to modularize them so I can reuse the same functions in unit tests.
A thought I had was to curry the function so that I pass a reference to the broadcast variable when using either a map or mapPartitions call.
I had something like this in mind (pseudo-code):
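// firstFile.scala
// ---------------
import org.apache.spark.broadcast.Broadcast

object firstFile {
  // SomeRow is a placeholder; here it is assumed to be a tuple keyed by an Int.
  type SomeRow = (Int, String)

  def mapper(bcast: Broadcast[Map[Int, Int]])(row: SomeRow): Int =
    bcast.value(row._1)

  def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
    val broadcastVariable = bcast.value // resolved once per partition
    for (i <- iter) yield broadcastVariable(i)
  }
}

// secondFile.scala
// ----------------
import firstFile.{mapMyPartition, mapper}

// sc is an existing SparkContext; rdd is an existing RDD[SomeRow]
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable))
  .mapPartitions(mapMyPartition(bcastVariable))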
Your solution should work fine. In both cases the function passed to map{Partitions} will contain a reference to the broadcast variable itself when serialized, but not to its value, and will only call bcast.value when it is evaluated on the node.
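Since the goal was unit testing, here is a minimal sketch of how the curried functions could be exercised against a local SparkContext. The object name CurriedBroadcastCheck, the Row alias, the app name, and the sample data are all assumptions made for illustration; only the Map contents come from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object CurriedBroadcastCheck {
  // Hypothetical row shape; the question leaves SomeRow unspecified.
  type Row = (Int, String)

  def mapper(bcast: Broadcast[Map[Int, Int]])(row: Row): Int =
    bcast.value(row._1)

  def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
    val lookup = bcast.value // read the broadcast value once per partition
    iter.map(lookup)
  }

  // Runs both functions in local mode and returns the collected result.
  def run(): List[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("curried-broadcast-check")
    val sc = new SparkContext(conf)
    try {
      val bcast = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
      sc.parallelize(Seq((0, "a"), (1, "b")))
        .map(mapper(bcast))                   // 0 -> 1, 1 -> 2
        .mapPartitions(mapMyPartition(bcast)) // 1 -> 2, 2 -> 3
        .collect()
        .toList
    } finally sc.stop()
  }
}
```

With the sample rows above, run() should return List(2, 3). The same mapper and mapMyPartition definitions can then be imported by the production job unchanged.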
What needs to be avoided is something like:
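def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  val value = bcast.value // evaluated on the driver, before the function is shipped
  row => value(row._1)    // the closure now captures the whole Map, not the broadcast handle
}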
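The difference between capturing the handle and capturing the value can be illustrated without Spark. The sketch below uses a hypothetical Handle class as a stand-in for a broadcast variable (it is not Spark's Broadcast, it only mimics the idea of a small serializable handle with a non-serialized payload) and plain Java serialization to compare the size of the two closures.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a broadcast handle: only the id is serialized,
// the payload is marked @transient and is skipped by Java serialization.
final class Handle[T](val id: Long, @transient private val payload: T) extends Serializable {
  def value: T = payload
}

object ClosureSizeSketch {
  // Serializes an object with Java serialization and returns the byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  // Captures the handle; the value is looked up only when the function runs.
  def byHandle(h: Handle[Map[Int, Int]]): Int => Int =
    i => h.value(i)

  // Captures the dereferenced Map itself, so the Map travels with the closure.
  def byValue(h: Handle[Map[Int, Int]]): Int => Int = {
    val m = h.value
    i => m(i)
  }

  def main(args: Array[String]): Unit = {
    val table = (0 until 10000).map(i => i -> (i + 1)).toMap
    val h = new Handle(1L, table)
    println(s"closure capturing the handle: ${serializedSize(byHandle(h))} bytes")
    println(s"closure capturing the value:  ${serializedSize(byValue(h))} bytes")
  }
}
```

The closure built by byValue should come out far larger, because the entire Map is serialized along with it. That is the cost the anti-pattern above incurs for every task.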
You are doing this correctly. You just have to remember to pass the broadcast reference and not the value itself. Using your example, the difference can be shown as follows:
a) efficient way:
// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  // only the reference to the Map[Int, Int] is serialized and sent to every worker
  .map(mapper(bcastVariable))
  // only the reference to the Map[Int, Int] is serialized and sent to every worker
  .mapPartitions(mapMyPartition(bcastVariable))
b) inefficient way:
// the whole Map[Int, Int] is serialized and sent to every worker
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  // the whole Map[Int, Int] is serialized and sent to every worker again
  .map(mapper(bcastVariable.value))
  // the whole Map[Int, Int] is serialized and sent to every worker again
  .mapPartitions(mapMyPartition(bcastVariable.value))
Of course, in the second example mapper and mapMyPartition would have slightly different signatures.
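For illustration, the value-taking variants used in (b) might look like the sketch below. The object name ValueBasedVariants and the (Int, String) row shape are assumptions, not part of the original example.

```scala
object ValueBasedVariants {
  // Takes the plain Map instead of a Broadcast handle.
  def mapper(lookup: Map[Int, Int])(row: (Int, String)): Int =
    lookup(row._1)

  // Same idea for the partition-level function.
  def mapMyPartition(lookup: Map[Int, Int])(iter: Iterator[Int]): Iterator[Int] =
    iter.map(lookup)
}
```

These are convenient to call from a unit test because they need no SparkContext, but when passed to map or mapPartitions they carry the whole Map inside each serialized function, which is exactly what (b) warns about.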