How to Reference Spark Broadcast Variables Outside of Scope
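All the examples I have seen for Spark broadcast variables define them within the scope of the functions that use them (map(), join(), etc.). I would like to use both a map() function and a mapPartitions() function that reference a broadcast variable, but I want to modularize them so that I can reuse the same functions in unit tests.
My idea is to curry the functions so that a reference to the broadcast variable is passed in when they are called from map or mapPartitions.
I have something like this in mind (pseudocode):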
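// firstFile.scala
// ---------------
import org.apache.spark.broadcast.Broadcast

object firstFile {
  // Assumed row shape for illustration: the first field is the lookup key.
  type SomeRow = (Int, String)

  def mapper(bcast: Broadcast[Map[Int, Int]])(row: SomeRow): Int =
    bcast.value(row._1)

  def mapMyPartition(bcast: Broadcast[Map[Int, Int]])(iter: Iterator[Int]): Iterator[Int] = {
    // Resolve the broadcast value once per partition, on the executor.
    val broadcastVariable = bcast.value
    for (i <- iter) yield broadcastVariable(i)
  }
}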
// secondFile.scala
// ----------------
import firstFile.{mapMyPartition, mapper}

// `sc` (a SparkContext) and `rdd` (an RDD of SomeRow) are assumed to be in scope.
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable))
  .mapPartitions(mapMyPartition(bcastVariable))
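Since the goal is unit testing, here is a minimal sketch of how a curried function like mapper could be exercised against a local SparkContext. It assumes spark-core is on the classpath and Scala 2.12 or later; the object name BroadcastMapperCheck, the app name, and the (Int, String) row shape are illustrative choices, not part of the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastMapperCheck {
  // Same shape as the question's mapper; the row type is an assumption.
  def mapper(bcast: Broadcast[Map[Int, Int]])(row: (Int, String)): Int =
    bcast.value(row._1)

  def main(args: Array[String]): Unit = {
    // Single-threaded local master, so no cluster is needed for the check.
    val conf = new SparkConf().setMaster("local[1]").setAppName("broadcast-mapper-check")
    val sc = new SparkContext(conf)
    try {
      val bcast = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))

      // The function can be called directly, without building an RDD.
      assert(mapper(bcast)((1, "b")) == 2)

      // The same function can be passed to map on a small RDD.
      val out = sc.parallelize(Seq((0, "a"), (2, "c"))).map(mapper(bcast)).collect().toList
      assert(out == List(1, 3))
    } finally {
      sc.stop()
    }
  }
}
```

The direct call is usually enough for testing the lookup logic; the RDD call additionally checks that the closure survives Spark's serialization.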
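Your solution should work fine. In both cases, the function passed to map{Partitions} will, when serialized, contain a reference to the broadcast variable itself but not to its value, and bcast.value is only called when the computation runs on the node.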
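What you need to avoid is something like the following, where bcast.value is evaluated on the driver as soon as mapper is called, so the returned closure captures the value itself:
def mapper(bcast: Broadcast[Map[Int, Int]]): SomeRow => Int = {
  val value = bcast.value // evaluated eagerly, before the closure is shipped
  row => value(row._1)
}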
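The difference between the two forms can be illustrated without Spark by measuring what Java serialization writes for each closure. This is only a sketch: Handle is a hypothetical stand-in for a broadcast handle (an id plus a transient payload), not Spark's Broadcast class, and it does not model how a real Broadcast resolves its value on the executor.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a broadcast handle: only `id` is serialized,
// the payload is marked transient and is skipped.
final class Handle[T](val id: Long, @transient val value: T) extends Serializable

object ClosureSizeDemo {
  // Number of bytes Java serialization writes for the given object.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  // Captures the handle; the value is looked up only when the function runs.
  def lazyMapper(h: Handle[Map[Int, Int]]): Int => Int =
    key => h.value(key)

  // Reads the value first, so the closure captures the whole map.
  def eagerMapper(h: Handle[Map[Int, Int]]): Int => Int = {
    val value = h.value
    key => value(key)
  }

  def main(args: Array[String]): Unit = {
    val big = (0 until 10000).map(i => i -> (i + 1)).toMap
    val handle = new Handle(0L, big)
    val small = serializedSize(lazyMapper(handle))
    val large = serializedSize(eagerMapper(handle))
    println(s"captures handle: $small bytes, captures value: $large bytes")
    assert(small < large)
  }
}
```

The closure that captures the handle stays small regardless of the map's size, while the eager variant grows with the map.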
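You are doing it correctly. You just need to remember to pass the broadcast reference rather than the value itself. Using your example, the difference might look like this: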
a) The efficient way:
// the whole Map[Int, Int] is serialized and sent to every worker once, as a broadcast
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
  .mapPartitions(mapMyPartition(bcastVariable)) // only the reference to the Map[Int, Int] is serialized and sent to every worker
b) The inefficient way:
// the whole Map[Int, Int] is serialized and sent to every worker once, as a broadcast
val bcastVariable = sc.broadcast(Map(0 -> 1, 1 -> 2, 2 -> 3))
rdd
  .map(mapper(bcastVariable.value)) // the whole Map[Int, Int] is serialized again with the closure and sent to every worker
  .mapPartitions(mapMyPartition(bcastVariable.value)) // the whole Map[Int, Int] is serialized again with the closure and sent to every worker
Of course, in the second example mapper and mapMyPartition would have slightly different signatures.
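For illustration, the value-taking signatures that variant b) implies might look like the sketch below. The object name ValueTakingVariants and the (Int, String) row shape are assumptions made for the example. These versions need no Spark types at all, which makes them easy to test, but the map travels with every closure that uses them.

```scala
object ValueTakingVariants {
  // Takes the plain Map instead of a Broadcast handle.
  def mapper(lookup: Map[Int, Int])(row: (Int, String)): Int =
    lookup(row._1)

  // Same idea for the per-partition function.
  def mapMyPartition(lookup: Map[Int, Int])(iter: Iterator[Int]): Iterator[Int] =
    iter.map(lookup)

  def main(args: Array[String]): Unit = {
    val lookup = Map(0 -> 1, 1 -> 2, 2 -> 3)
    assert(mapper(lookup)((0, "a")) == 1)
    assert(mapMyPartition(lookup)(Iterator(1, 2)).toList == List(2, 3))
  }
}
```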