
Spark: Broadcast usage in local mode

I know that broadcasting keeps a read-only copy of a value cached on each machine rather than shipping a copy of it with every task. However, I would like to know whether broadcasting has any significant impact in local mode, since I don't have a cluster of nodes. Or is it fine to skip broadcasting in local mode? I'm just trying to understand its usage.

Spark version 2.0, Scala version 2.10, local mode, 8-core CPU, 64 GB RAM

I have something like the following:

case class EmpDim(name: String,age: Int)

empDF
+-----+-------+------+
|EmpId|EmpName|EmpAge|
+-----+-------+------+
|    1|   John|    32|
|    2|  David|    45|
+-----+-------+------+

deptDF
+------+--------+-----+
|DeptID|DeptName|EmpID|
+------+--------+-----+
|     1|   Admin|    1|
|     2|      HR|    2|
|     3| Finance|    4|
+------+--------+-----+

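import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.{coalesce, lit, udf}
import scala.collection.Map // collectAsMap() returns scala.collection.Map, not Predef's immutable Map

// Key each employee by EmpId so it can be looked up from deptDF.
val empRDD = empDF.rdd.map(x => (x.getInt(0), EmpDim(x.getString(1), x.getInt(2))))

val lookupMap = empRDD.collectAsMap() //Without Broadcast
val broadCastLookupMap: Broadcast[Map[Int, EmpDim]] = sc.broadcast(empRDD.collectAsMap()) //With Broadcast

def lookup(lookupMap: Map[Int, EmpDim]) = udf[Option[EmpDim], Int]((empID: Int) => lookupMap.lift(empID))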

val combinedDF = deptDF.withColumn("lookupEmp",lookup(lookupMap)($"EmpID")) //Without Broadcast
                       .withColumn("broadCastLookupEmp",lookup(broadCastLookupMap.value)($"EmpID")) //With Broadcast
                       .withColumn("EmpName",coalesce($"lookupEmp.name",lit("Unknown - No Name to Lookup")))
                       .withColumn("EmpAge",coalesce($"lookupEmp.age",lit("Unknown - No Age to Lookup")))
                       .drop("lookupEmp")
                       .drop("broadCastLookupEmp")

+------+--------+-----+---------------------------+--------------------------+
|DeptID|DeptName|EmpID|EmpName                    |EmpAge                    |
+------+--------+-----+---------------------------+--------------------------+
|1     |Admin   |1    |John                       |32                        |
|2     |HR      |2    |David                      |45                        |
|3     |Finance |4    |Unknown - No Name to Lookup|Unknown - No Age to Lookup|
+------+--------+-----+---------------------------+--------------------------+

In the above scenario, is it advisable to use broadcast, or is it overkill? Please advise.

When used like this, broadcasting has no value at all.

When you call:

lookup(broadCastLookupMap.value)($"EmpID")

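broadCastLookupMap.value is evaluated locally on the driver before lookup runs, because Scala evaluates by-value arguments at the call site (the substitution model). The UDF's closure therefore captures the plain Map rather than the broadcast variable.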
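To make the evaluation order concrete, here is a minimal plain-Scala sketch with no Spark involved. FakeBroadcast is a hypothetical stand-in for Spark's Broadcast, invented for this illustration; it only counts how often .value is read. The map holds String values instead of EmpDim to keep the sketch short.

```scala
// Hypothetical stand-in for org.apache.spark.broadcast.Broadcast (an assumption
// for illustration only): it counts reads of `.value` to expose when they happen.
final class FakeBroadcast[T](payload: T) {
  var reads: Int = 0
  def value: T = { reads += 1; payload }
}

object ByValueVsHandle {
  // By-value parameter: the caller evaluates `.value` before this method runs,
  // so the returned function closes over the plain Map.
  def lookupByValue(m: Map[Int, String]): Int => Option[String] =
    id => m.get(id)

  // Handle parameter: `.value` is read inside the returned function,
  // so the function closes over the handle instead of the Map.
  def lookupByHandle(b: FakeBroadcast[Map[Int, String]]): Int => Option[String] =
    id => b.value.get(id)

  def main(args: Array[String]): Unit = {
    val bc = new FakeBroadcast(Map(1 -> "John", 2 -> "David"))

    val f = lookupByValue(bc.value)
    println(bc.reads) // 1: `.value` was read at the call site
    f(1); f(4)
    println(bc.reads) // still 1: f never touches the handle

    val g = lookupByHandle(bc)
    println(bc.reads) // still 1: building g reads nothing
    g(1); g(4)
    println(bc.reads) // 3: one read per call of g
  }
}
```

In Spark terms, the function returned by lookupByValue corresponds to a UDF whose closure carries the whole map, while the one returned by lookupByHandle carries only the handle and reads the value where the function actually runs.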

A correct implementation would be:

def lookup(lookupMap: Broadcast[Map[Int, EmpDim]]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.value.lift(empID)
)

and it would be called as:

lookup(broadCastLookupMap)($"EmpID")

which might have some positive impact, depending on the actual execution plan. The same rules apply in local and non-local mode:

  • If the data is reused between stages (explicitly or implicitly), broadcasting can be useful.
  • If the data is used only once in the pipeline, the standard closure / argument processing mechanism is enough.

Nothing here suggests the first case, so the broadcast should be unnecessary. If you want to be sure, test both solutions in a live environment and compare the results.

Passing the argument by name (call-by-name) should work too:

def lookup(lookupMap: => Map[Int,EmpDim]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.lift(empID)
)
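As a rough plain-Scala illustration of why the by-name variant defers the read, the sketch below uses a hypothetical CountingHandle class (not a Spark type) that counts accesses to .value. It says nothing about how Spark serializes the resulting closure; it only shows when the argument expression is evaluated.

```scala
// Hypothetical handle, invented for this sketch, that counts reads of `.value`.
final class CountingHandle[T](payload: T) {
  var reads: Int = 0
  def value: T = { reads += 1; payload }
}

object ByNameLookup {
  // `m` is a by-name parameter: the expression passed in is re-evaluated
  // every time `m` is referenced inside the returned function.
  def lookup(m: => Map[Int, String]): Int => Option[String] =
    id => m.get(id)

  def main(args: Array[String]): Unit = {
    val handle = new CountingHandle(Map(1 -> "John", 2 -> "David"))

    val f = lookup(handle.value)
    println(handle.reads) // 0: `handle.value` has not been evaluated yet

    println(f(1))         // Some(John)
    println(f(4))         // None
    println(handle.reads) // 2: one read per call of f
  }
}
```

Because the expression is evaluated only inside the returned function, the closure holds a reference to the handle rather than to an already extracted map, which is the same effect the Broadcast-typed parameter achieves.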
