
Spark: Broadcast usage on local mode

I know that broadcasting keeps a read-only copy of a variable cached on each machine rather than shipping a copy of it with every task. However, I would like to know whether broadcasting has any significant impact in local mode, since I don't have a cluster of nodes. Or is it fine to skip broadcasting in local mode? I'm just trying to understand its usage.

Spark version 2.0, Scala version 2.10, local mode (8-core CPU, 64 GB RAM)

I have something like the following:

case class EmpDim(name: String,age: Int)

empDF
+-----+-------+------+
|EmpId|EmpName|EmpAge|
+-----+-------+------+
|    1|   John|    32|
|    2|  David|    45|
+-----+-------+------+

deptDF
+------+--------+-----+
|DeptID|DeptName|EmpID|
+------+--------+-----+
|     1|   Admin|    1|
|     2|      HR|    2|
|     3| Finance|    4|
+------+--------+-----+

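import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.functions.{coalesce, lit, udf}
import scala.collection.Map // collectAsMap() returns scala.collection.Map, not the default immutable Map

val empRDD = empDF.rdd.map(x => (x.getInt(0), EmpDim(x.getString(1), x.getInt(2))))

val lookupMap = empRDD.collectAsMap() // Without broadcast
val broadCastLookupMap: Broadcast[Map[Int, EmpDim]] = sc.broadcast(empRDD.collectAsMap()) // With broadcast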

def lookup(lookupMap:Map[Int,EmpDim]) = udf[Option[EmpDim],Int]((empID:Int) => lookupMap.lift(empID))

val combinedDF = deptDF.withColumn("lookupEmp",lookup(lookupMap)($"EmpID")) //Without Broadcast
                       .withColumn("broadCastLookupEmp",lookup(broadCastLookupMap.value)($"EmpID")) //With Broadcast
                       .withColumn("EmpName",coalesce($"lookupEmp.name",lit("Unknown - No Name to Lookup")))
                       .withColumn("EmpAge",coalesce($"lookupEmp.age",lit("Unknown - No Age to Lookup")))
                       .drop("lookupEmp")
                       .drop("broadCastLookupEmp")

+------+--------+-----+---------------------------+--------------------------+
|DeptID|DeptName|EmpID|EmpName                    |EmpAge                    |
+------+--------+-----+---------------------------+--------------------------+
|1     |Admin   |1    |John                       |32                        |
|2     |HR      |2    |David                      |45                        |
|3     |Finance |4    |Unknown - No Name to Lookup|Unknown - No Age to Lookup|
+------+--------+-----+---------------------------+--------------------------+

In the above scenario, is it advisable to use broadcast, or is it overkill? Please advise.

When used like this, broadcasting has no value at all.

When you call:

lookup(broadCastLookupMap.value)($"EmpID")

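broadCastLookupMap.value is evaluated locally on the driver, before the UDF is even created, because Scala evaluates ordinary (by-value) arguments at the call site. The UDF closure therefore captures the plain Map, not the broadcast handle, and the Map is shipped with the closure just as in the non-broadcast version.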
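The evaluation order can be checked without Spark. The following is only a minimal sketch: `Handle` is a hypothetical stand-in for `Broadcast` that counts how often `.value` is read, and `lookupByValue` / `lookupByHandle` are illustrative names, not Spark APIs. It shows that a by-value argument is read once at the call site, whereas passing the handle itself defers the read to each invocation of the returned function.

```scala
// Hypothetical stand-in for Spark's Broadcast: counts how often `.value` is read.
final class Handle[T](payload: T) {
  var reads: Int = 0
  def value: T = { reads += 1; payload }
}

object EagerValueDemo {
  // The argument is a plain Map passed by value, so it is evaluated at the call site.
  def lookupByValue(m: Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  // The handle itself is passed; `.value` is read inside the returned function.
  def lookupByHandle(h: Handle[Map[Int, String]]): Int => Option[String] =
    (id: Int) => h.value.get(id)

  def main(args: Array[String]): Unit = {
    val eager = new Handle(Map(1 -> "John", 2 -> "David"))
    val f = lookupByValue(eager.value) // `.value` is read here, exactly once
    println(eager.reads)               // 1
    f(1); f(2); f(4)
    println(eager.reads)               // still 1: the closure only holds the Map

    val deferred = new Handle(Map(1 -> "John", 2 -> "David"))
    val g = lookupByHandle(deferred)   // nothing is read yet
    println(deferred.reads)            // 0
    g(1); g(2); g(4)
    println(deferred.reads)            // 3: one read per invocation
  }
}
```

In Spark terms, the first closure carries the whole Map with it, while the second carries only the handle and resolves the data where the function actually runs.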

A correct implementation would be:

def lookup(lookupMap: Broadcast[Map[Int, EmpDim]]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.value.lift(empID)
)

and called:

lookup(broadCastLookupMap)($"EmpID")

which might have some positive impact, depending on the actual execution plan. The same rules apply in local and non-local mode:

  • If data is reused between stages (explicitly or implicitly) broadcasting can be useful.
  • If data is used only once in the pipeline, standard closure / argument processing mechanism is enough.

Nothing here suggests the first case, so the broadcast should be unnecessary. If you want to be sure, test both solutions in a live environment and compare the results.

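Passing the argument by name should work too, because broadCastLookupMap.value is then evaluated inside the UDF rather than at the call site: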

def lookup(lookupMap: => Map[Int,EmpDim]) = udf[Option[EmpDim],Int](
  (empID:Int) => lookupMap.lift(empID)
)
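To see what the by-name parameter changes, here is a small Spark-free sketch. The `source` method is a hypothetical stand-in for `broadCastLookupMap.value`, and the `evaluations` counter exists only for illustration. The expression passed to `lookup` is not evaluated when `lookup` is called, but each time the returned function uses it.

```scala
object ByNameDemo {
  // Counts how many times the by-name argument is evaluated (illustration only).
  var evaluations: Int = 0

  // Hypothetical data source standing in for `broadCastLookupMap.value`.
  def source: Map[Int, String] = {
    evaluations += 1
    Map(1 -> "John", 2 -> "David")
  }

  // `m` is by-name: the expression passed in is re-evaluated on every use of `m`.
  def lookup(m: => Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  def main(args: Array[String]): Unit = {
    val f = lookup(source) // `source` is not evaluated here
    println(evaluations)   // 0
    println(f(1))          // Some(John)
    println(f(4))          // None
    println(evaluations)   // 2: evaluated once per invocation
  }
}
```

With a real broadcast variable, the repeated evaluation should be cheap, since as far as I know `.value` is cached after its first read on each executor. It is still worth measuring if the UDF runs over many rows.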
