I know that broadcasting keeps a read-only copy cached on each machine rather than shipping a copy of it with tasks. But I would like to know whether broadcasting has any significant impact in local mode, since I don't have a cluster of nodes. Or is it fine to go without broadcast in local mode? I'm just trying to understand its usage.
Spark version 2.0, Scala version 2.10, local mode, 8-core CPU, 64 GB RAM
I have something like below:
case class EmpDim(name: String,age: Int)
empDF
+-----+-------+------+
|EmpId|EmpName|EmpAge|
+-----+-------+------+
| 1| John| 32|
| 2| David| 45|
+-----+-------+------+
deptDF
+------+--------+-----+
|DeptID|DeptName|EmpID|
+------+--------+-----+
| 1| Admin| 1|
| 2| HR| 2|
| 3| Finance| 4|
+------+--------+-----+
val empRDD = empDF.rdd.map(x => (x.getInt(0), EmpDim(x.getString(1), x.getInt(2))))
val lookupMap = empRDD.collectAsMap() // Without broadcast
val broadCastLookupMap: Broadcast[Map[Int, EmpDim]] = sc.broadcast(empRDD.collectAsMap()) // With broadcast
def lookup(lookupMap: Map[Int, EmpDim]) = udf[Option[EmpDim], Int]((empID: Int) => lookupMap.lift(empID))
val combinedDF = deptDF.withColumn("lookupEmp", lookup(lookupMap)($"EmpID")) // Without broadcast
  .withColumn("broadCastLookupEmp", lookup(broadCastLookupMap.value)($"EmpID")) // With broadcast
  .withColumn("EmpName", coalesce($"lookupEmp.name", lit("Unknown - No Name to Lookup")))
  .withColumn("EmpAge", coalesce($"lookupEmp.age", lit("Unknown - No Age to Lookup")))
  .drop("lookupEmp")
  .drop("broadCastLookupEmp")
+------+--------+-----+---------------------------+--------------------------+
|DeptID|DeptName|EmpID|EmpName |EmpAge |
+------+--------+-----+---------------------------+--------------------------+
|1 |Admin |1 |John |32 |
|2 |HR |2 |David |45 |
|3 |Finance |4 |Unknown - No Name to Lookup|Unknown - No Age to Lookup|
+------+--------+-----+---------------------------+--------------------------+
In the above scenario, is it advisable to use broadcast, or is it overkill? Please advise.
When used like this, broadcasting has no value at all.
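When you call:
lookup(broadCastLookupMap.value)($"EmpID")
the argument broadCastLookupMap.value is evaluated locally on the driver before lookup runs, according to Scala's substitution model, so the UDF closure captures the plain Map rather than the broadcast variable.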
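To see the evaluation order in isolation, here is a minimal plain-Scala sketch with no Spark involved. `Handle` is a hypothetical stand-in for a broadcast variable that counts how often `.value` is read, and `String` values replace `EmpDim` for brevity; both are assumptions made for illustration, not Spark APIs.

```scala
// Hypothetical stand-in for a broadcast handle: counts how often .value is read.
final class Handle[T](payload: T) {
  var reads = 0
  def value: T = { reads += 1; payload }
}

object StrictArgDemo {
  // Same shape as the question's lookup: takes a plain Map, returns a closure.
  def lookup(m: Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  // Takes the handle itself; .value is read inside the closure on every call.
  def lookupViaHandle(h: Handle[Map[Int, String]]): Int => Option[String] =
    (id: Int) => h.value.get(id)

  def main(args: Array[String]): Unit = {
    val h1 = new Handle(Map(1 -> "John", 2 -> "David"))
    val f = lookup(h1.value)   // .value is evaluated here, before lookup runs
    println(h1.reads)          // 1
    f(1); f(2); f(4)
    println(h1.reads)          // still 1: the closure only holds the Map

    val h2 = new Handle(Map(1 -> "John", 2 -> "David"))
    val g = lookupViaHandle(h2)
    println(h2.reads)          // 0: nothing read yet
    g(1); g(2); g(4)
    println(h2.reads)          // 3: the closure goes through the handle
  }
}
```

The first closure never touches the handle after it is built, which mirrors why the broadcast in the question brings no benefit.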
A correct implementation would be:
def lookup(lookupMap: Broadcast[Map[Int, EmpDim]]) = udf[Option[EmpDim], Int](
  (empID: Int) => lookupMap.value.lift(empID)
)
and called:
lookup(broadCastLookupMap)($"EmpID")
which might have some positive impact, depending on the actual execution plan. The same rules apply in local and non-local mode.
Nothing here suggests a case where broadcasting would help, so the broadcast is probably unnecessary. If you want to be sure, test both solutions in a live environment and compare the results.
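For the comparison, a small timing helper may be enough. The sketch below is plain Scala; `Timing` and `timed` are hypothetical names, and the workload in `main` is only a placeholder. Since Spark evaluates lazily, the block being timed should be an action (for example `combinedDF.count()`), not just the `withColumn` chain.

```scala
object Timing {
  // Runs `block` once and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload; with Spark this would be an action on each variant.
    val (sum, ms) = timed { (1 to 1000000).map(_.toLong).sum }
    println(s"sum=$sum took $ms ms")
  }
}
```

A single run is noisy, so it is probably worth repeating each variant several times and discarding the first (warm-up) measurement.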
Passing the map by name should work too:
def lookup(lookupMap: => Map[Int, EmpDim]) = udf[Option[EmpDim], Int](
  (empID: Int) => lookupMap.lift(empID)
)
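The by-name variant works because the argument expression is not evaluated when the function is called; it is re-evaluated each time the parameter is used inside the closure. A minimal plain-Scala sketch, where `currentMap` is a hypothetical stand-in for `broadCastLookupMap.value` and `String` values replace `EmpDim`:

```scala
object ByNameDemo {
  var evaluations = 0

  // Hypothetical stand-in for broadCastLookupMap.value: counts each evaluation.
  def currentMap(): Map[Int, String] = {
    evaluations += 1
    Map(1 -> "John", 2 -> "David")
  }

  // By-name parameter: the argument expression is re-evaluated at each use.
  def lookup(m: => Map[Int, String]): Int => Option[String] =
    (id: Int) => m.get(id)

  def main(args: Array[String]): Unit = {
    val f = lookup(currentMap())
    println(evaluations) // 0: nothing is evaluated when the closure is built
    f(1); f(4)
    println(evaluations) // 2: evaluated once per call
  }
}
```

In the Spark case this means `.value` would be read once per row inside the UDF, which should be cheap as long as the broadcast value is cached after its first access on each executor.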