
Broadcasting updates in Spark jobs

I've seen this question asked here, but the answers focus on Spark Streaming and I can't find a solution that works for a batch job. The idea is to loop over several days; at each iteration/day the job updates the information from the previous day, which the current iteration then uses. The code looks like the following:

var prevIterDataRdd = // some RDD

days.foreach(folder => {
  val previousData : Map[String, Double] = parseResult(prevIterDataRdd)
  val broadcastMap = sc.broadcast(previousData)

  val (result, previousStatus) =
    processFolder(folder, broadcastMap)

  // store result
  result.write.csv(outputPath)

  // updating the RDD that enables me to extract previousData to update broadcast
  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus)

  broadcastMap.unpersist(true)
  broadcastMap.destroy()
})

Calling broadcastMap.destroy() fails because it does not let me use broadcastMap again in the next iteration (which I don't understand, since I create a new broadcast variable each iteration and broadcast variables are immutable, so they should be totally unrelated).

How should I run this loop and update the broadcast variable at each iteration?

When calling unpersist I pass true to make it blocking. Is sc.broadcast() also blocking?

Do I really need unpersist() if I'm broadcasting again anyway?

Why can't I use the broadcast again after calling destroy, given that I'm creating a new broadcast variable each time?

Broadcast variables are immutable, but you can create a new broadcast variable and use it in the next iteration.

All you need to do is change the reference to point to the newly created broadcast, unpersist the old broadcast on the executors, and destroy it from the driver.

Define the variable at class level; this lets you change the broadcast reference in the driver and safely call the destroy method.

object Main {

  // defined and initialized at class level to allow the reference to change
  var previousData: Map[String, Double] = _

  def main(args: Array[String]): Unit = {
    // your code
  }
}
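Applied to the loop in the question, the pattern looks roughly like this. This is a sketch, not a definitive implementation: `days`, `parseResult`, `processFolder`, and `outputPath` are the question's own placeholders, and the RDD initialization is elided as in the original.

```scala
import org.apache.spark.broadcast.Broadcast

var prevIterDataRdd = ??? // some RDD, initialized as in the question
// class-level reference, so the old broadcast can be destroyed each iteration
var broadcastMap: Broadcast[Map[String, Double]] = null

days.foreach { folder =>
  // release the previous iteration's broadcast BEFORE creating the new one
  if (broadcastMap != null) {
    broadcastMap.unpersist(blocking = true) // remove copies from the executors
    broadcastMap.destroy()                  // release driver-side resources
  }

  val previousData: Map[String, Double] = parseResult(prevIterDataRdd)
  broadcastMap = sc.broadcast(previousData) // reassign the class-level reference

  val (result, previousStatus) = processFolder(folder, broadcastMap)
  result.write.csv(outputPath)

  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus)
}
```

The key difference from the question's version is that `destroy()` is called on the old broadcast before the new one is created, so a valid broadcast reference always exists for the rest of the iteration.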

You were not able to use the variable after calling destroy because the driver's reference still pointed at the destroyed broadcast. Reassigning the reference to the new broadcast variable resolves the issue.

Unpersist only removes the data from the executors, so when the variable is accessed again, the driver resends it to the executors.

blocking = true makes the call wait until the data has been completely removed from the executors before the application continues.

sc.broadcast() - there is no official documentation saying that it is blocking. However, as soon as it is called, the application starts shipping the data to the executors before running the next line of code, so if the data is very large it may slow down your application. Be careful how you use it.

It is good practice to call unpersist before destroy. This removes the data completely, first from the executors and then from the driver.
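The teardown order at the end of an iteration can be sketched as follows (assuming `broadcastMap` is a class-level `var` and `newData` is the map computed for the next iteration; both names are illustrative):

```scala
broadcastMap.unpersist(blocking = true) // 1. remove cached copies from the executors
broadcastMap.destroy()                  // 2. then release all resources on the driver
broadcastMap = sc.broadcast(newData)    // 3. reassign the reference for the next iteration
```

Step 3 is what makes the loop work: after destroy, the old Broadcast object is unusable, so the reference must point at a freshly created one before it is read again.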
