Broadcasting updates on Spark jobs

I've seen this question asked here, but the answers focus on Spark Streaming, and I can't find a proper solution for batch jobs. The idea is to loop through several days; at each iteration (day), the job updates the information about the previous day, which is then used for the current iteration. The code looks like the following:

var prevIterDataRdd = // some RDD

days.foreach(folder => {
  val previousData : Map[String, Double] = parseResult(prevIterDataRdd)
  val broadcastMap = sc.broadcast(previousData)

  val (result, previousStatus) =
    processFolder(folder, broadcastMap)

  // store result
  result.write.csv(outputPath)

  // updating the RDD that enables me to extract previousData to update broadcast
  val passingPrevStatus = prevIterDataRdd.subtractByKey(previousStatus)
  prevIterDataRdd = previousStatus.union(passingPrevStatus)

  broadcastMap.unpersist(true)
  broadcastMap.destroy()
})

With broadcastMap.destroy() the job does not run, because Spark then does not let me use broadcastMap again (which I don't understand, since each iteration's broadcast should be totally unrelated and immutable).

How should I run this loop and update the broadcast variable at each iteration?

When calling unpersist, I pass true to make it blocking. Is sc.broadcast() also blocking?

Do I really need unpersist() if I'm broadcasting again anyway?

Why can't I use the broadcast again after calling destroy, given that I'm creating a new broadcast variable?

Broadcast variables are immutable, but you can create a new broadcast variable and use it in the next iteration.

All you need to do is change the reference to point to the newly created broadcast, unpersist the old broadcast from the executors, and destroy it from the driver.

Define the variable at class level, which allows you to change the reference to the broadcast variable in the driver and call the destroy method.

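import org.apache.spark.broadcast.Broadcast

object Main {

  // Defined at class level so the reference can be swapped on each iteration.
  // It stays null until the first iteration assigns a broadcast to it.
  var broadcastMap: Broadcast[Map[String, Double]] = null

  def main(args: Array[String]): Unit = {
    // your code
  }
}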
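Putting the steps above together, the loop could be sketched roughly as follows. This is only an illustration under assumptions, not the asker's real job: `processDay`, the keys "a" and "b", and the day names are hypothetical stand-ins for processFolder and the folders, and the context runs in local mode. It uses a fresh local `val` per iteration instead of a class-level `var`; the important part is that every iteration gets its own broadcast and releases it only after its job has finished.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast

object BroadcastLoopSketch {

  // Hypothetical stand-in for processFolder: derives the new per-key state
  // from the previous day's state held in the broadcast.
  def processDay(
      sc: SparkContext,
      previous: Broadcast[Map[String, Double]]
  ): Map[String, Double] = {
    val keys = sc.parallelize(Seq("a", "b"))
    keys
      .map(k => k -> (previous.value.getOrElse(k, 0.0) + 1.0))
      .collect() // action: the job that reads the broadcast finishes here
      .toMap
  }

  def runLoop(sc: SparkContext, days: Seq[String]): Map[String, Double] = {
    var state: Map[String, Double] = Map.empty
    days.foreach { _ =>
      // A fresh broadcast per iteration; an old one is never reused.
      val current = sc.broadcast(state)
      state = processDay(sc, current)
      // Safe to release: collect() above has already completed.
      current.unpersist(blocking = true)
      current.destroy()
    }
    state
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-loop").setMaster("local[2]"))
    try println(runLoop(sc, Seq("day1", "day2", "day3")))
    finally sc.stop()
  }
}
```

Here the state carried between iterations is a plain driver-side Map, so nothing from an earlier iteration still refers to a destroyed broadcast.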

You were not allowed to use the destroy method on the variable because the reference no longer exists in the driver. Changing the reference to the new broadcast variable can resolve the issue.
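Another thing that may be worth checking (an assumption, since the question does not show processFolder): RDD transformations are lazy. If previousStatus is built with broadcastMap and is only evaluated by the next iteration's parseResult, the job would start after destroy() has already been called. The sketch below, with hypothetical names and a local-mode context, shows that ordering and the reverse one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object LazyBroadcastUse {

  // Returns true if running the job after destroy() fails.
  def destroyBeforeAction(sc: SparkContext): Boolean = {
    val bc = sc.broadcast(Map("a" -> 1.0))
    // map() is lazy: no task has read bc yet.
    val pending = sc.parallelize(Seq("a")).map(k => bc.value(k))
    bc.destroy()
    // The job only starts here, after the broadcast has been destroyed.
    Try(pending.collect()).isFailure
  }

  // Runs the action first, then releases the broadcast.
  def actionBeforeDestroy(sc: SparkContext): Seq[Double] = {
    val bc = sc.broadcast(Map("a" -> 1.0))
    val result = sc.parallelize(Seq("a")).map(k => bc.value(k)).collect().toSeq
    bc.unpersist(blocking = true)
    bc.destroy()
    result
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lazy-broadcast").setMaster("local[2]"))
    try {
      println(destroyBeforeAction(sc))
      println(actionBeforeDestroy(sc))
    } finally sc.stop()
  }
}
```

If this is what happens in the question's loop, collecting the result into a driver-side value before calling destroy(), or calling only unpersist(), would avoid the error.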

unpersist only removes the data from the executors, so when the variable is accessed again, the driver resends it to the executors.

blocking = true makes the call wait until the data has been completely removed from the executors, so the removal is finished before the next access.
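The difference between the two calls can be seen in a small sketch. The object and method names are hypothetical and the context runs in local mode: after unpersist the broadcast is still usable, while after destroy any access to it fails.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

object UnpersistVsDestroy {

  // Returns (sum computed after unpersist, whether access after destroy fails).
  def demo(sc: SparkContext): (Double, Boolean) = {
    val bc = sc.broadcast(Map("a" -> 1.0))
    val rdd = sc.parallelize(1 to 4)

    // First job: the executors fetch the broadcast data.
    rdd.map(_ => bc.value("a")).count()

    // Drops the executor copies and waits until that is done.
    bc.unpersist(blocking = true)

    // Still valid: the data is sent to the executors again on demand.
    val sumAfterUnpersist = rdd.map(_ => bc.value("a")).sum()

    // Releases everything; the broadcast cannot be used afterwards.
    bc.destroy()
    val failsAfterDestroy = Try(bc.value).isFailure

    (sumAfterUnpersist, failsAfterDestroy)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("unpersist-vs-destroy").setMaster("local[2]"))
    try println(demo(sc))
    finally sc.stop()
  }
}
```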

sc.broadcast(): no official documentation says that it is blocking. However, as soon as it is called, the application starts broadcasting the data to the executors before the next line of code runs, so very large data may slow down your application. Be careful about how you use it.

It is good practice to call unpersist before destroy. This helps you remove the data completely from both the executors and the driver.
