How do I pass configuration from the driver to the executors in Spark?

Question

For example:

object App {

  var confValue: String = ""

  def main(args: Array[String]): Unit = {
    // set conf by cmd args
    confValue = args.head
    // do some context init
    val dataset: Dataset[Int] = ???
    dataset.foreach { row =>
      // get conf from executor
      println(confValue)
    }
  }
}

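I want to read the conf on the executors, but this does not work, because confValue is only modified on the driver.

I know I can pass confValue to the executors through a local variable, like this: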

  def main(args: Array[String]): Unit = {
    // set conf by cmd args
    val confValue = args(0)
    // do some context init
    val dataset: Dataset[Int] = ???
    dataset.foreach { row =>
      // get conf from executor
      println(confValue)
    }
  }

But my Spark job is huge and has a great many functions, so I can't pass confValue around everywhere as a local variable. For example:

  def main(args: Array[String]): Unit = {
    // set conf by cmd args
    val confValue = args(0)
    // do some context init
    val dataset: Dataset[Int] = ???
    dataset.foreach { row =>
      doSomeLogic(row)
    }
  }

  private def doSomeLogic(row: Int): Unit = {
    // get conf from executor
    println(confValue)
  }

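There are many functions like doSomeLogic, so I can't pass confValue to all of them. Is there a way to make confValue available on every executor automatically?

Update 1

My Spark code may look like this: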

object App {

  /** env flag, will be inited by cmd args, and be used in executors */
  var env: String = ""
  val spark: SparkSession = ???

  import spark.implicits._

  def main(args: Array[String]): Unit = {
    // read env from args
    env = args.head

    var ds: Dataset[Int] = ???
    ds = doLogic1(ds)
    ds = doLogic2(ds)
    doLogic3(ds)
  }

  private def doLogic1(ds: Dataset[Int]): Dataset[Int] = {
    ds.map { row =>
      // env will be used here
      ???
    }
  }

  private def doLogic2(ds: Dataset[Int]): Dataset[Int] = {
    ds.map { row =>
      // env will be used here
      ???
    }
  }

  private def doLogic3(ds: Dataset[Int]): Dataset[Int] = {
    ds.map { row =>
      // env will be used here
      ???
    }
  }
}

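env is initialized in main and used in some of the doLogicN functions. My Spark project is large and has many doLogicN functions, so passing the env flag to each of them as a parameter would change too much code.

What is the simplest way to make the env flag available to all doLogicN functions?

The hardest part is that env is used on the executors. If it were only used on the driver, I could reach it from anywhere through the global env variable. But that does not work on the executors, because the global env variable is never initialized there; it is only set on the driver side.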

Answer 1

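You can broadcast the value to all executors and then use it wherever you need it. Also, if you want to process the data partition by partition, you should use foreachPartition rather than foreach.

Here is sample code that broadcasts a value and uses it: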

// Assumes a SparkSession named spark, as in spark-shell
import org.apache.spark.sql.Row
import spark.implicits._

// Sample data
val df = Seq(("a","2020-01-16 08:55:50"),("b","2020-01-16 08:57:37"),("c","2020-01-16 09:00:13"),("d","2020-01-16 09:01:32"),("e","2020-01-16 09:03:32"),("f","2020-01-16 09:06:56")).toDF("ID","timestamp")
// Check how many partitions the DataFrame has
df.rdd.partitions.length
// Broadcast the value from the driver
val confValue = "Test"
val bdct_confvalue = spark.sparkContext.broadcast(confValue)
// Read the broadcast value on the executors, once per partition
// (the explicit parameter type avoids overload ambiguity on Scala 2.12)
df.foreachPartition { (partition: Iterator[Row]) =>
  println("Confvalue partition = " + bdct_confvalue.value)
}

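Also, to see the printed values you have to look at the executor logs rather than the driver logs, because this print statement does not show up in the driver log. You also won't see it in a notebook such as Jupyter or Databricks, since those show the driver log in the UI.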

Answer 2

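I found a solution to my problem.

The value can be set through the Spark conf when submitting the job, for example spark.my.env=env_1.

It can then be read with SparkEnv.get.conf.get("spark.my.env") once the SparkContext has been initialized, and it gives the same result on the driver and on the executors.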
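
A minimal sketch of how this could fit the doLogicN layout from the question. The names EnvConf, EnvApp and tagRows are hypothetical and only for illustration, and the key spark.my.env is the one from this answer. Note that spark-submit ignores --conf keys that do not start with "spark.", so the prefix matters.

```scala
import org.apache.spark.SparkEnv
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical helper: a single place that reads the flag on whichever JVM calls it.
object EnvConf {
  // Key taken from the answer above; it must start with "spark." to be forwarded.
  val Key = "spark.my.env"

  // Evaluated once per JVM on first access, so the driver and each executor
  // read the value from their own copy of the SparkConf.
  lazy val env: String = SparkEnv.get.conf.get(Key)
}

object EnvApp {
  // No env parameter is needed: the closure calls EnvConf.env where the task runs.
  def tagRows(ds: Dataset[Int]): Dataset[String] =
    ds.map((row: Int) => s"${EnvConf.env}:$row")(Encoders.STRING)

  def main(args: Array[String]): Unit = {
    // Submitted as, for example:
    //   spark-submit --conf spark.my.env=env_1 --class EnvApp <your-jar>
    val spark = SparkSession.builder().appName("env-conf-sketch").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
    println(s"driver sees: ${EnvConf.env}")
    tagRows(ds).collect().foreach(println)
    spark.stop()
  }
}
```

If the key is not set, SparkConf.get(key) throws a NoSuchElementException; the two-argument form conf.get(key, defaultValue) can be used instead when a fallback makes sense. SparkEnv is marked as a developer API, so this approach may need revisiting on Spark upgrades.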

2 answers

Answer 1: 0, 2021-02-03 04:19:20
Answer 2: 0, accepted, 2021-02-04 05:11:22
