
Is it possible to execute a command on all workers within Apache Spark?

I have a situation where I want to execute a system process on each worker within Spark. I want this process to run on each machine once. Specifically, this process starts a daemon which needs to be running before the rest of my program executes. Ideally this should happen before I've read any data in.

I'm on Spark 2.0.2 and using dynamic allocation.

You may be able to achieve this with a combination of a lazy val and a Spark broadcast. It will be something like below. (I have not compiled the code below; you may have to change a few things.)

object ProcessManager {
  // Lazy initialization runs this at most once per JVM, on first access.
  lazy val start: Unit = {
    // start your process here.
  }
}

You can broadcast this object at the start of your application, before you do any transformations.

val pm = sc.broadcast(ProcessManager)

Now you can access this object inside your transformations like any other broadcast variable, and invoke the lazy val.

rdd.mapPartitions { itr =>
  pm.value.start
  // Other stuff here.
  itr
}
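The reason this starts the process at most once per executor is that a Scala lazy val is initialized exactly once per JVM, even when several tasks hit it concurrently. A minimal, Spark-free sketch of that guarantee (the counter and the thread simulation are illustrative, not part of the original answer):

```scala
import java.util.concurrent.atomic.AtomicInteger

object ProcessManagerDemo {
  val initCount = new AtomicInteger(0)

  // Initialized at most once per JVM, even under concurrent access;
  // incrementing the counter stands in for launching the daemon.
  lazy val start: Unit = {
    initCount.incrementAndGet()
    ()
  }
}

object LazyOnceDemo extends App {
  // Simulate several Spark tasks running concurrently in one executor JVM.
  val tasks = (1 to 8).map(_ => new Thread(() => ProcessManagerDemo.start))
  tasks.foreach(_.start())
  tasks.foreach(_.join())
  println(ProcessManagerDemo.initCount.get()) // prints 1
}
```

Because the initialization is synchronized on the enclosing object, all eight threads observe a single launch; the same holds for Spark tasks sharing one executor JVM.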

An object with static initialization which invokes your system process should do the trick.

object SparkStandIn extends App {
  object invokeSystemProcess {
    import sys.process._
    val errorCode = "echo Whatever you put in this object should be executed once per jvm".!

    def doIt(): Unit = {
      // This object is constructed once per JVM, but Scala objects are
      // initialized lazily, so call doIt() to force initialization.
      // Another way to make sure initialization happened is to check
      // that errorCode does not represent an error.
    }
  }
  invokeSystemProcess.doIt()
  invokeSystemProcess.doIt() // even if doIt is invoked multiple times, the static initialization happens once
}
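The comment above hints at checking errorCode: in scala.sys.process, `!` runs the command and returns its exit status, so a small guard can verify the startup command actually succeeded. A sketch of that idea (the object names and the `true` stand-in command are illustrative):

```scala
import scala.sys.process._

object GuardedStartup {
  // `!` runs the command and returns its exit code (0 means success).
  // "true" is a stand-in for your real daemon-start command.
  lazy val errorCode: Int = Seq("bash", "-c", "true").!

  def ensureStarted(): Unit =
    require(errorCode == 0, s"startup command failed with exit code $errorCode")
}

object GuardDemo extends App {
  GuardedStartup.ensureStarted() // throws IllegalArgumentException on failure
  println(GuardedStartup.errorCode) // prints 0
}
```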

A specific answer for a specific use case: I have a cluster with 50 nodes and I wanted to know which ones have the CET timezone set:

(1 until 100).toSeq.toDS.  // requires `import spark.implicits._`
  mapPartitions { itr =>
    sys.process.Process(
      Seq("bash", "-c", "echo $(hostname && date)")
    ).lines.toIterator
  }.
  collect().
  filter(_.contains(" CET ")).
  distinct.
  sorted.
  foreach(println)

Note that I don't think it's guaranteed you'll get a partition on every node, so the command might not run on every node, even when using a 100-element Dataset in a cluster with 50 nodes as in the previous example.
