
Remotely execute a Spark job on an HDInsight cluster

I am trying to automatically launch a Spark job on an HDInsight cluster from Microsoft Azure. I am aware that several methods exist to automate Hadoop job submission (provided by Azure itself), but so far I have not been able to find a way to remotely run a Spark job without setting up an RDP connection to the master instance.

Is there any way to achieve this?

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.

https://github.com/spark-jobserver/spark-jobserver

My solution uses both a scheduler and Spark-jobserver to launch the Spark job periodically.
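
For illustration, a minimal sketch of the two REST calls involved, written in PowerShell and assuming a spark-jobserver instance listening on its default port 8090; the host, application name (myApp) and job class are placeholders, and spark-jobserver itself is not part of HDInsight:

# Hypothetical spark-jobserver host
$jobServer = "http://myJobServerHost:8090"

# 1) Upload the application jar under an application name
Invoke-RestMethod -Method Post -Uri "$jobServer/jars/myApp" -InFile "mySpark.jar"

# 2) Launch a job from the uploaded jar by naming its main class
Invoke-RestMethod -Method Post -Uri "$jobServer/jobs?appName=myApp&classPath=spark.azure.MainClass"

A scheduler then only needs to repeat the second call to run the job periodically.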

At the moment of this writing, it seems there is no official way of achieving this. So far, however, I have been able to remotely run Spark jobs using an Oozie shell workflow. It is nothing more than a workaround, but so far it has been useful for me. These are the steps I have followed:

Prerequisites

  • Microsoft PowerShell
  • Azure PowerShell

Process

Define an Oozie workflow .xml file:

<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
    <start to="myAction"/>
    <action name="myAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>myScript.cmd</exec>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that it is not possible to know in advance on which HDInsight node the script is going to be executed, so it is necessary to put it, along with the Spark application .jar, in the wasb repository. Both files are then copied to the local directory in which the Oozie job is executing.

Define the custom script

C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
                                          --master yarn-cluster ^
                                          --deploy-mode cluster ^
                                          --num-executors 3 ^
                                          --executor-memory 2g ^
                                          --executor-cores 4 ^
                                          mySpark.jar

It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process that is not included in this answer), concretely to the location pointed to in the workflow:

wasb://myContainer@myAccount.blob.core.windows.net/
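
As a sketch of that upload step, assuming the classic Azure PowerShell storage cmdlets and placeholder storage account credentials, it could be done like this:

# Placeholder account name and key; the container must match the one used in the workflow
$ctx = New-AzureStorageContext -StorageAccountName "myAccount" -StorageAccountKey "<storage-key>"
Set-AzureStorageBlobContent -File ".\myScript.cmd" -Container "myContainer" -Blob "myScript.cmd" -Context $ctx
Set-AzureStorageBlobContent -File ".\mySpark.jar" -Container "myContainer" -Blob "mySpark.jar" -Context $ctx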

Define the PowerShell script

The PowerShell script is very much taken from the official Oozie on HDInsight tutorial. I am not including the script in this answer because it is almost identical to the one in that tutorial.
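
For completeness, a condensed sketch of the submission step of such a script, assuming the Oozie REST endpoint exposed through the cluster gateway; the cluster name, credentials, jobTracker/nameNode values and the workflow directory are placeholders and may differ on other cluster versions:

$clusterName = "myCluster"
$creds = Get-Credential            # HTTP user of the HDInsight cluster

# Configuration properties referenced by the workflow above (placeholder values);
# the workflow .xml is assumed to have been uploaded to the application path below
$oozieConfig = @"
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property><name>nameNode</name><value>wasb://myContainer@myAccount.blob.core.windows.net</value></property>
    <property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>
    <property><name>queueName</name><value>default</value></property>
    <property><name>user.name</name><value>admin</value></property>
    <property><name>oozie.use.system.libpath</name><value>true</value></property>
    <property><name>oozie.wf.application.path</name><value>wasb://myContainer@myAccount.blob.core.windows.net/myWorkflowDir</value></property>
</configuration>
"@

# Submit and start the workflow through the Oozie web service
$job = Invoke-RestMethod -Method Post `
    -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/jobs?action=start" `
    -Credential $creds -Body $oozieConfig -ContentType "application/xml"
$job.id                            # Oozie workflow job id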

I have made a new suggestion on the Azure feedback portal indicating the need for official support for remote Spark job submission.

Updated on 8/17/2016: Our Spark cluster offering now includes a Livy server that provides a REST service to submit a Spark job. You can automate Spark jobs via Azure Data Factory as well.
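
A minimal sketch of such a Livy batch submission from PowerShell, reusing the placeholder jar and class names from earlier in this thread; the cluster name and credentials are assumptions:

$clusterName = "myCluster"
$creds = Get-Credential            # HTTP user of the HDInsight cluster

# Describe the batch: the application jar in blob storage and its main class
$body = @{
    file      = "wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar"
    className = "spark.azure.MainClass"
} | ConvertTo-Json

# POST to the cluster's Livy endpoint; the response contains the batch id and state
Invoke-RestMethod -Method Post `
    -Uri "https://$clusterName.azurehdinsight.net/livy/batches" `
    -Credential $creds -Body $body -ContentType "application/json"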


Original post: 1) Remote job submission for Spark is currently not supported.

2) If you want to set the master automatically every time (i.e. avoid adding --master yarn-client every time you execute), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following config:

spark.master yarn-client

You can find more info on spark-defaults.conf on the Apache Spark website.

3) Use the cluster customization feature if you want to add this automatically to the spark-defaults.conf file at deployment time.
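
As an illustration only, a script run through cluster customization could append the setting with something as small as the following; the Spark installation path is an assumption based on the spark-submit path shown in another answer here:

# Hypothetical customization script: append the default master to spark-defaults.conf
$sparkConf = "C:\apps\dist\spark-1.2.0\conf\spark-defaults.conf"
Add-Content -Path $sparkConf -Value "spark.master yarn-client"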
