
Remotely execute a Spark job on an HDInsight cluster

I am trying to automatically launch a Spark job on an HDInsight cluster from Microsoft Azure. I am aware that several methods exist to automate Hadoop job submission (provided by Azure itself), but so far I have not been able to find a way to remotely run a Spark job without setting up an RDP connection to the master instance.

Is there any way to achieve this?

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.

https://github.com/spark-jobserver/spark-jobserver

My solution uses both a scheduler and Spark-jobserver to launch the Spark job periodically.
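
For illustration, a minimal sketch of the two REST calls involved, written in PowerShell and assuming a spark-jobserver instance listening on its default port 8090; the host, application name (myApp) and job class are placeholders, and spark-jobserver itself is not part of HDInsight:

# Hypothetical spark-jobserver host
$jobServer = "http://myJobServerHost:8090"

# 1) Upload the application jar under an application name
Invoke-RestMethod -Method Post -Uri "$jobServer/jars/myApp" -InFile "mySpark.jar"

# 2) Launch a job from the uploaded jar by naming its main class
Invoke-RestMethod -Method Post -Uri "$jobServer/jobs?appName=myApp&classPath=spark.azure.MainClass"

A scheduler then only needs to repeat the second call to run the job periodically.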

At the moment of this writing, it seems there is no official way of achieving this. So far, however, I have been able to remotely run Spark jobs using an Oozie shell workflow. It is nothing more than a workaround, but so far it has been useful for me. These are the steps I have followed:

Prerequisites

  • Microsoft PowerShell
  • Azure PowerShell

Process

Define an Oozie workflow .xml file:

<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
    <start to="myAction"/>
    <action name="myAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>myScript.cmd</exec>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

Note that it is not possible to know in advance on which HDInsight node the script is going to be executed, so it is necessary to put it, along with the Spark application .jar, in the wasb repository. Both files are then copied to the local directory in which the Oozie job is executing.

Define the custom script

C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass ^
                                          --master yarn-cluster ^
                                          --deploy-mode cluster ^
                                          --num-executors 3 ^
                                          --executor-memory 2g ^
                                          --executor-cores 4 ^
                                          mySpark.jar

It is necessary to upload both the .cmd and the Spark .jar to the wasb repository (a process that is not included in this answer), concretely to the location pointed to in the workflow:

wasb://myContainer@myAccount.blob.core.windows.net/
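
As a sketch of that upload step, assuming the classic Azure PowerShell storage cmdlets and placeholder storage account credentials, it could be done like this:

# Placeholder account name and key; the container must match the one used in the workflow
$ctx = New-AzureStorageContext -StorageAccountName "myAccount" -StorageAccountKey "<storage-key>"
Set-AzureStorageBlobContent -File ".\myScript.cmd" -Container "myContainer" -Blob "myScript.cmd" -Context $ctx
Set-AzureStorageBlobContent -File ".\mySpark.jar" -Container "myContainer" -Blob "mySpark.jar" -Context $ctx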

Define the PowerShell script

The PowerShell script is very much taken from the official Oozie on HDInsight tutorial. I am not including the script in this answer because it is almost identical to the one in that tutorial.
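
For completeness, a condensed sketch of the submission step of such a script, assuming the Oozie REST endpoint exposed through the cluster gateway; the cluster name, credentials, jobTracker/nameNode values and the workflow directory are placeholders and may differ on other cluster versions:

$clusterName = "myCluster"
$creds = Get-Credential            # HTTP user of the HDInsight cluster

# Configuration properties referenced by the workflow above (placeholder values);
# the workflow .xml is assumed to have been uploaded to the application path below
$oozieConfig = @"
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property><name>nameNode</name><value>wasb://myContainer@myAccount.blob.core.windows.net</value></property>
    <property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>
    <property><name>queueName</name><value>default</value></property>
    <property><name>user.name</name><value>admin</value></property>
    <property><name>oozie.use.system.libpath</name><value>true</value></property>
    <property><name>oozie.wf.application.path</name><value>wasb://myContainer@myAccount.blob.core.windows.net/myWorkflowDir</value></property>
</configuration>
"@

# Submit and start the workflow through the Oozie web service
$job = Invoke-RestMethod -Method Post `
    -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/jobs?action=start" `
    -Credential $creds -Body $oozieConfig -ContentType "application/xml"
$job.id                            # Oozie workflow job id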

I have made a new suggestion on the Azure feedback portal indicating the need for official support for remote Spark job submission.

Updated on 8/17/2016: Our Spark cluster offering now includes a Livy server that provides a REST service to submit a Spark job. You can automate Spark jobs via Azure Data Factory as well.
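
A minimal sketch of such a Livy batch submission from PowerShell, reusing the placeholder jar and class names from earlier in this thread; the cluster name and credentials are assumptions:

$clusterName = "myCluster"
$creds = Get-Credential            # HTTP user of the HDInsight cluster

# Describe the batch: the application jar in blob storage and its main class
$body = @{
    file      = "wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar"
    className = "spark.azure.MainClass"
} | ConvertTo-Json

# POST to the cluster's Livy endpoint; the response contains the batch id and state
Invoke-RestMethod -Method Post `
    -Uri "https://$clusterName.azurehdinsight.net/livy/batches" `
    -Credential $creds -Body $body -ContentType "application/json"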


Original post: 1) Remote job submission for Spark is currently not supported.

2) If you want to set the master automatically every time (i.e. avoid adding --master yarn-client every time you execute), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following config:

spark.master yarn-client

You can find more info on spark-defaults.conf on the Apache Spark website.

3) Use the cluster customization feature if you want to add this automatically to the spark-defaults.conf file at deployment time.
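
As an illustration only, a script run through cluster customization could append the setting with something as small as the following; the Spark installation path is an assumption based on the spark-submit path shown in another answer here:

# Hypothetical customization script: append the default master to spark-defaults.conf
$sparkConf = "C:\apps\dist\spark-1.2.0\conf\spark-defaults.conf"
Add-Content -Path $sparkConf -Value "spark.master yarn-client"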
