
Launch Python app with spark-submit in AWS EMR

I'm new to Spark and having trouble replicating the example in the EMR docs for submitting a basic user application with spark-submit via the AWS CLI. It seems to run without error but produces no output. Is something wrong with my syntax for add-steps in the workflow below?

Example script

The objective is to count the words in a document in S3; for this example, 1000 words of lorem ipsum:

$ aws s3 cp s3://projects/wordcount/input/some_document.txt - | head -n1
Lorem ipsum dolor sit amet, consectetur adipiscing [... etc.]

Copying from the docs, the Python script looks like:

$ aws s3 cp s3://projects/wordcount/wordcount.py -
from __future__ import print_function
from pyspark import SparkContext
import sys
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: wordcount  ", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()

The destination folder (currently empty):

$ aws s3 ls s3://projects/wordcount/output
                           PRE output/

create-cluster

Working from the docs, I have a running cluster with logging, as produced by:

aws emr create-cluster --name TestSparkCluster \
--release-label emr-5.11.0 --applications Name=Spark \
--enable-debugging --log-uri s3://projects/wordcount/log \
--instance-type m3.xlarge --instance-count 3 --use-default-roles

with a return message showing the created {"ClusterID": "j-XXXXXXXXXXXXX"}
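
Not strictly necessary, but to confirm from the CLI that the cluster is up before adding steps, a quick check (using the ClusterId returned above) could be:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
--query 'Cluster.Status.State'
# expect "WAITING" once the cluster is ready to accept steps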

add-steps

Looking directly at the example, I submit add-steps as:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://projects/wordcount/wordcount.py,\
s3://projects/wordcount/input/some_document.txt,\
s3://projects/wordcount/output/],\
ActionOnFailure=CONTINUE

which launches { "StepIds":["s-YYYYYYYYYYY"] }
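
The step state can also be polled from the CLI rather than the console; a check along these lines (using the StepId returned above) should report PENDING, then RUNNING, then COMPLETED:

aws emr describe-step --cluster-id j-XXXXXXXXXXXXX \
--step-id s-YYYYYYYYYYY --query 'Step.Status.State'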

The problem

The output folder is empty -- why?

I verify that step SparkWordCountApp with ID s-YYYYYYYYYYY has Status:Completed in the EMR console.

From the console, I check the controller log file and the stderr output (below) to verify that the step completed with exit status 0.

In the Spark documentation, a somewhat different syntax is used. Instead of making the script name the first position in the arguments list, it says:

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

However, the example uses sys.argv, where sys.argv[0] is wordcount.py.
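
For reference, the step arguments above reduce to an ordinary spark-submit invocation of roughly the following shape (the controller log below shows the exact command). The .py file sits where <application-jar> would, and the two S3 URIs that follow it arrive in the script as sys.argv[1] and sys.argv[2]:

spark-submit --deploy-mode cluster --master yarn \
--conf spark.yarn.submit.waitAppCompletion=false \
--num-executors 2 --executor-cores 2 --executor-memory 1g \
s3://projects/wordcount/wordcount.py \
s3://projects/wordcount/input/some_document.txt \
s3://projects/wordcount/output/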

Additional info: logs

Controller log file:

2018-01-24T15:54:05.945Z INFO Ensure step 3 jar file command-runner.jar
2018-01-24T15:54:05.945Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-west-2
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-172-31-12-232
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-29XVS3IGMSK1
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1
INFO ProcessRunner started child process 20797 :
hadoop   20797  3347  0 15:54 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
2018-01-24T15:54:09.956Z INFO HadoopJarStepRunner.Runner: startRun() called for s-29XVS3IGMSK1 Child Pid: 20797
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 16 seconds
2018-01-24T15:54:24.072Z INFO Step created jobs: 
2018-01-24T15:54:24.072Z INFO Step succeeded with exitCode 0 and took 16 seconds

Stderr log file:

18/01/24 15:54:12 INFO RMProxy: Connecting to ResourceManager at ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal/XXX.YY.YY.ZZZ:8032
18/01/24 15:54:12 INFO Client: Requesting a new application from cluster with 2 NodeManagers
18/01/24 15:54:12 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
18/01/24 15:54:12 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/01/24 15:54:12 INFO Client: Setting up container launch context for our AM
18/01/24 15:54:12 INFO Client: Setting up the launch environment for our AM container
18/01/24 15:54:12 INFO Client: Preparing resources for our AM container
18/01/24 15:54:14 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/01/24 15:54:18 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_libs__8429498492477236801.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_libs__8429498492477236801.zip
18/01/24 15:54:22 INFO Client: Uploading resource s3://projects/wordcount/wordcount.py -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/wordcount.py
18/01/24 15:54:22 INFO S3NativeFileSystem: Opening 's3://projects/wordcount/wordcount.py' for reading
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/pyspark.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/py4j-0.10.4-src.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_conf__8267377904454396581.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_conf__.zip
18/01/24 15:54:22 INFO SecurityManager: Changing view acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing view acls groups to: 
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls groups to: 
18/01/24 15:54:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/01/24 15:54:22 INFO Client: Submitting application application_1516806627838_0002 to ResourceManager
18/01/24 15:54:23 INFO YarnClientImpl: Submitted application application_1516806627838_0002
18/01/24 15:54:23 INFO Client: Application report for application_1516806627838_0002 (state: ACCEPTED)
18/01/24 15:54:23 INFO Client: 
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: default
   start time: 1516809262990
   final status: UNDEFINED
   tracking URL: http://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:20888/proxy/application_1516806627838_0002/
   user: hadoop
18/01/24 15:54:23 INFO ShutdownHookManager: Shutdown hook called
18/01/24 15:54:23 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7
Command exiting with ret '0'

It turns out that the problem was caused by the destination folder already existing (even though empty). Deleting the output folder makes the example work.

I figured this out by reading the step logs in S3 rather than the instance logs in the EMR console -- in those logs I saw an org.apache.hadoop.mapred.FileAlreadyExistsException, which clued me in.

A pre-existing S3 folder isn't a problem for other writing tasks I've done (e.g., PigStorage), so I wasn't expecting this.
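
So the fix is simply to make sure the output prefix does not exist before the step runs; one way to do that (using the same paths as above) is:

# remove any leftover output prefix before re-submitting the step
aws s3 rm s3://projects/wordcount/output/ --recursive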

I'm going to leave this question up in the (unlikely) event anyone else runs into this.

The following add-step worked for me; I ran it from the master node:

aws emr add-steps --cluster-id yourclusterid \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://yourbucketname/yourcode.py,\
s3://yourbucketname/yourinputfile.txt,\
s3://yourbucketname/youroutputfile.out],\
ActionOnFailure=CONTINUE
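
If the step succeeds, saveAsTextFile writes the results as part files under the output URI; a quick way to inspect them (bucket and key names here are just the placeholders from the command above) would be:

aws s3 ls s3://yourbucketname/youroutputfile.out/
# pull one part file (typically part-00000, part-00001, ...) to stdout
aws s3 cp s3://yourbucketname/youroutputfile.out/part-00000 - | head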
