
Launch Python app with spark-submit in AWS EMR

I'm new to Spark and having trouble replicating the example in the EMR docs for submitting a basic user application with spark-submit via the AWS CLI. It seems to run without error but produces no output. Is something wrong with my syntax for add-steps in the workflow below?

Example script

The objective is to count the words in a document in S3; for this example, 1000 words of lorem ipsum:

$ aws s3 cp s3://projects/wordcount/input/some_document.txt - | head -n1
Lorem ipsum dolor sit amet, consectetur adipiscing [... etc.]

Copying from the docs, the Python script looks like:

$ aws s3 cp s3://projects/wordcount/wordcount.py -
from __future__ import print_function
from pyspark import SparkContext
import sys
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: wordcount  ", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="WordCount")
    text_file = sc.textFile(sys.argv[1])
    counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()

The destination folder (currently empty):

$ aws s3 ls s3://projects/wordcount/output
                           PRE output/

create-cluster

Working from the docs, I have a running cluster with logging, as produced by:

aws emr create-cluster --name TestSparkCluster \
--release-label emr-5.11.0 --applications Name=Spark \
--enable-debugging --log-uri s3://projects/wordcount/log \
--instance-type m3.xlarge --instance-count 3 --use-default-roles

with a return message showing the created {"ClusterID": "j-XXXXXXXXXXXXX"}
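
Not strictly necessary, but to confirm from the CLI that the cluster is up before adding steps, a quick check (using the ClusterId returned above) could be:

aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX \
--query 'Cluster.Status.State'
# expect "WAITING" once the cluster is ready to accept steps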

add-steps

Looking directly at the example, I submit add-steps as:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://projects/wordcount/wordcount.py,\
s3://projects/wordcount/input/some_document.txt,\
s3://projects/wordcount/output/],\
ActionOnFailure=CONTINUE

which launches { "StepIds":["s-YYYYYYYYYYY"] }
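
The step state can also be polled from the CLI rather than the console; a check along these lines (using the StepId returned above) should report PENDING, then RUNNING, then COMPLETED:

aws emr describe-step --cluster-id j-XXXXXXXXXXXXX \
--step-id s-YYYYYYYYYYY --query 'Step.Status.State'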

The problem

The output folder is empty -- why?

I verify that step SparkWordCountApp with ID s-YYYYYYYYYYY has Status:Completed in the EMR console.

From the console, I check the controller log file and the stderr output (below) to verify that the step completed with exit status 0.

In the Spark documentation, a somewhat different syntax is used. Instead of making the script name the first position in the arguments list, it says:

For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

However, the example uses sys.argv, where sys.argv[0] is wordcount.py.
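
For reference, the step arguments above reduce to an ordinary spark-submit invocation of roughly the following shape (the controller log below shows the exact command). The .py file sits where <application-jar> would, and the two S3 URIs that follow it arrive in the script as sys.argv[1] and sys.argv[2]:

spark-submit --deploy-mode cluster --master yarn \
--conf spark.yarn.submit.waitAppCompletion=false \
--num-executors 2 --executor-cores 2 --executor-memory 1g \
s3://projects/wordcount/wordcount.py \
s3://projects/wordcount/input/some_document.txt \
s3://projects/wordcount/output/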

Additional info: logs

Controller log file:

2018-01-24T15:54:05.945Z INFO Ensure step 3 jar file command-runner.jar
2018-01-24T15:54:05.945Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-west-2
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  PYTHON_INSTALL_LAYOUT=amzn
  HOSTNAME=ip-172-31-12-232
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-29XVS3IGMSK1
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-29XVS3IGMSK1/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-29XVS3IGMSK1
INFO ProcessRunner started child process 20797 :
hadoop   20797  3347  0 15:54 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --master yarn --conf spark.yarn.submit.waitAppCompletion=false --num-executors 2 --executor-cores 2 --executor-memory 1g s3://projects/wordcount/wordcount.py s3://projects/wordcount/input/some_document.txt s3://projects/wordcount/output/
2018-01-24T15:54:09.956Z INFO HadoopJarStepRunner.Runner: startRun() called for s-29XVS3IGMSK1 Child Pid: 20797
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 16 seconds
2018-01-24T15:54:24.072Z INFO Step created jobs: 
2018-01-24T15:54:24.072Z INFO Step succeeded with exitCode 0 and took 16 seconds

Stderr log file:

18/01/24 15:54:12 INFO RMProxy: Connecting to ResourceManager at ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal/XXX.YY.YY.ZZZ:8032
18/01/24 15:54:12 INFO Client: Requesting a new application from cluster with 2 NodeManagers
18/01/24 15:54:12 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
18/01/24 15:54:12 INFO Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
18/01/24 15:54:12 INFO Client: Setting up container launch context for our AM
18/01/24 15:54:12 INFO Client: Setting up the launch environment for our AM container
18/01/24 15:54:12 INFO Client: Preparing resources for our AM container
18/01/24 15:54:14 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/01/24 15:54:18 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_libs__8429498492477236801.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_libs__8429498492477236801.zip
18/01/24 15:54:22 INFO Client: Uploading resource s3://projects/wordcount/wordcount.py -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/wordcount.py
18/01/24 15:54:22 INFO S3NativeFileSystem: Opening 's3://projects/wordcount/wordcount.py' for reading
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/pyspark.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/py4j-0.10.4-src.zip
18/01/24 15:54:22 INFO Client: Uploading resource file:/mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7/__spark_conf__8267377904454396581.zip -> hdfs://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:8020/user/hadoop/.sparkStaging/application_1516806627838_0002/__spark_conf__.zip
18/01/24 15:54:22 INFO SecurityManager: Changing view acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls to: hadoop
18/01/24 15:54:22 INFO SecurityManager: Changing view acls groups to: 
18/01/24 15:54:22 INFO SecurityManager: Changing modify acls groups to: 
18/01/24 15:54:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/01/24 15:54:22 INFO Client: Submitting application application_1516806627838_0002 to ResourceManager
18/01/24 15:54:23 INFO YarnClientImpl: Submitted application application_1516806627838_0002
18/01/24 15:54:23 INFO Client: Application report for application_1516806627838_0002 (state: ACCEPTED)
18/01/24 15:54:23 INFO Client: 
   client token: N/A
   diagnostics: N/A
   ApplicationMaster host: N/A
   ApplicationMaster RPC port: -1
   queue: default
   start time: 1516809262990
   final status: UNDEFINED
   tracking URL: http://ip-XXX-YY-YY-ZZZ.us-west-2.compute.internal:20888/proxy/application_1516806627838_0002/
   user: hadoop
18/01/24 15:54:23 INFO ShutdownHookManager: Shutdown hook called
18/01/24 15:54:23 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-89654b91-c4db-4847-aa4b-22f27240daf7
Command exiting with ret '0'

It turns out that the problem was caused by the destination folder already existing (even though empty). Deleting the output folder makes the example work.

I figured this out by reading the step logs in S3 rather than the instance logs in the EMR console -- in those logs I saw an org.apache.hadoop.mapred.FileAlreadyExistsException, which clued me in.

A pre-existing S3 folder isn't a problem for other writing tasks I've done (e.g., PigStorage), so I wasn't expecting this.
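
So the fix is simply to make sure the output prefix does not exist before the step runs; one way to do that (using the same paths as above) is:

# remove any leftover output prefix before re-submitting the step
aws s3 rm s3://projects/wordcount/output/ --recursive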

I'm going to leave this question up in the (unlikely) event anyone else runs into this.

The following add-step worked for me; I ran it from the master node:

aws emr add-steps --cluster-id yourclusterid \
--steps Type=spark,Name=SparkWordCountApp,\
Args=[--deploy-mode,cluster,--master,yarn,\
--conf,spark.yarn.submit.waitAppCompletion=false,\
--num-executors,2,--executor-cores,2,--executor-memory,1g,\
s3://yourbucketname/yourcode.py,\
s3://yourbucketname/yourinputfile.txt,\
s3://yourbucketname/youroutputfile.out],\
ActionOnFailure=CONTINUE
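
If the step succeeds, saveAsTextFile writes the results as part files under the output URI; a quick way to inspect them (bucket and key names here are just the placeholders from the command above) would be:

aws s3 ls s3://yourbucketname/youroutputfile.out/
# pull one part file (typically part-00000, part-00001, ...) to stdout
aws s3 cp s3://yourbucketname/youroutputfile.out/part-00000 - | head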
