How to run Spark Scala code on Amazon EMR

I am trying to run the following Spark code, written in Scala, on Amazon EMR:

import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)

    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    println(wordCounts)
  }
}

This is the script I use to deploy the above code to EMR:

#!/usr/bin/env bash

set -euxo pipefail

cluster_id='j-XXXXXXXXXX'

app_name="HelloWorld"

main_class="TestRunner"
jar_name="HelloWorld-assembly-0.0.1-SNAPSHOT.jar"
jar_path="target/scala-2.11/${jar_name}"
s3_jar_dir="s3://jars/"
s3_jar_path="${s3_jar_dir}${jar_name}"
###################################################

sbt assembly

aws s3 cp ${jar_path} ${s3_jar_dir}

aws emr add-steps --cluster-id ${cluster_id} --steps Type=spark,Name=${app_name},Args=[--deploy-mode,cluster,--master,yarn-cluster,--class,${main_class},${s3_jar_path}],ActionOnFailure=CONTINUE

However, after a few minutes the step exits without producing any output at all in AWS!

Here is my controller's output:

2016-10-20T21:03:17.043Z INFO Ensure step 3 jar file command-runner.jar
2016-10-20T21:03:17.043Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/mscheiber/HelloWorld-assembly-0.0.1-SNAPSHOT.jar'
INFO Environment:
  PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
  LESS_TERMCAP_md=[01;38;5;208m
  LESS_TERMCAP_me=[0m
  HISTCONTROL=ignoredups
  LESS_TERMCAP_mb=[01;31m
  AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
  UPSTART_JOB=rc
  LESS_TERMCAP_se=[0m
  HISTSIZE=1000
  HADOOP_ROOT_LOGGER=INFO,DRFA
  JAVA_HOME=/etc/alternatives/jre
  AWS_DEFAULT_REGION=us-east-1
  AWS_ELB_HOME=/opt/aws/apitools/elb
  LESS_TERMCAP_us=[04;38;5;111m
  EC2_HOME=/opt/aws/apitools/ec2
  TERM=linux
  XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
  runlevel=3
  LANG=en_US.UTF-8
  AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
  MAIL=/var/spool/mail/hadoop
  LESS_TERMCAP_ue=[0m
  LOGNAME=hadoop
  PWD=/
  LANGSH_SOURCED=1
  HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3/tmp
  _=/etc/alternatives/jre/bin/java
  CONSOLETYPE=serial
  RUNLEVEL=3
  LESSOPEN=||/usr/bin/lesspipe.sh %s
  previous=N
  UPSTART_EVENTS=runlevel
  AWS_PATH=/opt/aws
  USER=hadoop
  UPSTART_INSTANCE=
  PREVLEVEL=N
  HADOOP_LOGFILE=syslog
  HOSTNAME=ip-10-17-186-102
  NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
  HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3
  EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
  SHLVL=5
  HOME=/home/hadoop
  HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3
INFO ProcessRunner started child process 24549 :
hadoop    24549   4780  0 21:03 ?        00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/TestRunner-assembly-0.0.1-SNAPSHOT.jar
2016-10-20T21:03:21.050Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3UAS8JQ0KEOV3 Child Pid: 24549
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 44 seconds
2016-10-20T21:04:03.102Z INFO Step created jobs: 
2016-10-20T21:04:03.103Z INFO Step succeeded with exitCode 0 and took 44 seconds

syslog and stdout are empty, and this is what is in my stderr:

16/10/20 21:03:20 INFO RMProxy: Connecting to ResourceManager at ip-10-17-186-102.ec2.internal/10.17.186.102:8032
16/10/20 21:03:21 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/10/20 21:03:21 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (53248 MB per container)
16/10/20 21:03:21 INFO Client: Will allocate AM container, with 53247 MB memory including 4840 MB overhead
16/10/20 21:03:21 INFO Client: Setting up container launch context for our AM
16/10/20 21:03:21 INFO Client: Setting up the launch environment for our AM container
16/10/20 21:03:21 INFO Client: Preparing resources for our AM container
16/10/20 21:03:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/20 21:03:22 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_libs__5484581201997889110.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_libs__5484581201997889110.zip
16/10/20 21:03:24 INFO Client: Uploading resource s3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/DataScience-assembly-0.0.1-SNAPSHOT.jar
16/10/20 21:03:24 INFO S3NativeFileSystem: Opening 's3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar' for reading
16/10/20 21:03:26 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_conf__5724047842379101980.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_conf__.zip
16/10/20 21:03:26 INFO SecurityManager: Changing view acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing view acls groups to: 
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls groups to: 
16/10/20 21:03:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
16/10/20 21:03:26 INFO Client: Submitting application application_1476995377469_0002 to ResourceManager
16/10/20 21:03:26 INFO YarnClientImpl: Submitted application application_1476995377469_0002
16/10/20 21:03:27 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:27 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1476997406896
     final status: UNDEFINED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:03:28 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:29 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:30 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:31 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:31 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.17.181.184
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1476997406896
     final status: UNDEFINED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:03:32 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:33 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:34 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:35 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:36 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:37 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:38 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:39 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:40 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:41 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:42 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:43 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:44 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:45 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:46 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:47 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:48 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:49 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:50 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:51 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:52 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:53 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:54 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:55 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:56 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:57 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:58 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:59 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:04:00 INFO Client: Application report for application_1476995377469_0002 (state: FINISHED)
16/10/20 21:04:00 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 10.17.181.184
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1476997406896
     final status: SUCCEEDED
     tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
     user: hadoop
16/10/20 21:04:00 INFO Client: Deleting staging directory hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002
16/10/20 21:04:00 INFO ShutdownHookManager: Shutdown hook called
16/10/20 21:04:00 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4
Command exiting with ret '0'

What am I missing?

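It looks like your application succeeded. However, there are two reasons why you don't see any output in the step's stdout logs.

1) You ran the application in yarn-cluster mode, which means the driver runs on an arbitrary cluster node rather than on the master node. If you specified an S3 log URI when creating the cluster, you should see the logs for this application in the containers directory of your S3 bucket. The driver's logs will be in the logs for container #0.

2) You never call anything like collect() to bring the data from the Spark executors back to the driver, so the println() at the end prints the toString() representation of the RDD rather than its data. You probably want to do something like wordCounts.collect().foreach(println).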
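
For point 1, one way to find the driver's output is to list the application's container logs in S3. The sketch below assumes the cluster was created with a log URI and that EMR archives container logs under <log-uri>/<cluster-id>/containers/<application-id>/. The bucket name s3://my-emr-logs and both helper function names are hypothetical; the application ID is the one shown in the stderr output above.

```shell
#!/usr/bin/env bash
# Sketch: locate the YARN container logs that EMR archives to S3 for a
# Spark application submitted in cluster deploy mode.

# Build the S3 prefix holding one application's container logs.
# Assumed layout: <log-uri>/<cluster-id>/containers/<application-id>/
container_log_prefix() {
  local log_uri="${1%/}"   # drop one trailing slash, if present
  local cluster_id="$2"
  local app_id="$3"
  printf '%s/%s/containers/%s/\n' "${log_uri}" "${cluster_id}" "${app_id}"
}

# List every archived log file for the application.
# Needs the AWS CLI and credentials that can read the log bucket.
list_container_logs() {
  aws s3 ls --recursive "$(container_log_prefix "$@")"
}

# Example invocation (hypothetical bucket, not run automatically):
#   list_container_logs 's3://my-emr-logs' 'j-XXXXXXXXXX' \
#     'application_1476995377469_0002'
```

The driver's println output should then be in the stdout file of one of the listed containers. If YARN log aggregation is enabled on the cluster, running yarn logs -applicationId application_1476995377469_0002 on the master node is another way to read the same output.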

