
Submit a Spark application via AWS EMR

Hello, I'm very new to cloud computing, so I apologize if this is a naive question. I need help figuring out whether what I'm doing actually computes on the cluster, or only on the master node (which would be useless).

WHAT I CAN DO: Using the AWS console, I can set up a cluster with a given number of nodes, with Spark installed on all of them. I can connect to the master node via SSH. What is required, then, to run my jar with Spark code on the cluster?

WHAT I'D DO: I'd call spark-submit to run my code:

spark-submit --class cc.Main /home/ubuntu/MySparkCode.jar 3 [arguments]

MY DOUBTS:

  1. Is it necessary to specify the master with --master and the "spark://" reference of the master? Where could I find that reference? Should I run the sbin/start-master.sh script to start a standalone cluster manager, or is one already set up? If I run the command above, I imagine the code would run only locally on the master, right?

  2. Can I keep my input files only on the master node? Suppose I want to count the words of a huge text file: can I keep it only on the master's disk, or do I need distributed storage like HDFS to preserve parallelism? I don't understand this part; I would keep the file on the master node's disk if it fits.

Thanks in advance for any reply.

UPDATE 1: I tried to run the Pi example on the cluster, but I can't get the result.

$ sudo spark-submit   --class org.apache.spark.examples.SparkPi   --master yarn   --deploy-mode cluster   /usr/lib/spark/examples/jars/spark-examples.jar   10

I would expect to get a printed line such as Pi is roughly 3.14...; instead I get:

17/04/15 13:16:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/15 13:16:03 INFO RMProxy: Connecting to ResourceManager at ip-172-31-37-222.us-west-2.compute.internal/172.31.37.222:8032
17/04/15 13:16:03 INFO Client: Requesting a new application from cluster with 2 NodeManagers 
17/04/15 13:16:03 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
17/04/15 13:16:03 INFO Client: Will allocate AM container, with 5120 MB memory including 465 MB overhead
17/04/15 13:16:03 INFO Client: Setting up container launch context for our AM
17/04/15 13:16:03 INFO Client: Setting up the launch environment for our AM container
17/04/15 13:16:03 INFO Client: Preparing resources for our AM container
17/04/15 13:16:06 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/04/15 13:16:10 INFO Client: Uploading resource file:/mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9/__spark_libs__5838015067814081789.zip -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/__spark_libs__5838015067814081789.zip
17/04/15 13:16:12 INFO Client: Uploading resource file:/usr/lib/spark/examples/jars/spark-examples.jar -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/spark-examples.jar
17/04/15 13:16:12 INFO Client: Uploading resource file:/mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9/__spark_conf__1370316719712336297.zip -> hdfs://ip-172-31-37-222.us-west-2.compute.internal:8020/user/root/.sparkStaging/application_1492261407069_0007/__spark_conf__.zip
17/04/15 13:16:13 INFO SecurityManager: Changing view acls to: root
17/04/15 13:16:13 INFO SecurityManager: Changing modify acls to: root
17/04/15 13:16:13 INFO SecurityManager: Changing view acls groups to: 
17/04/15 13:16:13 INFO SecurityManager: Changing modify acls groups to: 
17/04/15 13:16:13 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()

17/04/15 13:16:13 INFO Client: Submitting application application_1492261407069_0007 to ResourceManager
17/04/15 13:16:13 INFO YarnClientImpl: Submitted application application_1492261407069_0007
17/04/15 13:16:14 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:14 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1492262173096
     final status: UNDEFINED
     tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
     user: root
17/04/15 13:16:15 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:24 INFO Client: Application report for application_1492261407069_0007 (state: ACCEPTED)
17/04/15 13:16:25 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:25 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.31.33.215
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1492262173096
     final status: UNDEFINED
     tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
     user: root
17/04/15 13:16:26 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:55 INFO Client: Application report for application_1492261407069_0007 (state: RUNNING)
17/04/15 13:16:56 INFO Client: Application report for application_1492261407069_0007 (state: FINISHED)
17/04/15 13:16:56 INFO Client: 
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 172.31.33.215
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1492262173096
     final status: SUCCEEDED
     tracking URL: http://ip-172-31-37-222.us-west-2.compute.internal:20888/proxy/application_1492261407069_0007/
     user: root
17/04/15 13:16:56 INFO ShutdownHookManager: Shutdown hook called
17/04/15 13:16:56 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-aa757ca0-4ff7-460c-8bee-27bc8c8dada9
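(For context: the SparkPi example estimates π by Monte Carlo sampling, counting how many random points in the unit square fall inside the quarter circle. A minimal plain-Python sketch of the same computation, without Spark, is shown below; the function name and seed are illustrative, not part of the Spark example.)

```python
import random

def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling random points in the unit square
    and counting how many land inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(f"Pi is roughly {estimate_pi(1_000_000)}")
```

SparkPi does the same thing, but splits the sampling across the executors; the "10" argument controls how many partitions (slices) the work is divided into.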

Answer to the first doubt:

I am assuming that you want to run Spark on YARN. You can just pass --master yarn --deploy-mode cluster; the Spark driver then runs inside an application master process that is managed by YARN on the cluster:

spark-submit --master yarn  --deploy-mode cluster \
    --class cc.Main /home/ubuntu/MySparkCode.jar 3 [arguments] 

See the Spark documentation for the other deploy modes.

When you run the job with --deploy-mode cluster, you don't see the output (if you are printing something) on the machine from which you submitted.

Reason: in cluster mode the driver runs on one of the nodes in the cluster, and the output is emitted on that machine.

To check the output, you can look in the application log using the following command:

yarn logs -applicationId <application_id>

Answer to the second doubt:

You can keep your input files anywhere (on the master node or in HDFS).

Parallelism depends entirely on the number of partitions of the RDD/DataFrame created when you load the data. The number of partitions depends on the data size, though you can control it by passing parameters when you load the data.

If you are loading data from the master:

 val rdd = sc.textFile("/home/ubuntu/input.txt", [number of partitions])

The RDD will be created with the number of partitions you passed. If you do not pass a number of partitions, Spark falls back to the spark.default.parallelism value configured in the Spark conf.
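As a rough illustration of what "number of partitions" means (plain Python, not Spark's actual partitioning logic), splitting a file's lines into a requested number of roughly equal partitions looks like this; the helper name is hypothetical:

```python
def partition_lines(lines, num_partitions):
    """Split a list of lines into num_partitions roughly equal chunks,
    loosely mimicking how an RDD's records are divided for parallel work."""
    size, extra = divmod(len(lines), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(lines[start:end])
        start = end
    return parts

lines = [f"line {i}" for i in range(10)]
parts = partition_lines(lines, 3)
print([len(p) for p in parts])  # → [4, 3, 3]
```

Each chunk would then be processed by a separate task, which is where the parallelism comes from.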

If you are loading data from HDFS:

 val rdd =  sc.textFile("hdfs://namenode:8020/data/input.txt")

The RDD will be created with a number of partitions equal to the number of blocks the file occupies inside HDFS.
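Assuming the common default HDFS block size of 128 MiB (the actual value depends on the cluster's dfs.blocksize setting), the resulting partition count can be estimated as a sketch:

```python
import math

def hdfs_partition_count(file_size_bytes: int,
                         block_size_bytes: int = 128 * 1024 * 1024) -> int:
    """Estimated number of HDFS blocks for a file (one RDD partition
    per block), rounding up; files smaller than one block get one."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GiB file with 128 MiB blocks spans 8 blocks -> 8 partitions
print(hdfs_partition_count(1024 * 1024 * 1024))  # → 8
```

This is only an estimate; Spark's Hadoop input splits can differ slightly from raw block boundaries.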

Hope my answers help you.

You can use this:

spark-submit --deploy-mode client --executor-memory 4g --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar
