
Spark EMR: Need to configure multiple spark-submits to work in parallel in an EMR cluster

I am new to AWS EMR and I am facing the problem of making multiple spark-submits run in parallel.

I have some jobs scheduled to run every 10 minutes and one job that runs every 6 hours. The cluster has enough resources to run them all at the same time, but the default configuration puts them all into the single root.default queue, which makes them run sequentially. This is not what I want. What do I have to write in the configuration files?

I've tried adding queues "1", "2" and "3" to the root queue (in yarn-site.xml) and spark-submitting each job into a separate queue. But they still run sequentially, not in parallel as I want.

spark-submit --queue 1 --num-executors 1  s3://bucket/some-job.py

spark-submit --queue 2 --num-executors 1  s3://bucket/some-job.py

I've found an unexpected solution.

It seems the current Hadoop documentation does not reflect the correct YARN configuration for AWS EMR, so I used trial and error to find a way to make it work.

Instead of the Capacity Scheduler I used the Fair Scheduler. But it still places all apps in the same queue ("pool"), so I had to manually submit each job into a separate queue and configure those queues to consume an appropriate amount of resources. This is what I've done:

yarn-site.xml

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>

<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>fair-scheduler.xml</value>
</property>

<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>

The purpose of the Fair Scheduler is to schedule tasks and distribute resources fairly. But, surprisingly, it does not do so unless preemption is enabled. Without it, the first task takes all the resources and does not give them back until it finishes, unless the other queues explicitly demand them.

This is how I set up preemption:

fair-scheduler.xml

<allocations>
  <pool name="smalltask">
    <schedulingMode>FAIR</schedulingMode>
    <maxRunningApps>4</maxRunningApps>
    <weight>1</weight>
    <fairSharePreemptionThreshold>0.4</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
  </pool>

  <pool name="bigtask">
    <schedulingMode>FAIR</schedulingMode>
    <maxRunningApps>2</maxRunningApps>
    <fairSharePreemptionThreshold>0.6</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
    <weight>2</weight>
  </pool>
</allocations>
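Before restarting the ResourceManager it can be worth sanity-checking the allocation file; a malformed pool or thresholds that sum to more than 1.0 would make the pools preempt each other constantly. This is just an illustrative sketch (the `pool_settings` helper is hypothetical, and the XML is inlined here instead of being read from fair-scheduler.xml):

```python
import xml.etree.ElementTree as ET

# The allocation file from above, inlined for illustration;
# in practice you would read fair-scheduler.xml from disk.
ALLOCATIONS = """
<allocations>
  <pool name="smalltask">
    <schedulingMode>FAIR</schedulingMode>
    <maxRunningApps>4</maxRunningApps>
    <weight>1</weight>
    <fairSharePreemptionThreshold>0.4</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
  </pool>
  <pool name="bigtask">
    <schedulingMode>FAIR</schedulingMode>
    <maxRunningApps>2</maxRunningApps>
    <fairSharePreemptionThreshold>0.6</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
    <weight>2</weight>
  </pool>
</allocations>
"""

def pool_settings(xml_text):
    """Return {pool_name: {setting: value}} parsed from an allocation file."""
    root = ET.fromstring(xml_text)
    return {
        pool.get("name"): {child.tag: child.text for child in pool}
        for pool in root.findall("pool")
    }

pools = pool_settings(ALLOCATIONS)
# Both pools must exist, and their preemption thresholds should not
# exceed 1.0 combined, or the pools would keep preempting each other.
assert set(pools) == {"smalltask", "bigtask"}
total = sum(float(p["fairSharePreemptionThreshold"]) for p in pools.values())
assert total <= 1.0
print(pools["bigtask"]["weight"])  # -> 2
```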

Now I have 2 queues (big and small): the small queue can run 4 small tasks at a time, and the big queue can run 2 big tasks at a time. The big queue has a higher weight, so it demands more resources. If the small queue takes more than 40% of the cluster's resources, the other queue starts "nationalizing" it and takes resources away; the same applies to the big queue at 60%. I don't know for sure what happens inside each queue, but resources seem to be distributed equally between the apps within it.

My New Year's wish would be detailed documentation for Hadoop and EMR.

I have tried both, but neither has worked for me.

The closest I got is the CAPACITY SCHEDULER. I have 4 queues: default, gold, silver, bronze. An individual spark-submit (in fact, a series of 4, which is my unit of work) works well with each queue using a --queue gold|silver option.

But when I run 2 sets (say, 1 in gold and 1 in silver), both of them eventually hang.

Regards, Suvo

Simply provide the configuration for the queues. When you create the EMR cluster, you can supply configurations for yarn.scheduler. If you have preferred configurations, specify them. For example:

{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.root.queues": "default, gold, silver, bronze"
  }
}

This will give you several queue channels.
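One caveat worth noting: the Capacity Scheduler also expects a capacity to be assigned to each queue, and the capacities under a parent queue must sum to 100. A fuller classification might look like the fragment below (the specific percentages are illustrative assumptions, not values from the original post):

```json
{
  "Classification": "capacity-scheduler",
  "Properties": {
    "yarn.scheduler.capacity.root.queues": "default, gold, silver, bronze",
    "yarn.scheduler.capacity.root.default.capacity": "10",
    "yarn.scheduler.capacity.root.gold.capacity": "40",
    "yarn.scheduler.capacity.root.silver.capacity": "30",
    "yarn.scheduler.capacity.root.bronze.capacity": "20"
  }
}
```

Without per-queue capacities, applications submitted to the new queues may be accepted but never scheduled, which can look like the hang described above.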


Another option is to modify the EMR cluster while it is already running. It is similar to the above, but can be done through the AWS CLI or other SDKs. See the article.

It uses the command

aws emr modify-instance-groups --cli-input-json file://some.json

with some.json taking a form such as:

{
   "ClusterId":"j-MyClusterID",
   "InstanceGroups":[
      {
         "InstanceGroupId":"ig-MyMasterId",
         "Configurations":[
            {
               "Classification":"capacity-scheduler",
               "Properties":{
                  "yarn.scheduler.capacity.root.queues":"default, bronze, silver, gold"
               },
               "Configurations":[]
            }
         ]
      }
   ]
}
