
Spark with GPUs: How to force 1 task per executor

I have Spark 2.1.0 running on a cluster with N slave nodes. Each node has 16 cores (8 cores/CPU and 2 CPUs) and 1 GPU. I want to use a map operation to launch a GPU kernel. Since there is only 1 GPU per node, I need to ensure that two executors are not on the same node (at the same time) trying to use the GPU, and that two tasks are not submitted to the same executor at the same time.

How can I force Spark to have one executor per node?

I have tried the following:

--Setting: spark.executor.cores 16 in $SPARK_HOME/conf/spark-defaults.conf

--Setting: SPARK_WORKER_CORES = 16 and SPARK_WORKER_INSTANCES = 1 in $SPARK_HOME/conf/spark-env.sh

and,

--Setting conf = SparkConf().set('spark.executor.cores', 16).set('spark.executor.instances', 6) directly in my spark script (when I wanted N = 6 for debugging purposes).

These options create 6 executors on different nodes as desired, but it seems that each task is assigned to the same executor.
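For reference, a minimal sketch of how the third attempt looks in my script (the app name and everything after the SparkContext are placeholders, not the real code):

# Minimal sketch of the in-script configuration described above; names are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gpu-one-task-per-executor")   # placeholder app name
        .set("spark.executor.cores", "16")          # each executor should own all 16 cores of a node
        .set("spark.executor.instances", "6"))      # N = 6 nodes while debugging
sc = SparkContext(conf=conf)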

Here are some snippets from my most recent output (which led me to believe it should be working as I want).

17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/0 on worker-20170217110853-10.128.14.208-35771 (10.128.14.208:35771) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/0 on hostPort 10.128.14.208:35771 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/1 on worker-20170217110853-10.128.9.95-59294 (10.128.9.95:59294) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/1 on hostPort 10.128.9.95:59294 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/2 on worker-20170217110853-10.128.3.71-47507 (10.128.3.71:47507) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/2 on hostPort 10.128.3.71:47507 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/3 on worker-20170217110853-10.128.9.96-50800 (10.128.9.96:50800) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/3 on hostPort 10.128.9.96:50800 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/4 on worker-20170217110853-10.128.3.73-60194 (10.128.3.73:60194) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/4 on hostPort 10.128.3.73:60194 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/5 on worker-20170217110853-10.128.3.74-42793 (10.128.3.74:42793) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/5 on hostPort 10.128.3.74:42793 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/1 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/3 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/4 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/2 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/0 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/5 is now RUNNING
17/02/17 11:09:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 

My RDD has 6 partitions.

The important thing is that 6 executors were started, each with a different IP address and each getting 16 cores (exactly what I expected). The line "My RDD has 6 partitions." is a print statement from my code after repartitioning my RDD (to make sure I had 1 partition per executor).
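The relevant part of the script looks roughly like this (a sketch only: run_gpu_kernel is a stand-in for my real kernel-launch code, and the input data is made up):

# Sketch: run_gpu_kernel() stands in for the real GPU call; the data here is dummy data.
def run_gpu_kernel(partition):
    # In the real code this copies the partition's data to the GPU, launches the kernel,
    # and yields results back to Spark; here it is a no-op placeholder.
    for record in partition:
        yield record

rdd = sc.parallelize(range(6000)).repartition(6)   # aim for one partition per executor
print("My RDD has %d partitions." % rdd.getNumPartitions())
results = rdd.mapPartitions(run_gpu_kernel).collect()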

Then, THIS happens... all 6 tasks are sent to the same executor!

17/02/17 11:09:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/17 11:09:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.128.9.95:34059) with ID 1
17/02/17 11:09:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.128.9.95, executor 1, partition 0, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.128.9.95, executor 1, partition 1, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.128.9.95, executor 1, partition 2, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.128.9.95, executor 1, partition 3, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.128.9.95, executor 1, partition 4, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.128.9.95, executor 1, partition 5, PROCESS_LOCAL, 6095 bytes)

Why? And how can I fix it? The problem is that at this point, all 6 tasks compete for the same GPU, and the GPU cannot be shared.

I tried the suggestions in the comments from Samson Scharfrichter, but they didn't seem to work. However, I found http://spark.apache.org/docs/latest/configuration.html#scheduling , which includes spark.task.cpus. If I set that to 16 and spark.executor.cores to 16, then I appear to get one task assigned to each executor.
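In other words, a configuration along these lines appears to do the trick (values match my 16-core nodes; the same pair of settings could equally go in spark-defaults.conf):

# Setting spark.task.cpus equal to spark.executor.cores makes each task claim all of an
# executor's cores, so the scheduler can only run one task per executor at a time.
conf = (SparkConf()
        .set("spark.executor.cores", "16")
        .set("spark.task.cpus", "16"))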
