
Spark with GPUs: How to force 1 task per executor

I have Spark 2.1.0 running on a cluster with N slave nodes. Each node has 16 cores (8 cores/CPU and 2 CPUs) and 1 GPU. I want to use a map operation to launch a GPU kernel. Since there is only 1 GPU per node, I need to ensure that two executors are not on the same node (at the same time) trying to use the GPU, and that two tasks are not submitted to the same executor at the same time.

How can I force Spark to have one executor per node?

I have tried the following:

--Setting: spark.executor.cores 16 in $SPARK_HOME/conf/spark-defaults.conf

--Setting: SPARK_WORKER_CORES = 16 and SPARK_WORKER_INSTANCES = 1 in $SPARK_HOME/conf/spark-env.sh

and,

--Setting conf = SparkConf().set('spark.executor.cores', 16).set('spark.executor.instances', 6) directly in my spark script (when I wanted N = 6 for debugging purposes).

These options create 6 executors on different nodes as desired, but it seems that each task is assigned to the same executor.
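For reference, a minimal sketch of how the third attempt looks in my script (the app name and everything after the SparkContext are placeholders, not the real code):

# Minimal sketch of the in-script configuration described above; names are illustrative.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("gpu-one-task-per-executor")   # placeholder app name
        .set("spark.executor.cores", "16")          # each executor should own all 16 cores of a node
        .set("spark.executor.instances", "6"))      # N = 6 nodes while debugging
sc = SparkContext(conf=conf)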

Here are some snippets from my most recent output (which led me to believe it should be working as I want).

17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/0 on worker-20170217110853-10.128.14.208-35771 (10.128.14.208:35771) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/0 on hostPort 10.128.14.208:35771 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/1 on worker-20170217110853-10.128.9.95-59294 (10.128.9.95:59294) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/1 on hostPort 10.128.9.95:59294 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/2 on worker-20170217110853-10.128.3.71-47507 (10.128.3.71:47507) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/2 on hostPort 10.128.3.71:47507 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/3 on worker-20170217110853-10.128.9.96-50800 (10.128.9.96:50800) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/3 on hostPort 10.128.9.96:50800 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/4 on worker-20170217110853-10.128.3.73-60194 (10.128.3.73:60194) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/4 on hostPort 10.128.3.73:60194 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/5 on worker-20170217110853-10.128.3.74-42793 (10.128.3.74:42793) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/5 on hostPort 10.128.3.74:42793 with 16 cores, 16.0 GB RAM 
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/1 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/3 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/4 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/2 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/0 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/5 is now RUNNING
17/02/17 11:09:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 

My RDD has 6 partitions.

The important thing is that 6 executors were started, each with a different IP address and each getting 16 cores (exactly what I expected). The line "My RDD has 6 partitions." is a print statement from my code after repartitioning my RDD (to make sure I had 1 partition per executor).
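The relevant part of the script looks roughly like this (a sketch only: run_gpu_kernel is a stand-in for my real kernel-launch code, and the input data is made up):

# Sketch: run_gpu_kernel() stands in for the real GPU call; the data here is dummy data.
def run_gpu_kernel(partition):
    # In the real code this copies the partition's data to the GPU, launches the kernel,
    # and yields results back to Spark; here it is a no-op placeholder.
    for record in partition:
        yield record

rdd = sc.parallelize(range(6000)).repartition(6)   # aim for one partition per executor
print("My RDD has %d partitions." % rdd.getNumPartitions())
results = rdd.mapPartitions(run_gpu_kernel).collect()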

Then, THIS happens... all 6 tasks are sent to the same executor!

17/02/17 11:09:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/17 11:09:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.128.9.95:34059) with ID 1
17/02/17 11:09:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.128.9.95, executor 1, partition 0, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.128.9.95, executor 1, partition 1, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.128.9.95, executor 1, partition 2, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.128.9.95, executor 1, partition 3, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.128.9.95, executor 1, partition 4, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.128.9.95, executor 1, partition 5, PROCESS_LOCAL, 6095 bytes)

Why? And how can I fix it? The problem is that at this point, all 6 tasks compete for the same GPU, and the GPU cannot be shared.

I tried the suggestions in the comments from Samson Scharfrichter, but they didn't seem to work. However, I found http://spark.apache.org/docs/latest/configuration.html#scheduling , which includes spark.task.cpus. If I set that to 16 and spark.executor.cores to 16, then I appear to get one task assigned to each executor.
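In other words, a configuration along these lines appears to do the trick (values match my 16-core nodes; the same pair of settings could equally go in spark-defaults.conf):

# Setting spark.task.cpus equal to spark.executor.cores makes each task claim all of an
# executor's cores, so the scheduler can only run one task per executor at a time.
conf = (SparkConf()
        .set("spark.executor.cores", "16")
        .set("spark.task.cpus", "16"))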
