
How do I get a Spark job to use all available resources on a Google Cloud DataProc cluster?

For example, I currently have a DataProc cluster consisting of a master and 4 workers; each machine has 8 vCPUs and 30GB of memory.

Whenever I submit a job to the cluster, it commits a maximum of 11GB of memory in total, engages only 2 worker nodes to do the work, and on those nodes uses only 2 of the vCPUs. This makes a job that should take only a few minutes take nearly an hour to execute.

I have tried editing the spark-defaults.conf file on the master node, and have tried running my spark-submit command with the arguments --executor-cores 4 --executor-memory 20g --num-executors 4, but neither has had any effect.
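For context, on Dataproc the same Spark settings can also be passed at job submission time through the gcloud CLI's --properties flag. A minimal sketch, where the cluster name, region, bucket, jar and main class are hypothetical placeholders:

# all names below are hypothetical placeholders
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.executor.cores=4,spark.executor.memory=20g,spark.executor.instances=4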

These clusters will only be spun up to perform a single task and will then be torn down, so the resources do not need to be held for any other jobs.

I managed to resolve my issue by changing the scheduler from FAIR to FIFO, ending up with the following in my create command:

--properties spark:spark.scheduler.mode=FIFO
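For context, a sketch of what the full create command might look like with that property, assuming n1-standard-8 machines (8 vCPUs / 30GB, matching the cluster described above) and a hypothetical cluster name and region:

# cluster name, region and machine types below are assumptions, not taken from the original post
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-8 \
    --worker-machine-type=n1-standard-8 \
    --num-workers=4 \
    --properties=spark:spark.scheduler.mode=FIFO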

You might want to see if what you're looking at is related to Dataproc set number of vcores per executor container - the number of vcores in-use reported by YARN is known to be incorrect, but it's only a cosmetic defect. On a Dataproc cluster with 8-core machines, the default configuration already does set 4 cores per executor; if you click through YARN to the Spark appmaster you should see that Spark is indeed able to pack 4 concurrent tasks per executor.

That part explains what might look like "only using 2 vCPU" per node.

The fact that the job only engages two of the worker nodes hints that there's more to it though; the amount of parallelism you get is related to how well the data is partitioned. If you have input files like gzip files that can't be split, then unfortunately there's not an easy way to increase input parallelism. However, at least in later pipeline stages, or if you do have splittable files, you can increase parallelism by specifying the number of Spark partitions at read time or by calling repartition in your code. Depending on your input size, you could also experiment with decreasing fs.gs.block.size; that defaults to 134217728 (128MB), but you could set it to half or a quarter of that, either by setting it at cluster creation time:

--properties core:fs.gs.block.size=67108864

or at job submission time:

--properties spark.hadoop.fs.gs.block.size=67108864
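Multiple Spark properties can be combined in a single comma-separated --properties flag, so a hedged sketch of a complete submission that pairs the smaller block size with the executor settings mentioned in the question (cluster, region, jar and class names are placeholders) might look like:

# all names below are hypothetical placeholders
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.hadoop.fs.gs.block.size=67108864,spark.executor.cores=4,spark.executor.memory=20g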
