
How do I get a Spark job to use all available resources on a Google Cloud DataProc cluster?

For example, I currently have a DataProc cluster consisting of a master and 4 workers; each machine has 8 vCPUs and 30GB of memory.

Whenever I submit a job to the cluster, it commits a maximum of 11GB of memory in total, engages only 2 worker nodes to do the work, and on those nodes uses only 2 of the vCPUs. This makes a job that should take only a few minutes take nearly an hour to execute.

I have tried editing the spark-defaults.conf file on the master node, and have tried running my spark-submit command with the arguments --executor-cores 4 --executor-memory 20g --num-executors 4, but neither has had any effect.
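For context, on Dataproc the same Spark settings can also be passed at job submission time through the gcloud CLI's --properties flag. A minimal sketch, where the cluster name, region, bucket, jar and main class are hypothetical placeholders:

# all names below are hypothetical placeholders
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.executor.cores=4,spark.executor.memory=20g,spark.executor.instances=4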

These clusters will only be spun up to perform a single task and will then be torn down, so the resources do not need to be held for any other jobs.

I managed to resolve my issue by changing the scheduler from FAIR to FIFO, ending up with the following in my create command:

--properties spark:spark.scheduler.mode=FIFO
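For context, a sketch of what the full create command might look like with that property, assuming n1-standard-8 machines (8 vCPUs / 30GB, matching the cluster described above) and a hypothetical cluster name and region:

# cluster name, region and machine types below are assumptions, not taken from the original post
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-8 \
    --worker-machine-type=n1-standard-8 \
    --num-workers=4 \
    --properties=spark:spark.scheduler.mode=FIFO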

You might want to see if what you're looking at is related to Dataproc set number of vcores per executor container - the number of vcores in-use reported by YARN is known to be incorrect, but it's only a cosmetic defect. On a Dataproc cluster with 8-core machines, the default configuration already does set 4 cores per executor; if you click through YARN to the Spark appmaster you should see that Spark is indeed able to pack 4 concurrent tasks per executor.

That part explains what might look like "only using 2 vCPU" per node.

The fact that the job only engages two of the worker nodes hints that there's more to it though; the amount of parallelism you get is related to how well the data is partitioned. If you have input files like gzip files that can't be split, then unfortunately there's not an easy way to increase input parallelism. However, at least in later pipeline stages, or if you do have splittable files, you can increase parallelism by specifying the number of Spark partitions at read time or by calling repartition in your code. Depending on your input size, you could also experiment with decreasing fs.gs.block.size; that defaults to 134217728 (128MB), but you could set it to half or a quarter of that, either by setting it at cluster creation time:

--properties core:fs.gs.block.size=67108864

or at job submission time:

--properties spark.hadoop.fs.gs.block.size=67108864
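Multiple Spark properties can be combined in a single comma-separated --properties flag, so a hedged sketch of a complete submission that pairs the smaller block size with the executor settings mentioned in the question (cluster, region, jar and class names are placeholders) might look like:

# all names below are hypothetical placeholders
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/my-job.jar \
    --properties=spark.hadoop.fs.gs.block.size=67108864,spark.executor.cores=4,spark.executor.memory=20g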
