
How to calculate the Executor memory, No of executors, No of executor cores and Driver memory to read a file of 40GB using Spark?

Yarn Cluster Configuration:

  • 8 Nodes
  • 8 cores per Node
  • 8 GB RAM per Node
  • 1 TB HardDisk per Node

Executor memory & No of Executors

Executor memory and the number of executors per node are interlinked, so you would first pick either the executor memory or the number of executors, and then set the properties accordingly to get the desired results.

In YARN, these properties determine how many containers (executors, in Spark terms) can be instantiated on a NodeManager, based on the spark.executor.cores and spark.executor.memory property values (along with the executor memory overhead).

For example, take a cluster with 10 nodes (RAM: 16 GB, cores: 6) configured with the following YARN properties:

yarn.scheduler.maximum-allocation-mb=10GB 
yarn.nodemanager.resource.memory-mb=10GB
yarn.scheduler.maximum-allocation-vcores=4
yarn.nodemanager.resource.cpu-vcores=4

Then with the Spark properties spark.executor.cores=2 and spark.executor.memory=4GB you can expect 2 executors per node, so in total you get 19 executors + 1 container for the driver.

If the Spark properties are spark.executor.cores=3 and spark.executor.memory=8GB, then you get 9 executors (only 1 executor per node) + 1 container for the driver (link).
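A minimal Python sketch of the arithmetic above (not part of the original answer; the ~10% memory-overhead factor and the helper name are illustrative assumptions):

def executors_per_node(node_mem_gb, node_vcores,
                       executor_mem_gb, executor_cores,
                       overhead_fraction=0.10):
    """Executors that fit on one NodeManager, limited by memory and vcores."""
    mem_per_executor = executor_mem_gb * (1 + overhead_fraction)  # heap + assumed overhead
    by_memory = int(node_mem_gb // mem_per_executor)
    by_cores = node_vcores // executor_cores
    return min(by_memory, by_cores)

# spark.executor.cores=2, spark.executor.memory=4GB on 10 nodes (10 GB / 4 vcores per NodeManager)
per_node = executors_per_node(10, 4, 4, 2)      # -> 2 executors per node
print(per_node, per_node * 10 - 1)              # 2 per node, 19 executors + 1 driver container

# spark.executor.cores=3, spark.executor.memory=8GB
per_node = executors_per_node(10, 4, 8, 3)      # -> 1 executor per node
print(per_node, per_node * 10 - 1)              # 1 per node, 9 executors + 1 driver container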

Driver memory

spark.driver.memory — Maximum size of each Spark driver's Java heap memory.

spark.yarn.driver.memoryOverhead — Amount of extra off-heap memory that can be requested from YARN, per driver. This, together with spark.driver.memory, is the total memory that YARN can use to create a JVM for a driver process.

Spark driver memory does not impact performance directly, but it ensures that the Spark jobs run without memory constraints at the driver. Adjust the total amount of memory allocated to a Spark driver by using the following formula, assuming the value of yarn.nodemanager.resource.memory-mb is X:

  • 12 GB when X is greater than 50 GB
  • 4 GB when X is between 12 GB and 50 GB
  • 1 GB when X is between 1 GB and 12 GB
  • 256 MB when X is less than 1 GB

These numbers are for the sum of spark.driver.memory and spark.yarn.driver.memoryOverhead. Overhead should be 10-15% of the total.
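As a hedged sketch (not from the original answer), the rule of thumb above together with the 10-15% overhead split can be expressed like this; the helper name and the 10% figure chosen here are assumptions:

def driver_memory_split(node_mem_gb, overhead_fraction=0.10):
    """Pick the total driver allocation from yarn.nodemanager.resource.memory-mb (X, in GB),
    then carve out spark.yarn.driver.memoryOverhead from it."""
    if node_mem_gb > 50:
        total_gb = 12
    elif node_mem_gb >= 12:
        total_gb = 4
    elif node_mem_gb >= 1:
        total_gb = 1
    else:
        total_gb = 0.25                      # 256 MB
    overhead_gb = total_gb * overhead_fraction
    return total_gb - overhead_gb, overhead_gb

heap, overhead = driver_memory_split(16)     # X = 16 GB -> 4 GB total for the driver
print(f"spark.driver.memory={heap:.1f}g, spark.yarn.driver.memoryOverhead={overhead:.1f}g")
# spark.driver.memory=3.6g, spark.yarn.driver.memoryOverhead=0.4g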

You can also follow this Cloudera link for tuning Spark jobs.
