
How to deal with executor memory and driver memory in Spark?

I am confused about dealing with executor memory and driver memory in Spark.

My environment settings are as follows:

  • Memory 128 GB, 16 CPUs, for 9 VMs
  • CentOS
  • Hadoop 2.5.0-cdh5.2.0
  • Spark 1.1.0

Input data information:

  • 3.5 GB data file from HDFS

For simple development, I executed my Python code in standalone cluster mode (8 workers, 20 cores, 45.3 GB memory) with spark-submit. Now I would like to set executor memory or driver memory for performance tuning.
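Both values can be passed as spark-submit flags. A minimal sketch (the master URL, memory sizes, and script name are placeholders, not values from the question):

    spark-submit \
      --master spark://master-host:7077 \
      --driver-memory 2g \
      --executor-memory 4g \
      --total-executor-cores 20 \
      my_script.py

The same settings can go in conf/spark-defaults.conf as spark.driver.memory and spark.executor.memory. Note that driver memory must be fixed before the driver JVM starts, so setting it from inside the application has no effect in client mode.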

From the Spark documentation, the definition of executor memory is:

Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).

How about driver memory?

The memory you need to assign to the driver depends on the job.

If the job is based purely on transformations and terminates on some distributed output action like rdd.saveAsTextFile, rdd.saveToCassandra, etc., then the memory needs of the driver will be very low. A few hundred MB will do. The driver is also responsible for delivering files and collecting metrics, but it is not involved in data processing.
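For instance, a job of this shape never ships data back to the driver (a PySpark sketch; the paths are placeholders):

    from pyspark import SparkContext

    sc = SparkContext(appName="driver-light-job")

    # Everything below runs on the executors; the driver only schedules tasks.
    (sc.textFile("hdfs:///input/data.txt")
       .flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .saveAsTextFile("hdfs:///output/wordcount"))  # distributed write, nothing collected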

If the job requires the driver to participate in the computation, e.g. some ML algorithm that needs to materialize results and broadcast them on the next iteration, then your job becomes dependent on the amount of data passing through the driver. Operations like .collect, .take and .takeSample deliver data to the driver and hence the driver needs enough memory to hold that data.

e.g. if you have an RDD of 3 GB in the cluster and call val myresultArray = rdd.collect, then you will need 3 GB of memory in the driver to hold that data, plus some extra room for the functions mentioned in the first paragraph.
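In PySpark the same distinction looks like this (a sketch; the RDD contents are only illustrative):

    from pyspark import SparkContext

    sc = SparkContext(appName="collect-vs-take")
    rdd = sc.parallelize(range(10 ** 6))

    # collect() pulls the ENTIRE RDD into the driver's heap, so the driver
    # must be sized for the whole dataset (a 3 GB RDD needs >= 3 GB of driver memory).
    everything = rdd.collect()

    # take(n) ships only n elements to the driver, so it is safe with a small driver.
    first_ten = rdd.take(10)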

In a Spark application, the Driver is responsible for task scheduling and the Executors are responsible for executing the concrete tasks in your job.

If you are familiar with MapReduce: your map tasks and reduce tasks are all executed in Executors (in Spark, they are called ShuffleMapTasks and ResultTasks), and likewise, any RDD you want to cache lives in the executor's JVM heap and on its disk.
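A short sketch of caching with spill-to-disk (StorageLevel is part of the PySpark API; the input path is a placeholder):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="cache-demo")
    rdd = sc.textFile("hdfs:///input/data.txt")

    # Cached partitions live in the executors' JVM heaps; partitions that do
    # not fit spill to the executors' local disks, never to the driver.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)
    print(rdd.count())  # the first action materializes the cache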

So I think a few GB will just be OK for your Driver.

Spark required memory = (Driver memory + 384 MB) + (Number of executors * (Executor memory + 384 MB))

Here 384 MB is the maximum memory (overhead) value that may be utilized by Spark when executing jobs.
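As a worked example (hypothetical values, not taken from the question): with 2 GB of driver memory and 8 executors of 4 GB each,

    required = (2048 + 384) + 8 * (4096 + 384)
             = 2432 + 8 * 4480
             = 2432 + 35840
             = 38272 MB  (about 37.4 GB)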
