
Tune Spark: set executors and driver memory for reading a large csv file

I am wondering how to choose the best settings to tune my Spark job. Basically I am just reading a big csv file into a DataFrame and counting some string occurrences.
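For context, a minimal sketch of the kind of job I mean (the path, column name and search string below are just placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("CsvStringCount")
      .getOrCreate()

    // Read the large CSV into a DataFrame (no schema inference, to avoid an extra pass over 500 GB).
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/big_file.csv")          // placeholder path

    // Count rows whose "message" column contains the target string (placeholder column/value).
    val occurrences = df.filter(col("message").contains("ERROR")).count()
    println(s"Occurrences: $occurrences")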

The input file is over 500 GB. The Spark job seems too slow.

Terminal progress bar:

[Stage1:=======>                      (4174 + 50) / 18500]

NumberCompletedTasks: (4174) takes around one hour.

NumberActiveTasks: (50), which I believe I can control with a setting: --conf spark.dynamicAllocation.maxExecutors=50 (tried with different values).

TotalNumberOfTasks: (18500), why is this fixed? What does it mean, is it only related to the file size? Since I am just reading a csv with very little logic, how can I optimize the Spark job?

I also tried changing:

 --executor-memory 10g 
 --driver-memory 12g 

The number of tasks depends on the number of partitions of the source RDD. In your case, since you are reading from HDFS, the block size determines the number of partitions, and therefore the number of tasks does not depend on the number of executors. If you want to increase or decrease the number of tasks, you need to change the number of partitions; in your case that means overriding the HDFS min/max split size configuration at read time. For an existing RDD you can do the same with repartition/coalesce.
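As a sketch, assuming the DataFrame CSV reader from the question: the analogous read-side knob there is spark.sql.files.maxPartitionBytes (default 128 MB), and repartition/coalesce change the partition count of data you already have. The values below are illustrative, not recommendations:

    // Back-of-envelope: read tasks ≈ input size / split size.
    //   500 GB / 128 MB ≈ 4,000 tasks; splits of roughly 27 MB would give ~18,500 tasks,
    //   which is in the ballpark of what you are seeing.
    spark.conf.set("spark.sql.files.maxPartitionBytes", "256m")   // illustrative value; set before the read

    val df = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")
    println(df.rdd.getNumPartitions)   // number of read partitions = number of read tasks

    // For an already-built DataFrame/RDD, repartition (full shuffle) or coalesce (no shuffle):
    val fewer = df.coalesce(1000)      // illustrative target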

Number of partitions = number of tasks. If you have 18500 partitions, then Spark will run 18500 tasks to process them.

Are you just reading the file and filtering it? Do you perform any wide transformations? If you perform a wide transformation, then the number of partitions in the resulting RDD is controlled by the property "spark.sql.shuffle.partitions". If this is set to 18500, then your resulting RDD will have 18500 partitions and, as a result, 18500 tasks.
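Continuing the sketch from the question (200 is Spark's default; the right value depends on your data volume):

    // Only relevant if the job contains a wide transformation (groupBy, join, distinct, ...).
    spark.conf.set("spark.sql.shuffle.partitions", "200")   // illustrative value

    // A wide transformation such as groupBy produces as many partitions as that setting:
    val counts = df.groupBy("message").count()
    println(counts.rdd.getNumPartitions)   // typically equals spark.sql.shuffle.partitions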

Secondly, spark.dynamicAllocation.maxExecutors is the upper bound on the number of executors when dynamic allocation is enabled. From what I can see, you have 5 nodes with 10 executors per node [50 executors in total] and 1 core per executor [if you are running on YARN, 1 core per executor is the default]. That also explains the 50 active tasks in your progress bar: 50 executors × 1 core each = 50 tasks running concurrently.

To run your job faster: if possible, reduce the number of shuffle partitions via the property spark.sql.shuffle.partitions, and increase the number of cores per executor [5 cores per executor is the recommended value].
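Putting this together, one way to sketch those settings in code (values are illustrative; in practice executor resources are usually passed to spark-submit, e.g. --executor-cores 5, rather than set in the application):

    val spark = SparkSession.builder()
      .appName("CsvStringCount")
      .config("spark.executor.cores", "5")                // ~5 cores per executor is the usual recommendation
      .config("spark.executor.memory", "10g")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.maxExecutors", "50")
      .config("spark.sql.shuffle.partitions", "200")      // only matters for wide transformations
      .getOrCreate()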
