
Tune Spark: set executors and driver memory for reading a large CSV file

I am wondering how to choose the best settings to tune my Spark job. Basically I am just reading a big CSV file into a DataFrame and counting some string occurrences.

The input file is over 500 GB. The Spark job seems too slow.

Terminal progress bar:

[Stage1:=======>                      (4174 + 50) / 18500]

NumberCompletedTasks (4174): completing these took around one hour.

NumberActiveTasks (50): I believe I can control this with the setting --conf spark.dynamicAllocation.maxExecutors=50 (I tried different values).

TotalNumberOfTasks (18500): why is this fixed? What does it mean? Is it only related to the file size? Since I am just reading a CSV with very little logic, how can I optimize the Spark job?
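In case it helps, the job is essentially equivalent to this minimal sketch (the input path, column name and search string below are placeholders, not my real values):

 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.functions.col

 val spark = SparkSession.builder().appName("CsvStringCount").getOrCreate()

 // read the big CSV into a DataFrame
 val df = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")

 // count rows whose (placeholder) column contains the target string
 val matches = df.filter(col("someColumn").contains("someString")).count()
 println(s"matches = $matches")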

I also tried changing:

 --executor-memory 10g 
 --driver-memory 12g 

The number of tasks depends on the number of partitions of the source RDD. In your case you are reading from HDFS, so the block size determines the number of partitions, and therefore the number of tasks does not depend on the number of executors. If you want to increase/decrease the number of tasks you need to change the number of partitions: in your case, you would need to override the configured HDFS min/max split size at read time, and for an existing RDD we can do the same with repartition/coalesce.
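A rough sketch of what this looks like in code (the 256 MB value is only illustrative; for DataFrame CSV reads the relevant knob is spark.sql.files.maxPartitionBytes, while sc.textFile-style reads honour the Hadoop split-size settings):

 // For DataFrame file sources, the maximum bytes packed into one partition
 // (and therefore one task) is controlled by spark.sql.files.maxPartitionBytes.
 spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024) // 256 MB, illustrative

 // For RDD-based reads, the Hadoop split size applies instead:
 spark.sparkContext.hadoopConfiguration
   .setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

 val df = spark.read.option("header", "true").csv("hdfs:///data/big_file.csv")

 // For an existing DataFrame, repartition/coalesce change the partition count directly
 val fewer = df.coalesce(2000)     // merges partitions without a full shuffle
 val more  = df.repartition(40000) // full shuffle, increases parallelism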

Number of partitions = number of tasks. If you have 18500 partitions, then Spark will run 18500 tasks to process them.
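You can verify this directly (a quick check, assuming df is the DataFrame you read):

 // number of partitions of the underlying RDD = number of tasks in the read stage
 println(df.rdd.getNumPartitions)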

Are you just reading the file and filtering it? Do you perform any wide transformations? If you perform a wide transformation, then the number of partitions in the resulting RDD is controlled by the property spark.sql.shuffle.partitions. If this is set to 18500, then your resulting RDD will have 18500 partitions and, as a result, 18500 tasks.
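A minimal sketch of how that property is set (200 below is Spark's default; the column name is a placeholder):

 // controls the number of partitions (and tasks) produced after a shuffle
 spark.conf.set("spark.sql.shuffle.partitions", 200L)

 // any wide transformation after this point produces 200 partitions
 val counts = df.groupBy("someColumn").count()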

Secondly, spark.dynamicAllocation.maxExecutors represents the upper bound for the number of executors when dynamic allocation is enabled. From what I can see, you have 5 nodes with 10 executors per node (50 executors in total) and 1 core per executor (if you are running on YARN, 1 core per executor is the default).

To run your job faster: if possible, reduce the number of shuffle partitions via the property spark.sql.shuffle.partitions and increase the number of cores per executor (5 cores per executor is the recommended value).
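Put together, the submission could look roughly like this (only a sketch: the executor count, memory sizes, and shuffle partition value are illustrative and need to be sized for your cluster; the class and jar names are placeholders):

 spark-submit \
   --master yarn \
   --executor-cores 5 \
   --executor-memory 10g \
   --driver-memory 12g \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.maxExecutors=10 \
   --conf spark.sql.shuffle.partitions=200 \
   --class com.example.CsvStringCount \
   my-spark-job.jar

With 5 cores per executor, 10 executors give you the same 50 cores as before, but spread over fewer, larger JVMs.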
