
Long running spark submit job

I am trying to run a script using spark-submit like this:

spark-submit -v \
--master yarn \
--num-executors 80 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 5 \
--class cosineSimillarity jobs-1.0.jar

This script implements the DIMSUM algorithm on 60K records.

Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Unfortunately the job keeps running even after 3 hours. I tried it with 1K records and it completed successfully within 2 minutes.

Can anyone recommend any changes to spark-submit params to make it faster?

Your spark-submit statement suggests that you have at least 80 * 5 = 400 cores, right?

This means you should ensure that you have at least 400 partitions, so that all your cores are kept busy (i.e. each core has at least 1 task to process).

Looking at the code you use, I think you should specify the number of partitions when reading the text file with sc.textFile(); AFAIK it defaults to 2 (see defaultMinPartitions in SparkContext.scala).
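A minimal sketch of the partition math, assuming the flag values from the spark-submit command above (the rule of thumb of 2-3 tasks per core and the 2x multiplier here are my own illustrative choices, not from the question). The Spark calls themselves are shown in comments since they need a live SparkContext:

```scala
// Sketch: sizing partitions for the spark-submit settings above.
// 80 executors * 5 cores each = 400 parallel task slots, but
// sc.textFile(path) without an explicit partition count can fall back
// to defaultMinPartitions (2), leaving almost all cores idle.
val numExecutors  = 80
val executorCores = 5
val taskSlots     = numExecutors * executorCores   // 400

// Rule of thumb (assumption): 2-3 tasks per slot, so faster cores can
// pick up extra work instead of waiting on stragglers.
val targetPartitions = taskSlots * 2               // 800

// In the job itself (illustrative; requires a SparkContext `sc`):
//   val rows = sc.textFile("ratings.csv", targetPartitions)
// or, to rebalance an RDD that already exists:
//   val rebalanced = rows.repartition(targetPartitions)
println(s"slots=$taskSlots, partitions=$targetPartitions")
```

With 400 partitions or more, every core gets at least one task; going somewhat above that usually helps load balancing on skewed data like a similarity join.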
