
Long running spark submit job

I am trying to run a script using spark-submit like this:

spark-submit -v \
--master yarn \
--num-executors 80 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 5 \
--class cosineSimillarity jobs-1.0.jar

This script implements the DIMSUM algorithm on 60K records.

Referring: https://github.com/eBay/Spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

Unfortunately the job keeps running even after 3 hours. I tried it with 1K records and it completed successfully within 2 minutes.

Can anyone recommend any changes to spark-submit params to make it faster?

Your spark-submit statement suggests that you have at least 80 * 5 = 400 cores, right?

This means you should ensure that you have at least 400 partitions, so that all your cores are kept busy (i.e. each core has at least 1 task to process).

Looking at the code you use, I think you should specify the number of partitions when reading the text file with sc.textFile(); AFAIK it defaults to 2 (see defaultMinPartitions in SparkContext.scala).
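A minimal sketch of the partition math, assuming the flag values from the spark-submit command above (the rule of thumb of 2-3 tasks per core and the 2x multiplier here are my own illustrative choices, not from the question). The Spark calls themselves are shown in comments since they need a live SparkContext:

```scala
// Sketch: sizing partitions for the spark-submit settings above.
// 80 executors * 5 cores each = 400 parallel task slots, but
// sc.textFile(path) without an explicit partition count can fall back
// to defaultMinPartitions (2), leaving almost all cores idle.
val numExecutors  = 80
val executorCores = 5
val taskSlots     = numExecutors * executorCores   // 400

// Rule of thumb (assumption): 2-3 tasks per slot, so faster cores can
// pick up extra work instead of waiting on stragglers.
val targetPartitions = taskSlots * 2               // 800

// In the job itself (illustrative; requires a SparkContext `sc`):
//   val rows = sc.textFile("ratings.csv", targetPartitions)
// or, to rebalance an RDD that already exists:
//   val rebalanced = rows.repartition(targetPartitions)
println(s"slots=$taskSlots, partitions=$targetPartitions")
```

With 400 partitions or more, every core gets at least one task; going somewhat above that usually helps load balancing on skewed data like a similarity join.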
