I am trying to run a script using spark-submit like this:
spark-submit -v \
--master yarn \
--num-executors 80 \
--driver-memory 10g \
--executor-memory 10g \
--executor-cores 5 \
--class cosineSimillarity jobs-1.0.jar
This script implements the DIMSUM algorithm on 60K records. Unfortunately it is still running after 3 hours. I tried with 1K records and it completed successfully within 2 minutes.
Can anyone recommend changes to the spark-submit parameters to make it faster?
Your spark-submit statement suggests that you have 80 * 5 = 400 cores, right?
This means you should ensure that you have at least 400 partitions, so that all your cores are kept busy (i.e. each core has at least 1 task to process).
Looking at the code you use, I think you should specify the number of partitions when reading the text file in sc.textFile(); AFAIK it defaults to 2 (see defaultMinPartitions in SparkContext.scala).
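As a rough sketch of what that could look like (assuming your job reads its input with sc.textFile and that a SparkContext named sc is in scope; the path and the 2-tasks-per-core factor below are illustrative assumptions, not from your code):

```scala
// Heuristic: aim for at least (executors * cores-per-executor) partitions;
// a common rule of thumb is 2-3 tasks per core so no core sits idle
// while stragglers finish.
def recommendedPartitions(numExecutors: Int,
                          coresPerExecutor: Int,
                          tasksPerCore: Int = 2): Int =
  numExecutors * coresPerExecutor * tasksPerCore

// For your submit settings (80 executors, 5 cores each):
val numPartitions = recommendedPartitions(numExecutors = 80, coresPerExecutor = 5)

// With a SparkContext `sc` (not constructed here), the read would then be:
//   val rows = sc.textFile("hdfs:///path/to/input", minPartitions = numPartitions)
// or, for an RDD you already have:
//   val repartitioned = rows.repartition(numPartitions)
```

You can verify the effect at runtime with rows.getNumPartitions before kicking off the similarity computation.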