
spark.default.parallelism for parallelize RDD defaults to 2 for spark-submit

I have a Spark standalone cluster with one master and 2 worker nodes, with 4 CPU cores on each worker, i.e. 8 cores in total across all workers.

When running the following via spark-submit (spark.default.parallelism is not set):

val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)

it returns a partition size of 2.

When the same code is run in spark-shell connected to the same standalone cluster, it returns the correct partition size of 8.

What could be the reason?

Thanks.

spark.default.parallelism defaults to the total number of cores across all executor machines. The parallelize API has no parent RDD from which to derive the number of partitions, so it falls back to spark.default.parallelism.

When running spark-submit, you're probably running it locally. Try submitting your application with the same startup configs (in particular the same master URL) that you use for spark-shell.
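
For example, here is a minimal sketch of setting the master (and, optionally, the parallelism) explicitly inside the application; the master URL spark://master:7077 and the value 8 are assumptions, so substitute the values from your own cluster. Passing --master spark://master:7077 on the spark-submit command line achieves the same thing.

import org.apache.spark.{SparkConf, SparkContext}

// Point the application at the same standalone master that spark-shell uses,
// instead of whatever spark-submit falls back to by default.
val conf = new SparkConf()
  .setAppName("ParallelismCheck")
  .setMaster("spark://master:7077")        // assumed master URL - replace with yours
  .set("spark.default.parallelism", "8")   // optional: pin the default explicitly
val sc = new SparkContext(conf)

val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)  // should now report 8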

Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine

Mesos fine grained mode: 8

Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
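
You can also check the value at runtime and override it per call, since parallelize takes the number of partitions as an optional second argument. A minimal sketch, reusing the sc from the snippet above:

println("defaultParallelism - " + sc.defaultParallelism)   // value chosen by the cluster manager
val explicitRDD = sc.parallelize(1 to 100000, 8)           // force 8 partitions regardless of the default
println("Partition size - " + explicitRDD.partitions.size) // prints 8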
