
spark.default.parallelism for parallelize RDD defaults to 2 for spark-submit

I have a Spark standalone cluster with one master and 2 worker nodes, with 4 CPU cores on each worker, i.e. 8 cores in total across all workers.

When running the following via spark-submit (spark.default.parallelism is not set):

val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)

it returns a partition size of 2.

When the same code is run in spark-shell connected to the same standalone cluster, it returns the correct partition size of 8.

What could be the reason?

Thanks.

spark.default.parallelism defaults to the total number of cores across all executor machines. The parallelize API has no parent RDD from which to derive the number of partitions, so it falls back to spark.default.parallelism.

When running spark-submit, you're probably running it locally. Try submitting your application with the same startup configs (in particular the same master URL) that you use for spark-shell.
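
For example, here is a minimal sketch of setting the master (and, optionally, the parallelism) explicitly inside the application; the master URL spark://master:7077 and the value 8 are assumptions, so substitute the values from your own cluster. Passing --master spark://master:7077 on the spark-submit command line achieves the same thing.

import org.apache.spark.{SparkConf, SparkContext}

// Point the application at the same standalone master that spark-shell uses,
// instead of whatever spark-submit falls back to by default.
val conf = new SparkConf()
  .setAppName("ParallelismCheck")
  .setMaster("spark://master:7077")        // assumed master URL - replace with yours
  .set("spark.default.parallelism", "8")   // optional: pin the default explicitly
val sc = new SparkContext(conf)

val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)  // should now report 8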

Pulled this from the documentation:

spark.default.parallelism

For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:

Local mode: number of cores on the local machine

Mesos fine grained mode: 8

Others: total number of cores on all executor nodes or 2, whichever is larger

Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
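
You can also check the value at runtime and override it per call, since parallelize takes the number of partitions as an optional second argument. A minimal sketch, reusing the sc from the snippet above:

println("defaultParallelism - " + sc.defaultParallelism)   // value chosen by the cluster manager
val explicitRDD = sc.parallelize(1 to 100000, 8)           // force 8 partitions regardless of the default
println("Partition size - " + explicitRDD.partitions.size) // prints 8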
