I have a Spark standalone cluster with one master and 2 worker nodes, with 4 CPU cores on each worker, i.e. 8 cores in total across the workers.
When I run the following via spark-submit (spark.default.parallelism is not set):
val myRDD = sc.parallelize(1 to 100000)
println("Partition size - " + myRDD.partitions.size)
val total = myRDD.reduce((x, y) => x + y)
println("Sum - " + total)
It reports 2 for the partition size.
When I run the same code in spark-shell connected to the standalone cluster, it returns the expected partition size of 8.
What could be the reason?
Thanks.
spark.default.parallelism defaults to the total number of cores on all executor machines. The parallelize API has no parent RDD from which to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try launching spark-submit with the same startup configuration as you use for spark-shell, as sketched below.
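For example (a minimal sketch; the master URL spark://master-host:7077, the main class com.example.SumApp and the jar name sum-app.jar are placeholders, not taken from the question):

spark-shell --master spark://master-host:7077
spark-submit --master spark://master-host:7077 --class com.example.SumApp sum-app.jar

With both pointing at the same standalone master, parallelize sees the same total core count and therefore produces the same default number of partitions.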
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
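If you want a fixed partition count regardless of where the job runs, you can also pass it explicitly. A minimal sketch in Scala (the value 8 simply matches the cluster described in the question):

val myRDD = sc.parallelize(1 to 100000, 8)            // explicit numSlices overrides spark.default.parallelism
println("Partition size - " + myRDD.partitions.size)  // prints 8

Alternatively, set the property at submit time, e.g. spark-submit --conf spark.default.parallelism=8 ...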