I need some help understanding how Spark decides the number of partitions and how those partitions are processed by executors. I know this is a repetitive question, but even after reading lots of articles I am still not able to understand it. Below is a real-life use case I am currently working on, along with my spark-submit configuration and cluster configuration.
My Hardware config:
3-node cluster with 30 total vcores and 320 GB total memory.
spark-submit config:
spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--num-executors 1 \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.driver.memory=1000m \
--conf spark.speculation=true
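As a quick sanity check on this configuration, the maximum number of tasks that can run at the same time is the number of executors times the cores per executor. A minimal sketch of that arithmetic (the values are copied from the flags above):

```scala
// Task slots available to the application, per the spark-submit flags above.
val numExecutors  = 1   // --num-executors
val executorCores = 2   // --executor-cores
val taskSlots = numExecutors * executorCores
// taskSlots = 2, so at most 2 tasks (partitions) run concurrently.
```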
I am reading from a MySQL database using the Spark DataFrame JDBC API:
val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url"     -> jdcbUrl,
    "driver"  -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" -> s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t"))
  .load()
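As an aside, a JDBC read with no partitioning options like the one above produces a single partition; to make Spark read the table in parallel you can supply the JDBC partitioning options. A sketch, where the column name "ID" and the bounds are hypothetical placeholders for a numeric column in your own table:

```scala
// Sketch: parallel JDBC read using Spark's partitioning options.
// "ID", the bounds, and numPartitions are assumptions -- replace them
// with a numeric column and value range from your own table.
val parallelJdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url"             -> jdcbUrl,
    "driver"          -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable"         -> s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t",
    "partitionColumn" -> "ID",
    "lowerBound"      -> "0",
    "upperBound"      -> "1000000",
    "numPartitions"   -> "10"))
  .load()

// Inspect how many partitions a DataFrame actually has:
println(parallelJdbcTable.rdd.getNumPartitions)
```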
The total number of partitions reported for the jdbcTable DataFrame is 200.
Questions:
1. How did Spark come up with 200 partitions? Is this a default setting?
2. Since I have only 1 executor, are the 200 partitions processed in parallel in that single executor, or is one partition processed at a time?
3. Are the executor cores used to process the task for each partition with the configured concurrency, i.e. 2 in my case?
If you see exactly 200 partitions, you are looking at the result of a shuffle: 200 is the default value of spark.sql.shuffle.partitions.
Parallelism will depend on the execution plan and the assigned resources. It won't be higher than min(number-of-partitions, number-of-cores).
If there is a single executor, it will be the number of threads (cores) assigned to that executor by the cluster manager.
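To make the shuffle behaviour concrete, here is a minimal sketch, assuming the jdbcTable DataFrame from the question and that COLUMN is groupable: any aggregation or join triggers a shuffle, and the shuffled result gets spark.sql.shuffle.partitions partitions (200 unless you override it).

```scala
// Sketch: a groupBy triggers a shuffle, so the result has
// spark.sql.shuffle.partitions partitions (200 by default).
val counts = jdbcTable.groupBy("COLUMN").count()
println(counts.rdd.getNumPartitions)

// For small jobs it is common to lower the setting before the shuffle:
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
```

With your configuration (1 executor, 2 cores), those 200 shuffle partitions are processed 2 at a time until all of them are done.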