How Apache Spark calculates partitions and how partitions are processed in an executor

I need some help understanding how Spark decides the number of partitions and how they are processed in executors. I am sorry for asking, as I know this is a repetitive question, but even after reading lots of articles I still do not understand it. Below is a real-life use case I am currently working on, along with my spark-submit config and cluster config.

My hardware config:

3-node cluster with total vcores = 30 and total memory = 320 GB.

spark-submit config:

spark-submit \
--verbose \
--master yarn \
--deploy-mode cluster \
--num-executors 1  \
--executor-memory 3g \
--executor-cores 2 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.yarn.am.attemptFailuresValidityInterval=1h \
--conf spark.driver.memory=1000m \
--conf spark.speculation=true \

I am reading from a MySQL database using the Spark DataFrame JDBC API:

val jdbcTable = sqlContext.read.format("jdbc").options(
  Map(
    "url" -> jdcbUrl,
    "driver" -> "net.sourceforge.jtds.jdbc.Driver",
    "dbtable" ->
      s"(SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) as t"))
  .load()

The total number of partitions in the jdbcTable DataFrame is 200.

Questions:

  1. How did Spark come up with 200 partitions? Is this a default setting?

  2. As I have only 1 executor, are the 200 partitions processed in parallel in that single executor, or is one partition processed at a time?

  3. Are the executor cores used to process the tasks for each partition with the configured concurrency, i.e. 2 in my case?

  • As it is written right now, Spark will use only 1 partition for the JDBC read.
  • If you see 200 partitions, it means that:

    • There is a subsequent shuffle (exchange) not shown in the code.
    • You are using the default value of spark.sql.shuffle.partitions, which is 200.
  • Parallelism will depend on the execution plan and the assigned resources. It won't be higher than min(number-partitions, spark-cores). If there is a single executor, the limit is the number of threads assigned to this executor by the cluster manager. With one executor and --executor-cores 2, at most 2 partitions are processed at the same time.
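
To make the min(number-partitions, spark-cores) bound concrete, here is a minimal sketch in plain Scala (no Spark dependency) that plugs in the numbers from the question. The object name TaskSlots and its helper functions are hypothetical, made up for illustration only. The sketch assumes each task needs one core (the default for spark.task.cpus) and ignores the extra task copies that spark.speculation=true may launch, so the "waves" figure is an idealized lower bound rather than what the scheduler will necessarily do.

```scala
// Hypothetical helper, not part of Spark: back-of-the-envelope task slot arithmetic.
object TaskSlots {

  // Upper bound on tasks running at the same time.
  // Each executor offers (coresPerExecutor / cpusPerTask) slots; cpusPerTask
  // corresponds to spark.task.cpus, assumed here to be its default of 1.
  def maxConcurrentTasks(numPartitions: Int,
                         numExecutors: Int,
                         coresPerExecutor: Int,
                         cpusPerTask: Int = 1): Int = {
    val totalSlots = numExecutors * (coresPerExecutor / cpusPerTask)
    math.min(numPartitions, totalSlots)
  }

  // Idealized number of "waves" needed to work through all partitions,
  // i.e. ceil(numPartitions / slots) using integer arithmetic.
  def waves(numPartitions: Int, slots: Int): Int =
    (numPartitions + slots - 1) / slots

  def main(args: Array[String]): Unit = {
    // Values from the question: 200 partitions, --num-executors 1, --executor-cores 2
    val slots = maxConcurrentTasks(numPartitions = 200, numExecutors = 1, coresPerExecutor = 2)
    println(s"tasks running at once: $slots")         // prints "tasks running at once: 2"
    println(s"idealized waves: ${waves(200, slots)}") // prints "idealized waves: 100"
  }
}
```

In other words, with this configuration the 200 post-shuffle partitions are not processed all at once: roughly two tasks run side by side in the single executor, and the rest wait in the queue until a core frees up.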
