
Apache Spark: How partitions are processed in an executor

I have been working with Spark for a while, but some areas are still grey to me. It would be a great help if somebody could deep dive into this.

1) If I have the spark-submit configuration below, and Spark creates around 100 partitions, how are these partitions processed on a single executor: one by one, or in parallel? What would be the case for more than one executor?

--master yarn \
--deploy-mode cluster \
--num-executors 1  \
--executor-memory 3g \
--executor-cores 3 \

2) Can we control partition processing in Spark?

3) I understand that executor cores help parallelize tasks across partitions. If I have a use case with a foreachPartition method where I do some processing of messages (such as computing max and min) and send the result to Kafka, what role do executor cores play in that operation?

  1. The number of executors you have specified is 1 and the executor cores is 3, so only one executor will run, and it will execute at most 3 tasks at the same time. The executor memory sets the heap available to that executor, part of which Spark uses for caching. So out of your 100 partitions, at most 3 can be processed in parallel on that one executor; the remaining partitions queue up and are picked up as cores free up (see the first sketch after this list).

  2. We can use the repartition method to change the number of partitions of an RDD in Spark. reduceByKey and some other shuffle methods also accept the number of partitions for the output RDD as an argument (second sketch below).

  3. I did not exactly understand your last question, but executor cores play the same role there as described above: they decide how many foreachPartition tasks run in parallel on one executor, so up to 3 partitions can be computing their max/min and writing to Kafka at the same time (third sketch below).
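
For point 1, here is a minimal sketch of how the 3-core limit shows up. It uses local[3] to stand in for a single executor with 3 cores (an assumption for easy local testing; on YARN the same behaviour applies per executor). The thread names in the output show that no more than three partition tasks are in flight at once:

import org.apache.spark.sql.SparkSession

object PartitionParallelism {
  def main(args: Array[String]): Unit = {
    // local[3] mimics one executor with 3 cores
    val spark = SparkSession.builder()
      .appName("partition-parallelism")
      .master("local[3]")
      .getOrCreate()

    // 100 partitions, as in the question
    val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

    // Each task prints the thread it runs on; with 3 cores, at most
    // 3 partitions are being processed at any given moment.
    rdd.foreachPartition { iter =>
      println(s"partition of ${iter.size} rows on ${Thread.currentThread().getName}")
      Thread.sleep(200) // slow the tasks down so the overlap is visible
    }

    spark.stop()
  }
}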
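For point 2, a short sketch of the two usual knobs, assuming an active SparkSession named spark: repartition/coalesce on an existing RDD, and the numPartitions argument that shuffle operators such as reduceByKey accept:

val pairs = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val widened  = pairs.repartition(100)   // full shuffle up to 100 partitions
val narrowed = widened.coalesce(10)     // merge down, avoiding a full shuffle

// Shuffle operators take the output partition count directly
val summed = pairs.reduceByKey(_ + _, numPartitions = 8)

println(summed.getNumPartitions)        // 8

Note that repartition always triggers a full shuffle, while coalesce only merges existing partitions, so coalesce is cheaper when reducing the count.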
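For point 3, a hedged sketch of the foreachPartition pattern from the question. The broker address and topic name (broker:9092, stats) are placeholders, not anything from the original post, and error handling is omitted. Each partition task builds its own KafkaProducer on the executor; with --executor-cores 3, up to three of these task bodies (and therefore three producers) run concurrently:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = 100)

numbers.foreachPartition { iter =>
  val values = iter.toVector
  if (values.nonEmpty) {
    // The producer is not serializable, so it must be created here on
    // the executor rather than on the driver.
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092") // placeholder address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // One message per partition carrying its min and max
    producer.send(new ProducerRecord[String, String](
      "stats", s"min=${values.min},max=${values.max}"))
    producer.close()
  }
}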
