I used a Standalone Spark cluster to process several files. When I executed the driver, the data was processed on each worker using its cores.
Now I've read about partitions, but I couldn't tell whether they are different from worker cores or not.
Is there a difference between setting the number of cores and setting the number of partitions?
Simplistic view: Partition vs Number of Cores
When you invoke an action on an RDD, Spark launches a job that is broken into tasks, one per partition.
A partition (or task) refers to a unit of work. If you have a 200 GB Hadoop file loaded as an RDD and chunked into 128 MB blocks (the Spark/HDFS default), this RDD has roughly 1,600 partitions. The number of cores determines how many of those partitions can be processed at any one time; with enough cores, up to ~1,600 tasks (capped at the number of partitions) could execute in parallel for this RDD.
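As a minimal sketch of how the two settings differ, here is a Scala snippet for a standalone cluster. The master URL, the HDFS path, and the core/partition counts are illustrative assumptions, not values from the question:

```scala
import org.apache.spark.sql.SparkSession

object PartitionsVsCores {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionsVsCores")
      .master("spark://master-host:7077")  // assumed standalone master URL
      .config("spark.cores.max", "8")      // total cores the app may use across workers
      .getOrCreate()
    val sc = spark.sparkContext

    // Each 128 MB block of the input becomes one partition of the RDD.
    val rdd = sc.textFile("hdfs:///data/big-file")   // hypothetical path
    println(s"Partitions: ${rdd.getNumPartitions}")  // e.g. ~1,600 for a 200 GB file
    println(s"Default parallelism: ${sc.defaultParallelism}")

    // The partition count can be changed independently of the core count;
    // only 8 of these tasks (spark.cores.max above) run at any one time.
    val repartitioned = rdd.repartition(400)
    println(s"After repartition: ${repartitioned.getNumPartitions}")

    spark.stop()
  }
}
```

So cores bound concurrency (how many tasks run at once), while the partition count bounds the total number of tasks the job is split into.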