
Apache Spark standalone mode: number of cores

I'm trying to understand the basics of Spark internals. The Spark documentation for submitting applications in local mode says this about the spark-submit --master setting:

local[K] Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).

local[*] Run Spark locally with as many worker threads as logical cores on your machine.

Since all the data is stored on a single local machine, it does not benefit from distributed operations on RDDs.

How does it benefit, and what is going on internally when Spark utilizes several logical cores?
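For reference, here is a minimal sketch of the setting in question (the object and app names are illustrative); setting the master programmatically via SparkConf is equivalent to passing --master to spark-submit:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LocalModeExample {
  def main(args: Array[String]): Unit = {
    // Equivalent to `spark-submit --master "local[4]"`:
    // run locally with 4 worker threads.
    val conf = new SparkConf()
      .setMaster("local[4]") // or "local[*]" for one thread per logical core
      .setAppName("LocalModeExample")
    val sc = new SparkContext(conf)

    println(s"defaultParallelism = ${sc.defaultParallelism}") // 4 under local[4]
    sc.stop()
  }
}
```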

Spark will allocate additional worker threads for processing the data in parallel. Despite being limited to a single machine, it can still take advantage of the high degree of parallelism available in modern servers.
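Concretely, each partition of an RDD is processed as a task, and in local[K] up to K tasks run concurrently, each on its own worker thread inside the single JVM. A quick sketch that makes the threads visible (the object name is illustrative; in local mode the println from each task goes to the driver console):

```scala
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object ThreadDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[4]").setAppName("ThreadDemo"))

    // 8 partitions -> 8 tasks; local[4] runs them 4 at a time,
    // each on its own "Executor task launch worker" thread.
    sc.parallelize(1 to 8, numSlices = 8)
      .foreach { _ =>
        println(s"partition ${TaskContext.getPartitionId()} " +
                s"on thread ${Thread.currentThread().getName}")
      }

    sc.stop()
  }
}
```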

If you have a reasonably sized data set, say one with a dozen partitions, you can measure the time it takes to run with local[1] vs. local[n] (where n is the number of cores in your machine). You can also see the difference in how fully your machine is utilized. With only one core designated for use, Spark will use 100% of one core (plus some extra for garbage collection). With 4 cores and local[4], it will use 400% CPU (all 4 cores), and execution time can be significantly shortened (although typically not by a full 4x).
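A rough sketch of that experiment (the partition count and workload are arbitrary; note that only one SparkContext can be active per JVM, so each run is stopped before the next):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TimingDemo {
  def main(args: Array[String]): Unit = {
    for (master <- Seq("local[1]", "local[4]")) {
      val sc = new SparkContext(
        new SparkConf().setMaster(master).setAppName("TimingDemo"))

      val start = System.nanoTime()
      // A CPU-bound job spread over 12 partitions.
      val sum = sc.parallelize(1 to 12, numSlices = 12)
        .map { _ => (1 to 20000000).foldLeft(0L)(_ + _) }
        .reduce(_ + _)
      val secs = (System.nanoTime() - start) / 1e9

      println(f"$master%-9s sum=$sum took $secs%.2f s")
      sc.stop() // release the context before starting the next run
    }
  }
}
```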
