I'm using Spark on EMR. I launch a cluster that is sometimes small (say 5-10 instances, when writing and testing code) and at other times run the same code on a larger cluster, say 30-50 instances.
I know I can read the configuration to help set the number of partitions, and choosing a good number of partitions improves the runtime.
I'd like to parameterize the number of partitions as a function of the number of executors and the number of threads:
val instanceCount = sc.getConf.get("spark.executor.instances").toDouble
val coreCount = sc.getConf.get("spark.executor.cores").toDouble
Has anyone looked into this and can give any advice on a good way to parameterize the number of partitions?
I realize there will not be one good answer, but some functional form with constants would help. For example:
val partitionCount = instanceCount*coreCount*0.7
appears to work well in my use cases. If you describe your use cases (number/range of executors), that would be helpful.
In your answer, if you could indicate the range of instances you're working over, that would be helpful too. If there is a canonical investigation of this somewhere, a pointer to it would also be appreciated.
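For concreteness, the heuristic above can be wrapped in a small helper (the helper name, the default multiplier, and the minimum-of-one fallback are my own illustration):

```scala
// Hypothetical helper: derive a partition count from the executor instance
// count and cores per executor, using the ~0.7 multiplier proposed above.
// Guarantees at least one partition.
def partitionCount(instances: Int, coresPerExecutor: Int, factor: Double = 0.7): Int =
  math.max(1, math.ceil(instances * coresPerExecutor * factor).toInt)

// e.g. on a 30-instance cluster with 4 cores per executor:
// partitionCount(30, 4) == 84
```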
There is no single optimal configuration for all use cases, but I can give you all the heuristics I have gathered over my experience with Spark.
First, let's state the obvious. You need more partitions (= tasks in a given stage) than you have cores, otherwise some cores will sit around doing nothing. Are there exceptions to this rule of thumb? Yes: if you run several jobs in parallel within the same application, idle cores can pick up tasks from another job, which is what spark.scheduler.mode=FAIR is for. I've never tried to submit parallel EMR steps and I don't know if it's possible; it's not a regular YARN concept (but again, you can run jobs in parallel within the same step if you wish).
Small partitions lead to slower jobs because there is a certain amount of communication between the driver and the executors per task, and it amounts to a lot of time if you have 100,000 small tasks. If your tasks complete in under 1 second, your partitions are definitely too small.
Conversely, big partitions are memory hazards, especially during shuffles. Shuffles are very demanding on memory and will have you do a lot of garbage collection. If your partitions are too big, you increase the risk of running out of memory, or at best spend 50%+ of your time in GC. 2 GB in serialized form is the absolute limit for a partition's size, because Spark uses a Java IO utility that is backed by a byte array, which can only hold 2^31 - 1 (the maximum value of an Int) elements.
Generally, I recommend around 25 MB-70 MB per partition in gzipped form if you're doing shuffles (I'm mainly talking about JSON and textual data here).
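This sizing heuristic can also be turned into a partition count directly (a sketch; the helper name and the 50 MB default target are my own, chosen from the middle of the 25-70 MB range above):

```scala
// Sketch: pick a partition count so each partition holds roughly `targetMb`
// of compressed input, per the 25-70 MB gzipped heuristic.
def partitionsForSize(totalMb: Double, targetMb: Double = 50.0): Int =
  math.max(1, math.ceil(totalMb / targetMb).toInt)

// e.g. a 1 TB (compressed) dataset: partitionsForSize(1000000.0) == 20000
```

In practice you would take the max of this size-based count and the core-based count, so that neither the cores nor the memory become the bottleneck.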
If you need to broadcast some objects to all your executors (for example, a Bloom filter for reducing the size of a dataset before shuffling it), the number of containers you want is dictated by the amount of memory you are willing to use on each machine to hold your broadcasts. Indeed, the object is broadcast once per executor, so the amount of broadcast data per machine is object_size * num_executors / num_ec2_instances, assuming a homogeneous cluster. Network cost also increases with the number of containers, since the object needs to be broadcast multiple times to each EC2 instance.
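The per-machine formula above, as a sketch (function name is mine):

```scala
// Memory held by one broadcast object on each machine of a homogeneous
// cluster: the object lives once per executor, and executors are spread
// evenly across the EC2 instances.
def broadcastMbPerMachine(objectMb: Double, numExecutors: Int, numInstances: Int): Double =
  objectMb * numExecutors / numInstances

// e.g. a 500 MB model, 10 executors on 5 instances:
// broadcastMbPerMachine(500.0, 10, 5) == 1000.0 MB per machine
```

This is why fewer, fatter containers are usually cheaper for broadcasts: with one executor per machine, each machine holds the object exactly once.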
However, I've had a case where my broadcast object was a logistic model that used some internal mutable state during classification. This meant that the predict method was synchronized and that all the threads in my container were fighting to acquire this lock. By increasing the number of containers (and thus the memory and network cost of my broadcast), I made the job 4 times faster.
The number of partitions is dictated by the size of the data rather than by the number of cores available. If your data doesn't require more than 200 partitions, then just don't take a cluster larger than 200 cores: you will probably not get any meaningful speed-up from increasing the number of partitions and cores if the partitions are already reasonably well sized.
As long as your data is well-sized and your partitions balanced, the only remaining heuristics are: