
What is a good number of partitions in Spark as a function of the number of executors and threads?

I'm using Spark on EMR. I launch a cluster, and sometimes the cluster is small (when writing/testing code), say 5-10 instances. Other times I execute the same code on a larger cluster, say 30-50 instances.

I know I can access the configuration to help set the number of partitions, and that choosing a good number of partitions improves the runtime.

I'd like to parameterize the number of partitions as a function of the number of executors and the number of threads:

// Read the executor and core counts from the Spark configuration
val instanceCount = sc.getConf.get("spark.executor.instances").toDouble
val coreCount = sc.getConf.get("spark.executor.cores").toDouble

Has anyone looked into this, and can you give any advice on a good way to parameterize the number of partitions?

I realize there will not be one right answer, but some functional form with constants would help. For example:

val partitionCount = instanceCount*coreCount*0.7 

appears to work well in my use cases. If you describe your own use cases (number/range of executors), that would be helpful.

In your answer, if you could indicate the range of instances you're working over, that would be helpful too. If there is a canonical investigation of this somewhere, a pointer to it would be appreciated.
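For concreteness, a minimal sketch of how such a parameterized count might be wired into a job; the fallback defaults, the placeholder input path, and the 0.7 factor are illustrative assumptions, not recommendations:

// Fall back to defaults when the settings are absent (e.g. with dynamic allocation);
// the 0.7 factor is just the constant from the question above.
val instanceCount = sc.getConf.getOption("spark.executor.instances")
  .map(_.toDouble).getOrElse(1.0)
val coreCount = sc.getConf.getOption("spark.executor.cores")
  .map(_.toDouble).getOrElse(sc.defaultParallelism.toDouble)
val partitionCount = math.max(1, (instanceCount * coreCount * 0.7).toInt)

val rdd = sc.textFile("s3://bucket/input")        // placeholder input
val repartitioned = rdd.repartition(partitionCount)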

There is no single optimal configuration for all use cases, but I will give you all the heuristics I have gathered from my experience with Spark.

More partitions than cores

First, let's state the obvious. You need to have more partitions (= tasks in a given stage) than you have cores, otherwise some cores will be sitting around doing nothing. Are there exceptions to this rule of thumb? Yes:

  • you could be running multiple jobs in parallel. Say you have 1000 small datasets and you need to apply some transformations to each of them independently of the others. You probably do not want to split each dataset into 128k partitions; instead, you could run multiple jobs of 128 partitions in parallel to max out your number of cores (see the sketch after this list). Note that I only know how to do that within a single step, or on a custom-managed YARN cluster, by setting spark.scheduler.mode=FAIR. I've never tried to submit parallel EMR steps and I don't know whether that's possible; it's not a regular YARN concept (but again, you can do it within the same step if you wish).
  • your tasks are themselves parallelized. It's absolutely not a regular use case and I don't recommend it in general, but I have had to parallelize some MXNet classification code on Spark. The Java code creates a Python process that uses MXNet to predict and then returns the result to Java. Since MXNet is internally parallel and pretty good at using the cores, I found throughput to be a lot higher by having as many machines as possible (so taking the smallest possible instances) and only two executors (containers) per machine. Each executor was creating a single MXNet process serving 4 Spark tasks (partitions), and that was enough to max out my CPU usage. Without restricting the number of MXNet processes, the CPU was constantly pegged at 100% and wasting huge amounts of time in context switching.
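A minimal sketch of the parallel-jobs idea from the first bullet, assuming spark.scheduler.mode=FAIR is already set in the Spark configuration; the dataset paths, the 128-partition count, and the write step are placeholders:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val paths: Seq[String] = Seq("s3://bucket/ds1", "s3://bucket/ds2")   // placeholder dataset paths

// Submit one small job per dataset from separate threads so they run concurrently
val jobs = paths.map { path =>
  Future {
    spark.read.json(path)
      .repartition(128)                 // small, per-dataset partition count
      .write.parquet(path + "_out")     // placeholder transformation/output
  }
}
Await.result(Future.sequence(jobs), Duration.Inf)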

Good amount of data per partition

Small partitions lead to slower jobs because there is a certain amount of communication overhead between the driver and the executors, and it adds up to a lot of time if you have 100,000 small tasks. If your tasks complete in under 1 second, your partitions are definitely too small.

Conversely, big partitions are memory hazards, especially during shuffles. Shuffles are very demanding on memory and will cause a lot of garbage collection. If your partitions are too big, you will increase the risk of running out of memory, or at best spend 50%+ of your time in GC. 2 GB in serialized form is the absolute limit for a partition's size, because Spark uses a Java IO utility that is backed by a byte array, which can only hold 2^31 - 1 (the maximum value of an Int) elements.

Generally, I recommend around 25-70 MB per partition in gzipped form if you're doing shuffles (mainly talking about JSON and textual data here).
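As an illustration of sizing by data volume rather than core count, here is a rough sketch that derives a partition count from the input size on S3/HDFS; the 64 MB target and the placeholder path are assumptions:

import org.apache.hadoop.fs.Path

// Estimate the total input size, then aim for roughly 64 MB per partition
// (an assumed target within the 25-70 MB range suggested above).
val inputPath = new Path("s3://bucket/input")                 // placeholder path
val fs = inputPath.getFileSystem(sc.hadoopConfiguration)
val totalBytes = fs.getContentSummary(inputPath).getLength
val targetBytesPerPartition = 64L * 1024 * 1024
val partitionCount = math.max(1, (totalBytes / targetBytesPerPartition).toInt)

val sized = sc.textFile(inputPath.toString).repartition(partitionCount)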

Broadcasts

If you need to broadcast some objects to all your executors (for example, a Bloom filter for reducing the size of a dataset before shuffling it), the number of containers you want will be dictated by the amount of memory you are willing to use on each machine to hold your broadcasts. Indeed, the object is broadcast once per executor, so the amount of broadcast data per machine is object_size * num_executors / num_ec2_instances, assuming a homogeneous cluster. Network cost also increases with the number of containers, since the object needs to be sent once per container, i.e. multiple times to each EC2 instance.
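A minimal sketch of that pattern, using a plain Set as a stand-in for a real Bloom filter; the Record schema, the placeholder inputs, and the join that follows are assumptions:

import org.apache.spark.rdd.RDD

case class Record(key: String, payload: String)               // placeholder schema

// Placeholder inputs; in practice these would come from real datasets
val smallRdd: RDD[Record] = sc.parallelize(Seq(Record("a", "x"), Record("b", "y")))
val largeRdd: RDD[Record] = sc.parallelize(Seq(Record("a", "1"), Record("c", "2")))

// A plain Set stands in for a Bloom filter here; it is broadcast once per executor
val relevantKeys: Set[String] = smallRdd.map(_.key).collect().toSet
val keysBc = sc.broadcast(relevantKeys)

// Prune the large dataset before the expensive shuffle/join
val pruned = largeRdd.filter(r => keysBc.value.contains(r.key))
val joined = pruned.keyBy(_.key).join(smallRdd.keyBy(_.key))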

However, I've had the case where my broadcasted object was a logistic model that was using some internal mutable state during classification. This meant that the predict method was synchronized and that all the threads in my container were fighting to access this lock. By increasing the number of containers (and thus the memory and network cost of my broadcast), I made the job 4 times faster.

Summary

The number of partitions is dictated by the size of the data rather than by the number of cores available. If your data doesn't require more than 200 partitions, then just don't take a cluster larger than 200 cores; you will probably not get any meaningful speed-up from increasing the number of partitions and cores if the partitions are already reasonably well sized.

As long as your data is well-sized and your partitions balanced, the only remaining heuristics are:

  • use at least as many partitions as cores
  • run multiple jobs in parallel (if you can) when you want to increase your throughput but your partitions are already well-sized
  • avoid running multi-threaded code within your tasks, but in the rare cases where you need it, consider having fewer partitions than cores, contrary to what we said before.
  • the more containers you have, the more expensive broadcasting becomes (more network activity during the broadcast, and more memory usage until the broadcast is destroyed). If your broadcast object is fully immutable, try having as few containers as possible. If your broadcast object has some internal state and requires locking, having too many threads per container might increase contention and slow you down.
