
PySpark PandasUDF on GCP - Memory Allocation

I am using a pandas udf to train many ML models on GCP in Dataproc (Spark). The main idea is that I have a grouping variable that represents the various sets of data in my data frame and I run something like this:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def test_train(grp_df):
    # train model on grp_df
    # evaluate model
    # return metrics as a pandas DataFrame matching `schema`
    return metrics

result=df.groupBy('group_id').apply(test_train)

This works fine except when I use the non-sampled data, where errors are returned that appear to be related to memory issues. The messages are cryptic (to me), but if I sample the data down it runs; if I don't, it fails. Error messages are things like:

OSError: Read out of bounds (offset = 631044336, size = 69873416) in file of size 573373864

or

Container killed by YARN for exceeding memory limits. 24.5 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

My Question is how to set memory in the cluster to get this to work?

I understand that each group of data and the process being run needs to fit entirely in the memory of the executor. I currently have a 4-worker cluster with the following:

[screenshot of the cluster configuration]

If I estimate that the data in the largest group_id requires 150GB of memory, it seems I really need each machine to operate on one group_id at a time. That way I at least get 4 times the speed compared to having a single worker or VM.

If I do the following, does it in fact create 1 executor per machine that has access to all the cores minus 1 and 180 GB of memory? So that if, in theory, the largest group of data would work on a single VM with this much RAM, this process should work?

spark = SparkSession.builder \
  .appName('test') \
  .config('spark.executor.memory', '180g') \
  .config('spark.executor.cores', '63') \
  .config('spark.executor.instances', '1') \
  .getOrCreate() 

Let's break the answer into 3 parts:

  1. Number of executors
  2. The GroupBy operation
  3. Your executor memory

Number of executors

Straight from the Spark docs:

 spark.executor.instances

 Initial number of executors to run if dynamic allocation is enabled.
 If `--num-executors` (or `spark.executor.instances`) is set and larger
 than this value, it will be used as the initial number of executors.

So, no. You only get a single executor, which won't scale up unless dynamic allocation is enabled.

You can increase the number of executors manually by configuring spark.executor.instances, or set up automatic scale-up based on workload by enabling dynamic executor allocation.

To enable dynamic allocation, you also have to enable the external shuffle service, which allows executors to be removed safely. This is done by setting two configs (see the sketch after this list):

  1. spark.shuffle.service.enabled to true. The default is false.
  2. spark.dynamicAllocation.enabled to true. The default is false.
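
For example, a minimal sketch of setting both at session creation; the min/max executor bounds are placeholder values I am adding for illustration, not something from the original post:

spark = SparkSession.builder \
  .appName('test') \
  .config('spark.shuffle.service.enabled', 'true') \
  .config('spark.dynamicAllocation.enabled', 'true') \
  .config('spark.dynamicAllocation.minExecutors', '1') \
  .config('spark.dynamicAllocation.maxExecutors', '12') \
  .getOrCreate()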

GroupBy

I have observed that groupBy in Spark is done using hash partitioning, which means that with x partitions and more than x unique groupBy values, multiple groups will end up in the same partition.

For example, say two unique values in the groupBy column, a1 and a2, have total row sizes of 100GiB and 150GiB respectively.

If they fall into separate partitions, your application will run fine, since each partition fits into the executor memory (180GiB) required for in-memory processing, and whatever does not fit is spilled to disk. However, if they fall into the same partition, that partition will not fit into the executor memory (180GiB < 250GiB) and you will get an OOM.

In such instances, it's useful to configure spark.default.parallelism to distribute your data over a reasonably larger number of partitions, or to apply salting or other techniques to reduce data skew.
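
As a rough sketch (my addition, not from the original answer): for the DataFrame API, the shuffle triggered by groupBy().apply() is sized by spark.sql.shuffle.partitions (spark.default.parallelism applies to RDD operations), and raising it reduces the chance of two large groups colliding in one partition. The value below is only a placeholder:

# Placeholder value; tune for your data. Each group_id still lands entirely
# inside one partition, but more partitions means fewer groups per partition.
spark.conf.set('spark.sql.shuffle.partitions', '200')

result = df.groupBy('group_id').apply(test_train)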

If your data is not too skewed, you are correct that as long as your executor can handle the largest group, it should work, since your data will be partitioned fairly evenly and the chances of the above happening will be low.

Another point to note is that since you are using groupBy, which requires a data shuffle, you should also turn on the shuffle service. Without the shuffle service, each executor has to serve shuffle requests along with doing its own work.

Executor memory

The total executor memory (the actual executor container size) in Spark is the executor memory allotted to the container plus the allotted memoryOverhead. The memoryOverhead accounts for things like VM overheads, interned strings, other native overheads, etc. So,

Total executor memory = (spark.executor.memory + spark.executor.memoryOverhead)
spark.executor.memoryOverhead = max(executorMemory*0.10, 384 MiB)

Based on this, you can size your executors appropriately for your data. So, when you set spark.executor.memory to 180GiB, the actual executor container launched should be around 198GiB.
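
As an illustration only (my arithmetic, applying the formula above):

executor_memory_gib = 180
overhead_gib = max(executor_memory_gib * 0.10, 384 / 1024)   # 384 MiB floor
total_container_gib = executor_memory_gib + overhead_gib     # 180 + 18 = 198 GiB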

To resolve the YARN overhead issue, you can increase the overhead memory by adding .config('spark.yarn.executor.memoryOverhead', '30g'). For maximum parallelism it is recommended to keep the number of cores per executor at 5, while increasing the number of executors.


spark = SparkSession.builder \
  .appName('test') \
  .config('spark.executor.memory', '18g') \
  .config('spark.executor.cores', '5') \
  .config('spark.executor.instances', '12') \
  .getOrCreate()  

# or use dynamic resource allocation, as in the config below

spark = SparkSession.builder \
  .appName('test') \
  .config('spark.shuffle.service.enabled', 'true') \
  .config('spark.dynamicAllocation.enabled', 'true') \
  .getOrCreate()

I solved the OSError: Read out of bounds error by making the number of groups larger, so each group is smaller:

result=df.groupBy('group_id').apply(test_train)
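
For illustration only, my reading of this answer rather than the poster's actual code: "making the number of groups larger" means grouping on a finer key so that each pandas DataFrame handed to the UDF is smaller. The sub_id column below is hypothetical, and a finer grouping only makes sense if your models are still meaningful at that granularity:

# Hypothetical finer grouping key; each group's rows must still fit in executor memory.
result = df.groupBy('group_id', 'sub_id').apply(test_train)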
