简体   繁体   中英

Shuffle Stage Failing Due To Executor Loss

I get the following error when my spark jobs fails **"org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 21), which maintains the block data to fetch is dead."**

Over view of my spark job

input size is ~35 GB

I have broadcast joined all the smaller tables with the mother table into say a dataframe1 and then i salted each big table and dataframe1 before i join it with dataframe1 (left table).

profile used:

@configure(profile=[
     'EXECUTOR_MEMORY_LARGE',
     'NUM_EXECUTORS_32',
     'DRIVER_MEMORY_LARGE',
     'SHUFFLE_PARTITIONS_LARGE'
])

using the above approach and profiles i was able to get the runtime down by 50% but i still get Shuffle Stage Failing Due To Executor Loss issues.

is there a way i can fix this?

There are multiple things you can try:

  1. Broadcast Joins: If you have used broadcast hints to join multiple smaller tables, then the resulting table (of many smaller tables) might be too huge to be accommodated in each executor memory. So, you need to look at total size of dataframe1.
  2. 35GB is really not huge. Also try the profile "EXECUTOR_CORES_MEDIUM", which really increases the parallelism in data computation. Use Dynamic allocation (16 executors should be fine for 35GB) rather than static allocation. If 32 executors are not available at a time, the build doesn't start. "DRIVER_MEMORY_MEDIUM" should be enough.
  3. Spark 3.0 handles skew joins by itself with Adaptive Query Execution. So, you need not use salting technique. There is a profile called "ADAPTIVE_ENABLED" with foundry that you can use. Other settings of adaptive query execution, you will have to set manually with "ctx" spark context object readily available with Foundry.

Some references for AQE: https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/aqe https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM