
Spark - change parallelism during execution

I have a job divided in two parts:

  • The first part retrieves data from HBase using Spark
  • The second part runs heavy, CPU-intensive ML algorithms

The issue is that with a high number of executors/cores, the HBase cluster is queried too aggressively, which may cause production instability. With too few executors/cores, the ML computations take a long time to complete.

As the number of executors and cores is set at startup, I would like to know if there is a way to decrease the number of executors for the first part of the job.

I would obviously like to avoid running two separate jobs, as Hadoop would require, with mandatory disk serialization between the two steps.

Thanks for your help

I guess dynamic allocation is what you are looking for. This is something you can use with Spark Streaming as well.
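For reference, dynamic allocation is enabled through configuration rather than code changes. Below is a minimal sketch of the relevant settings (spark-shell style, values illustrative and to be tuned to what your HBase cluster and ML stage each tolerate); it assumes Spark 3.x so shuffle tracking can stand in for an external shuffle service:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: let Spark scale executors between a floor and a ceiling.
// The app name and all numeric bounds here are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("hbase-read-then-ml")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Dynamic allocation needs shuffle data to survive executor removal:
  // either an external shuffle service (spark.shuffle.service.enabled=true)
  // or, on Spark 3.x, shuffle tracking as below.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()
```

Note that dynamic allocation scales on task backlog, so it helps release idle executors but does not by itself cap how hard the HBase scan hits the region servers; that is where the partitioning point below comes in.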

I think you may also have to play a little with your RDD size to balance data ingestion and data processing, but depending on what your real use case is, it can be quite challenging.
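To illustrate that balancing point, here is a hedged sketch: keep the ingestion stage at the modest parallelism dictated by the HBase regions, then repartition before the CPU-bound stage so the ML work fans out across all cores. The input path, the partition count of 200, and the runModel function are placeholders I made up, not part of the original question:

```scala
import org.apache.spark.sql.SparkSession

object ReadThenCompute extends Serializable {
  // Placeholder for the CPU-heavy ML step.
  def runModel(row: String): Double = row.length.toDouble

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-then-compute").getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the HBase scan (e.g. newAPIHadoopRDD over TableInputFormat):
    // its partition count follows the number of regions/splits, so only that
    // many tasks query the region servers concurrently.
    val ingested = sc.textFile("hdfs:///tmp/example-input") // hypothetical path

    // Widen parallelism before the CPU-bound stage; 200 is illustrative.
    val widened = ingested.repartition(200)

    val results = widened.map(row => runModel(row))
    println(results.count())

    spark.stop()
  }
}
```

The repartition inserts a shuffle between the two stages, so the narrow ingestion and the wide computation run at different parallelism within a single job, without writing intermediate results to disk yourself.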
