
GCP Dataflow Batch jobs - Preventing workers from running more than one element at a time in a batch job

I am trying to run a batch job in GCP Dataflow. The job itself is very memory intensive at times. At the moment the job keeps crashing, because I believe each worker is trying to process multiple elements of the PCollection at the same time. Is there a way to prevent each worker from running more than one element at a time?

The principle of Beam is to write a description of the processing and let the runtime environment (here, Dataflow) run and distribute it automatically. You can't control what it does under the hood.

However, you can try a few things:

  • Create a window with a trigger that fires after each element in the pane. I don't know whether it will help distribute the processing better in parallel, but you can give it a try (see the windowing sketch after this list).
  • The other solution is to offload the processing to an external service (if possible). Create a Cloud Function or a Cloud Run service (a Cloud Run instance can have up to 16 GB of memory and 4 vCPUs) and, for Cloud Run, set the concurrency to 1 so that each instance processes only one request at a time and the 16 GB is dedicated to a single element (concurrency = 1 is the default behaviour of Cloud Functions). In your Dataflow job, call this external service over HTTP (see the DoFn sketch after this list). Keep in mind that you can have at most 1000 instances in parallel; if your workload requires more, you can get HTTP 429 errors because of the lack of available instances.
  • The last solution is to wait for the new, serverless runtime of Dataflow, which scales CPU and memory automatically without the "worker" concept. It will be fully abstracted, and the promise is that there will be no more out-of-memory crashes. However, I don't know when it is planned.
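
Here is a minimal sketch of the first option, assuming a Beam Python pipeline; the input path and the ProcessHeavyElement DoFn are placeholders for your own source and processing. It puts everything in the global window and fires the trigger after every single element. At best this only hints to the runner to emit elements one by one; Dataflow still decides the actual parallelism.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import trigger, window

    class ProcessHeavyElement(beam.DoFn):
        """Placeholder for the memory-intensive per-element processing."""
        def process(self, element):
            # ... heavy work on a single element ...
            yield element

    def run():
        options = PipelineOptions()  # add your Dataflow options here
        with beam.Pipeline(options=options) as p:
            (
                p
                | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*")  # hypothetical input
                | "WindowPerElement" >> beam.WindowInto(
                    window.GlobalWindows(),
                    trigger=trigger.Repeatedly(trigger.AfterCount(1)),
                    accumulation_mode=trigger.AccumulationMode.DISCARDING,
                )
                | "Process" >> beam.ParDo(ProcessHeavyElement())
            )

    if __name__ == "__main__":
        run()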
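
And a rough sketch of the second option: a DoFn that posts each element to a Cloud Run service deployed with concurrency = 1. The URL is hypothetical, the payload format is up to you, and the snippet assumes the service accepts unauthenticated calls; in practice you would attach an identity token to the request.

    import json
    import apache_beam as beam
    import requests

    class CallCloudRun(beam.DoFn):
        """Offloads the heavy processing of one element to a Cloud Run service."""
        # Hypothetical endpoint; replace with your own Cloud Run service URL.
        SERVICE_URL = "https://my-heavy-service-xxxxx-uc.a.run.app/process"

        def process(self, element):
            response = requests.post(
                self.SERVICE_URL,
                data=json.dumps({"element": element}),
                headers={"Content-Type": "application/json"},
                timeout=900,  # the per-element processing can be long
            )
            response.raise_for_status()  # surfaces HTTP 429 when all instances are busy
            yield response.json()

    # Usage in the pipeline:
    #   | "Offload" >> beam.ParDo(CallCloudRun())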
