
GCP Dataflow Batch jobs - Preventing workers from running more than one element at a time in a batch job

Asked 2021-08-26 14:07:10. Tags: python, google-cloud-platform, dataflow

I am trying to run a batch job in GCP Dataflow. The job itself is very memory intensive at times. At the moment the job keeps crashing, as I believe each worker is trying to run multiple elements of the PCollection at the same time. Is there a way to prevent each worker from running more than one element at a time?

1 answer

The principle of Beam is to write a description of the processing and let the runtime environment (here, Dataflow) run and distribute it automatically. You can't control what it does under the hood.

However, you can try a few different things:

Create a window and trigger it on every single element in the pane. I don't know whether this will help distribute the processing better in parallel, but you can give it a try.
The other solution is to outsource the processing (if possible). Create a Cloud Function or a Cloud Run service (you can have up to 16 GB of memory and 4 CPUs per Cloud Run instance) and, for Cloud Run, set the concurrency to 1, so that each instance processes only one request at a time and its 16 GB are dedicated to a single element. With Cloud Functions, concurrency = 1 is the default behaviour. In your Dataflow job, make an API call to this external service. Note, however, that you can have at most 1000 instances in parallel. If your workload requires more, you may get HTTP 429 errors because of the lack of resources.
The last option is to wait for the new serverless runtime of Dataflow, which scales CPU and memory automatically without the "worker" object. It will be totally abstracted, and the promise is that out-of-memory crashes will no longer occur. However, I don't know when it is planned.
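
For the second option (calling an external service from the Dataflow job), the 429 responses mentioned above are the main thing to plan for. Here is a rough sketch of a per-element call that retries on 429 with exponential backoff, using only the Python standard library. The function name `call_with_backoff`, the JSON request/response format, the retry count and the delays are all my own assumptions, not something Cloud Run or Dataflow prescribes. Authentication is left out entirely.

```python
import json
import time
import urllib.error
import urllib.request


def call_with_backoff(url, payload, max_attempts=5, base_delay=1.0,
                      opener=urllib.request.urlopen, sleep=time.sleep):
    """POST `payload` as JSON to `url`; retry on HTTP 429 with exponential backoff.

    All names and defaults here are hypothetical. `opener` and `sleep` are
    injectable only so the function can be exercised without a real service.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    body = json.dumps(payload).encode("utf-8")
    for attempt in range(max_attempts):
        request = urllib.request.Request(
            url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with opener(request, timeout=300) as response:
                # Assumes the service answers with a JSON document.
                return json.loads(response.read().decode("utf-8"))
        except urllib.error.HTTPError as err:
            # 429 means no instance was free; other codes are real failures.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

Inside the pipeline this could be applied to each element with something like `beam.Map(lambda e: call_with_backoff(SERVICE_URL, e))`, where `SERVICE_URL` is a placeholder for your own service endpoint. The heavy, memory-hungry work then happens in the external instance (one request at a time), and the Dataflow worker only waits for the response.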