
GCP Dataflow Computation Graph and Job Execution

Hi everyone, I have tried hard to understand what happens when I create a custom template in Google Cloud Dataflow, but I could not figure it out from the GCP documentation. Below is what I am trying to achieve (see the pipeline sketch after the list).

  1. Read data from a Google Cloud Storage bucket
  2. Pre-process it
  3. Load deep learning models (1 GB each) and get predictions
  4. Dump the results into BigQuery
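
For readers who want to picture the pipeline shape, here is a minimal sketch of those four steps in the Beam Python SDK. The bucket paths, table name, schema, and the preprocess/predict placeholders are illustrative assumptions, not the asker's actual code; where the models get loaded is covered in the answer further down.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def preprocess(record):
    # Placeholder: whatever parsing / feature extraction the job needs.
    return record


def predict(record):
    # Placeholder: model inference. Where the 1 GB models get loaded is the
    # interesting part -- see the DoFn sketch in the answer below.
    return {"prediction": record}


def run(argv=None):
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        (
            p
            # 1. Read data from a Cloud Storage bucket.
            | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
            # 2. Pre-process it.
            | "PreProcess" >> beam.Map(preprocess)
            # 3. Run the deep-learning models and get predictions.
            | "Predict" >> beam.Map(predict)
            # 4. Dump the results into BigQuery.
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:my_dataset.predictions",
                schema="prediction:STRING",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )


if __name__ == "__main__":
    run()
```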

I successfully created the template and I am able to execute the job, but I have the following questions.

  1. When I execute the job, are the models (5 models, 1 GB each) downloaded every time during execution, or are they loaded and baked into the template (execution graph) so that execution reuses the already-loaded copies?
  2. If the models are loaded only during job execution, does that not impact the execution time, since GBs of model files have to be loaded every time the job is triggered?
  3. Can multiple users trigger the same template at the same time? Since I want to productionize this, I am not sure how it will handle multiple requests at once.

Can anyone please share some information on it?

Sources I referred to without finding the answer:
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
http://alumni.media.mit.edu/~wad/magiceight/isa/node3.html
https://cloud.google.com/dataflow/docs/guides/setting-pipeline-options#configuring-pipelineoptions-for-local-execution
https://beam.apache.org/documentation/basics/
https://beam.apache.org/documentation/runtime/model/
https://mehmandarov.com/apache-beam-pipeline-graph/

This depends on where the models are being loaded from. If they are loaded in the DoFns (most likely), then the loading happens on the workers, during job execution.
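
To make that concrete, below is a minimal sketch (Beam Python SDK) of the common pattern of doing the heavy loading in DoFn.setup(), which runs once per DoFn instance on a worker rather than once per element. The _load_model helper and the gs:// path are placeholder assumptions, not a real implementation.

```python
import apache_beam as beam


class PredictDoFn(beam.DoFn):
    """Loads a large model once per DoFn instance and runs inference per element."""

    def __init__(self, model_path):
        self._model_path = model_path  # e.g. "gs://my-bucket/models/model-1" (placeholder)
        self._model = None

    def setup(self):
        # setup() is called when the DoFn instance is initialized on a worker,
        # so the ~1 GB download and deserialization happen here, not for
        # every element processed.
        self._model = self._load_model(self._model_path)

    def process(self, element):
        yield self._model.predict(element)

    @staticmethod
    def _load_model(path):
        # Placeholder: real code would fetch the file from GCS and load it
        # with the relevant ML framework (TensorFlow, PyTorch, ...).
        raise NotImplementedError
```

Assuming the models are fetched from storage inside the DoFn, the staged (classic) template holds the serialized execution graph and pipeline code rather than the model files, so the model bytes are downloaded at execution time; setup() just ensures that cost is paid once per DoFn instance instead of once per record.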

As for your other question, there should be no issues with multiple users triggering a template job simultaneously; each launch creates a separate Dataflow job with its own workers.
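
As one hedged illustration, a staged (classic) template can be launched programmatically through the Dataflow v1b3 REST API (projects.locations.templates.launch); the sketch below uses google-api-python-client, and the project, region, GCS path, job name, and parameters are all placeholders.

```python
from googleapiclient.discovery import build


def launch_template(project, region, template_gcs_path, job_name, parameters):
    """Launches one Dataflow job from a staged template.

    Each call creates an independent job, so concurrent launches by
    different users do not conflict with each other.
    """
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId=project,
        location=region,
        gcsPath=template_gcs_path,
        body={"jobName": job_name, "parameters": parameters},
    )
    return request.execute()


if __name__ == "__main__":
    response = launch_template(
        project="my-project",
        region="us-central1",
        template_gcs_path="gs://my-bucket/templates/my-template",
        job_name="prediction-job-001",
        parameters={"input": "gs://my-bucket/input/*.csv"},
    )
    print(response)
```

The CLI equivalent is roughly `gcloud dataflow jobs run JOB_NAME --gcs-location gs://<bucket>/<template> --region <region>`; either way, every invocation becomes its own job.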
