
Job graph too large to submit to Google Cloud Dataflow

I am trying to run a job on Dataflow, and whenever I try to submit it to run with the DataflowRunner, I receive the following errors from the service:

{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request payload size exceeds the limit: x bytes.",
    "reason" : "badRequest"
  } ],
  "message" : "Request payload size exceeds the limit: x bytes.",
  "status" : "INVALID_ARGUMENT"
}
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "reason" : "badRequest",
    "debugInfo" : "detail: \"(3754670dbaa1cc6b): CreateJob fails due to Spanner error: New value exceeds the maximum size limit for this column in this database: Jobs.CloudWorkflowJob, size: 17278017, limit: 10485760.\"\n"
  } ],
  "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
  "status" : "INVALID_ARGUMENT"
}

How can I make my job smaller, or increase the job size limit?

There is a workaround for this issue that allows you to increase the size of your job graph to up to 100 MB. You can specify the experiment --experiments=upload_graph.

The experiment activates a new submission path which uploads the job file to GCS and creates the job via an HTTP request that does not contain the job graph itself, but simply a reference to it.
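For example, with the Beam Java SDK the experiment can be passed on the command line as --experiments=upload_graph, or set programmatically on the pipeline options. A minimal sketch (the pipeline itself is hypothetical; only the experiment name comes from the workaround above):

import java.util.Arrays;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UploadGraphExample {
  public static void main(String[] args) {
    // Standard Dataflow options; passing --experiments=upload_graph on the
    // command line ends up in the same experiments list.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    // Enable the experiment programmatically.
    options.setExperiments(Arrays.asList("upload_graph"));

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline here ...
    pipeline.run();
  }
}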

This has the shortcoming that the UI may not be able to show the job properly, as it relies on API requests to share the job.


An extra note: it is still good practice to reduce the size of your job graph.

An important tip: it is sometimes possible to create anonymous DoFns / lambda functions that carry a very large context in their closure, so I recommend looking into any closures in your code and making sure they are not including very large objects within themselves.

Avoiding anonymous lambdas/DoFns may help, as the surrounding context then belongs to the class rather than being pulled into the serialized objects.
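A rough Java sketch of what that looks like (the class and the loadHugeLookup helper below are made up for illustration): an anonymous DoFn defined inside an object keeps a hidden reference to its enclosing instance, so any large fields of that instance get serialized into the job graph, whereas a static, named DoFn does not.

import java.io.Serializable;
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PipelineBuilder implements Serializable {
  // Hypothetical large in-memory lookup table held by the enclosing object.
  private final Map<String, String> hugeLookup = loadHugeLookup();

  public PCollection<String> badTransform(PCollection<String> input) {
    // BAD: the anonymous inner class captures `this`, so hugeLookup is
    // serialized along with the DoFn and bloats the job graph.
    return input.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().toUpperCase());
      }
    }));
  }

  public PCollection<String> goodTransform(PCollection<String> input) {
    // BETTER: a static, named DoFn has no reference to the enclosing object,
    // so only the DoFn itself is serialized.
    return input.apply(ParDo.of(new UpperCaseFn()));
  }

  static class UpperCaseFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element().toUpperCase());
    }
  }

  private static Map<String, String> loadHugeLookup() {
    // Placeholder for loading a large map; the details don't matter here.
    return Map.of();
  }
}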

I tried some of the suggestions above first, but it looks like the upload_graph experiment mentioned in the other answer was removed in later versions.

What solved the issue for me, when adding a map with 90k+ entries to an existing pipeline, was to introduce this map as a MapSideInput. The general approach is documented in this example from the Scio framework I used to interact with Beam:

https://spotify.github.io/scio/examples/RefreshingSideInputExample.scala.html

There was no noticeable performance impact from this approach.
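For reference, a rough equivalent in plain Beam Java (I actually used Scio's MapSideInput as in the linked example; the names enrich, lookupEntries, and the "unknown" default below are hypothetical) passes the large map as a side input via View.asMap() instead of baking it into the serialized DoFn:

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class MapSideInputExample {

  /** Joins each element against a large lookup map supplied as a side input. */
  public static PCollection<String> enrich(
      PCollection<String> input, PCollection<KV<String, String>> lookupEntries) {

    // Materialize the 90k+ entry map as a side input instead of embedding it
    // in the serialized DoFn (and therefore in the job graph).
    final PCollectionView<Map<String, String>> lookupView =
        lookupEntries.apply(View.asMap());

    return input.apply(
        ParDo.of(new DoFn<String, String>() {
          // The anonymous DoFn sits in a static method, so it only captures
          // the lookupView reference, not a large enclosing object.
          @ProcessElement
          public void processElement(ProcessContext c) {
            Map<String, String> lookup = c.sideInput(lookupView);
            c.output(lookup.getOrDefault(c.element(), "unknown"));
          }
        }).withSideInputs(lookupView));
  }
}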
