
Job graph too large to submit to Google Cloud Dataflow

I am trying to run a job on Dataflow, and whenever I try to submit it to run with the DataflowRunner, I receive the following errors from the service:

{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request payload size exceeds the limit: x bytes.",
    "reason" : "badRequest"
  } ],
  "message" : "Request payload size exceeds the limit: x bytes.",
  "status" : "INVALID_ARGUMENT"
}
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "reason" : "badRequest",
    "debugInfo" : "detail: \"(3754670dbaa1cc6b): CreateJob fails due to Spanner error: New value exceeds the maximum size limit for this column in this database: Jobs.CloudWorkflowJob, size: 17278017, limit: 10485760.\"\n"
  } ],
  "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
  "status" : "INVALID_ARGUMENT"
}

How can I make my job smaller, or increase the job size limit?

There is a workaround for this issue that allows you to increase the size of your job graph to up to 100 MB. You can specify the experiment --experiments=upload_graph.

The experiment activates a new submission path which uploads the job file to GCS and creates the job via an HTTP request that does not contain the job graph itself, but simply a reference to it.
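For example, with the Beam Java SDK the experiment can be passed on the command line as --experiments=upload_graph, or set programmatically on the pipeline options. A minimal sketch (the pipeline itself is hypothetical; only the experiment name comes from the workaround above):

import java.util.Arrays;

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class UploadGraphExample {
  public static void main(String[] args) {
    // Standard Dataflow options; passing --experiments=upload_graph on the
    // command line ends up in the same experiments list.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);

    // Enable the experiment programmatically.
    options.setExperiments(Arrays.asList("upload_graph"));

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline here ...
    pipeline.run();
  }
}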

This has the shortcoming that the UI may not be able to show the job properly, as it relies on API requests to share the job.


An extra note: it is still good practice to reduce the size of your job graph.

An important tip: it is sometimes possible to create anonymous DoFns / lambda functions that carry a very large context in their closure, so I recommend looking into any closures in your code and making sure they are not including very large objects within themselves.

Avoiding anonymous lambdas/DoFns may help, as the surrounding context then belongs to the class rather than being pulled into the serialized objects.
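A rough Java sketch of what that looks like (the class and the loadHugeLookup helper below are made up for illustration): an anonymous DoFn defined inside an object keeps a hidden reference to its enclosing instance, so any large fields of that instance get serialized into the job graph, whereas a static, named DoFn does not.

import java.io.Serializable;
import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class PipelineBuilder implements Serializable {
  // Hypothetical large in-memory lookup table held by the enclosing object.
  private final Map<String, String> hugeLookup = loadHugeLookup();

  public PCollection<String> badTransform(PCollection<String> input) {
    // BAD: the anonymous inner class captures `this`, so hugeLookup is
    // serialized along with the DoFn and bloats the job graph.
    return input.apply(ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().toUpperCase());
      }
    }));
  }

  public PCollection<String> goodTransform(PCollection<String> input) {
    // BETTER: a static, named DoFn has no reference to the enclosing object,
    // so only the DoFn itself is serialized.
    return input.apply(ParDo.of(new UpperCaseFn()));
  }

  static class UpperCaseFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(c.element().toUpperCase());
    }
  }

  private static Map<String, String> loadHugeLookup() {
    // Placeholder for loading a large map; the details don't matter here.
    return Map.of();
  }
}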

I tried some of the suggestions above first, but it looks like the upload_graph experiment mentioned in the other answer was removed in later versions.

What solved the issue for me, when adding a map with 90k+ entries to an existing pipeline, was to introduce this map as a MapSideInput. The general approach is documented in this example from the Scio framework I used to interact with Beam:

https://spotify.github.io/scio/examples/RefreshingSideInputExample.scala.html

There was no noticeable performance impact from this approach.
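For reference, a rough equivalent in plain Beam Java (I actually used Scio's MapSideInput as in the linked example; the names enrich, lookupEntries, and the "unknown" default below are hypothetical) passes the large map as a side input via View.asMap() instead of baking it into the serialized DoFn:

import java.util.Map;

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

public class MapSideInputExample {

  /** Joins each element against a large lookup map supplied as a side input. */
  public static PCollection<String> enrich(
      PCollection<String> input, PCollection<KV<String, String>> lookupEntries) {

    // Materialize the 90k+ entry map as a side input instead of embedding it
    // in the serialized DoFn (and therefore in the job graph).
    final PCollectionView<Map<String, String>> lookupView =
        lookupEntries.apply(View.asMap());

    return input.apply(
        ParDo.of(new DoFn<String, String>() {
          // The anonymous DoFn sits in a static method, so it only captures
          // the lookupView reference, not a large enclosing object.
          @ProcessElement
          public void processElement(ProcessContext c) {
            Map<String, String> lookup = c.sideInput(lookupView);
            c.output(lookup.getOrDefault(c.element(), "unknown"));
          }
        }).withSideInputs(lookupView));
  }
}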
