Job graph too large to submit to Google Cloud Dataflow

Question

I am trying to run a job on Dataflow. Whenever I submit it with the DataflowRunner, the service returns the following errors:

{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "Request payload size exceeds the limit: x bytes.",
    "reason" : "badRequest"
  } ],
  "message" : "Request payload size exceeds the limit: x bytes.",
  "status" : "INVALID_ARGUMENT"
}
Caused by: com.google.api.client.googleapis.json.GoogleJsonResponseException: 400 Bad Request
{
  "code" : 400,
  "errors" : [ {
    "domain" : "global",
    "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
    "reason" : "badRequest",
    "debugInfo" : "detail: \"(3754670dbaa1cc6b): CreateJob fails due to Spanner error: New value exceeds the maximum size limit for this column in this database: Jobs.CloudWorkflowJob, size: 17278017, limit: 10485760.\"\n"
  } ],
  "message" : "(3754670dbaa1cc6b): The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.",
  "status" : "INVALID_ARGUMENT"
}

How can I make my job smaller, or raise the job size limit?

Answer 1

There is a workaround for this issue that lets you increase the job graph size to at most 100 MB. Specify this experiment: --experiments=upload_graph

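The experiment activates a new submission path that uploads the job file to GCS and creates the job through an HTTP request that does not contain the job graph, only a reference to it.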

The downside is that the UI may not display the job correctly, because it relies on the API request to share the job.
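As a minimal sketch of where the flag goes, the helper below appends it to the argument array of a Java launcher. The class and method names are made up for illustration, and the code deliberately avoids any Beam dependency. Whether the service still honors the flag depends on the SDK version you are running.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Beam SDK.
final class UploadGraphArgs {
    // The experiment flag exactly as quoted in this answer.
    static final String FLAG = "--experiments=upload_graph";

    // Returns a copy of args that contains the flag exactly once.
    static String[] withUploadGraph(String[] args) {
        if (Arrays.asList(args).contains(FLAG)) {
            return args.clone();
        }
        String[] out = Arrays.copyOf(args, args.length + 1);
        out[args.length] = FLAG;
        return out;
    }

    public static void main(String[] args) {
        String[] submitArgs = withUploadGraph(args);
        // With the Beam Java SDK on the classpath, this array would typically be
        // passed to PipelineOptionsFactory.fromArgs(submitArgs) before the
        // Pipeline is created.
        System.out.println(String.join(" ", submitArgs));
    }
}
```

If you launch the pipeline from a shell, passing the flag directly on the command line has the same effect and needs no helper.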

Additional note: reducing the size of the job graph is still good practice.

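One important tip: anonymous DoFns and lambda functions can sometimes end up with a very large context in their closures, so I recommend reviewing every closure in your code and making sure none of them captures a very large context.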

Avoiding anonymous lambdas/DoFns may help, because the context would then be part of the class rather than of the serialized object.
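To make the closure point concrete, here is a rough, Beam-free sketch using plain Java serialization. SerFn is a hypothetical stand-in for a serializable function type, and the 8 MB array is an arbitrary stand-in for large state on the enclosing object. The anonymous class reads an instance field, so it keeps a hidden reference to the enclosing object and drags the whole array into its serialized form. The static nested class carries only the one field it needs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

final class ClosureSizeDemo implements Serializable {
    // Hypothetical stand-in for a serializable function type; not a Beam class.
    interface SerFn extends Function<String, String>, Serializable {}

    // Large state on the enclosing object (size chosen arbitrarily).
    private final byte[] bigLookup = new byte[8 * 1024 * 1024];
    private final String suffix;

    ClosureSizeDemo(String suffix) {
        this.suffix = suffix;
    }

    // Anonymous class: reading `suffix` goes through the enclosing instance,
    // so the enclosing object (including bigLookup) is serialized with it.
    SerFn anonymous() {
        return new SerFn() {
            @Override
            public String apply(String s) {
                return s + suffix;
            }
        };
    }

    // Static nested class: no enclosing instance, only its own field is serialized.
    static final class AppendSuffix implements SerFn {
        private final String suffix;

        AppendSuffix(String suffix) {
            this.suffix = suffix;
        }

        @Override
        public String apply(String s) {
            return s + suffix;
        }
    }

    SerFn standalone() {
        return new AppendSuffix(suffix);
    }

    // Number of bytes produced by default Java serialization of obj.
    static int serializedSize(Serializable obj) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClosureSizeDemo demo = new ClosureSizeDemo("-ok");
        System.out.println("anonymous:  " + serializedSize(demo.anonymous()) + " bytes");
        System.out.println("standalone: " + serializedSize(demo.standalone()) + " bytes");
    }
}
```

Both functions behave identically, but the anonymous one serializes to more than 8 MB while the standalone one stays tiny. Measuring your own DoFns this way is one quick check for which of them inflate the job graph.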

Answer 2

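I first tried some of the suggestions above, but it looks like the upload_graph experiment mentioned in the other answer was removed in later versions.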

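What solved the problem for me, when adding a map with 90k+ entries to an existing pipeline, was introducing that map as a MapSideInput. The general approach is documented in this example from the Scio framework, which I use to interact with Beam: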

https://spotify.github.io/scio/examples/RefreshingSideInputExample.scala.html

This approach had no noticeable performance impact.
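For readers on the plain Beam Java SDK rather than Scio, a rough equivalent might look like the sketch below. It is not the code from the linked Scio example. The bucket path, the "key,value" line format, and the sample input elements are all assumptions for illustration. The point is that the lookup data is read at run time and handed to the DoFn as a map-valued side input, instead of being captured in a closure and embedded in the job graph.

```java
import java.util.Map;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.apache.beam.sdk.values.TypeDescriptors;

final class MapSideInputSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical location of the lookup data, one "key,value" pair per line.
        String lookupPath = "gs://my-bucket/lookup.csv";

        // Read the lookup data inside the pipeline and expose it as a Map view.
        PCollectionView<Map<String, String>> lookupView =
            p.apply("ReadLookup", TextIO.read().from(lookupPath))
                .apply("ToKV", MapElements
                    .into(TypeDescriptors.kvs(
                        TypeDescriptors.strings(), TypeDescriptors.strings()))
                    .via((String line) -> {
                        String[] parts = line.split(",", 2);
                        return KV.of(parts[0], parts[1]);
                    }))
                .apply("AsMap", View.asMap());

        // Placeholder main input; a real pipeline would read from its own source.
        PCollection<String> input = p.apply("Input", Create.of("a", "b"));

        input.apply("Enrich", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                // The map is delivered by the runner on the worker.
                Map<String, String> lookup = c.sideInput(lookupView);
                c.output(c.element() + "=" + lookup.getOrDefault(c.element(), "?"));
            }
        }).withSideInputs(lookupView));

        p.run();
    }
}
```

Note that building the view from an in-memory map with Create.of would put the entries back into the graph, which is why this sketch reads them from storage.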

Answer 1
0, accepted, 2021-03-01 22:38:49

Answer 2
0, 2022-09-28 22:05:44