
Google Cloud Dataflow fails in combine function due to worker losing contact

My Dataflow job consistently fails in my combine function, with no errors reported in the logs beyond a single entry of:

 A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service.

I am using the Apache Beam Python SDK 2.4.0. I have tried performing this step with both CombinePerKey and CombineGlobally. The pipeline failed in the combine function in both cases. The pipeline completes when running with a smaller amount of data.

Am I exhausting worker resources and not being told about it? What can cause a worker to lose contact with the service?

Update:

Using n1-highmem-4 workers gives me the same failure. When I check Stackdriver I see no errors, but three kinds of warnings: No session file found, Refusing to split, and Processing lull. My input collection size says it's 17,000 elements spread across ~60 MB, but Stackdriver has a statement saying I'm using ~25 GB on a single worker, which is getting towards the max. For this input, each accumulator created in my CombineFn should take roughly 150 MB of memory. Is my pipeline creating too many accumulators and exhausting its memory? If so, how can I tell it to merge accumulators more often or limit the number created?
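To make the accumulator question concrete, here is a minimal sketch of the shape of my combiner (the matrix size, the add_input logic, and the placeholder names are illustrative, not my real code); my question is really about how often Beam calls merge_accumulators on these partial results:

import apache_beam as beam
import numpy as np

# Placeholder size; in my real combiner each accumulator ends up around 150 MB.
N = 100

class MatrixCombiner(beam.CombineFn):
    """Sketch of the combiner used in the pipeline below, not the real code."""

    def create_accumulator(self):
        # Beam may create one accumulator per key per bundle, so many can be live at once.
        return np.zeros((N, N))

    def add_input(self, accumulator, row):
        # Placeholder logic: fold one keyed row into the partial matrix.
        accumulator[hash(row['key']) % N] += 1
        return accumulator

    def merge_accumulators(self, accumulators):
        # Beam decides when partial matrices from different bundles/workers
        # get merged; this is the step I would like to happen more often.
        return sum(accumulators)

    def extract_output(self, accumulator):
        return accumulator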

I do have an error log entry verifying my worker was killed due to OOM. It just isn't tagged as a worker error, which is the default filtering for the Dataflow monitor.
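For reference, the entry only turned up after I widened the Stackdriver log filter beyond the default worker-error view; a filter along these lines surfaced it (treat the quoted strings and the job id as placeholders, since the exact OOM message text varies):

resource.type="dataflow_step"
resource.labels.job_id="<my job id>"
("out of memory" OR "oom" OR "killed")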

The pipeline definition looks something like:

table1 = (p | "Read Table1" >> beam.io.Read(beam.io.BigQuerySource(query=query))
     | "Key table1 rows" >> beam.Map(lambda row: (row['key'], row)))
table2 = (p | "Read Table2" >> beam.io.Read(beam.io.BigQuerySource(query=query))
     | "Key table2 rows" >> beam.Map(lambda row: (row['key'], row)))

merged = ({"table1": table1, "table2": table2}
     | "Join" >> beam.CoGroupByKey()
     | "Reshape" >> beam.ParDo(ReshapeData())
     | "Key merged rows" >> beam.Map(lambda row: (row['key'], row))
     | "Build matrix" >> beam.CombinePerKey(MatrixCombiner())  # Dies here
     | "Write matrix" >> beam.io.avroio.WriteToAvro())

Running with fewer workers leads to fewer accumulators and successful completion of the pipeline.
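One way to pin the worker count down is through the standard Dataflow pipeline options; a sketch of the kind of invocation I mean (the script name, project, bucket, and counts are placeholders, not a recommendation):

python run_pipeline.py \
    --runner DataflowRunner \
    --project my-project \
    --temp_location gs://my-bucket/tmp \
    --autoscaling_algorithm NONE \
    --num_workers 2 \
    --worker_machine_type n1-highmem-4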
