Dataflow Pub/Sub streaming jobs get stuck and keep modifying ACK deadline of messages
We've been using Dataflow with the Python Beam SDK (2.34.0 and 2.35.0) for two different streaming jobs, both reading from Pub/Sub topic(s). One of those jobs windows input messages and groups them before writing them to Cloud Storage. The other does not apply any windowing (but makes use of timers). These jobs currently process very few messages (about 1 message per second) and have a single worker each.
In the past few days, both of these jobs have been getting stuck about once a day, causing outages in our system. The symptoms are:
Subsequent steps (after ReadFromPubSub) don't process any message, and Dataflow does not trigger any autoscaling in response to the messages stacking up in the subscription. This is probably because the CPU usage of the VMs stays low (messages are no longer being processed anyway). Also, from what I understand, messages are only ACK'ed once they've been safely committed at the end of a stage (both our jobs contain two stages). But Dataflow might actually still be reading the messages, as it keeps pushing back their ACK deadline. However, those messages never leave the ReadFromPubSub step.
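That lease-renewal behaviour can be sketched as follows. The runner keeps sending modifyAckDeadline for a message until the stage that owns it commits, at which point the message is finally ACKed; the 10 s deadline and 2 s renewal margin here are illustrative, not Dataflow's actual values:

```python
def lease_extensions(t_read, t_commit, deadline=10, margin=2):
    """Return the times at which the runner sends modifyAckDeadline so a
    message read at t_read is never redelivered before its stage commits
    at t_commit (when it is finally ACKed).  If the stage is stuck,
    t_commit never arrives and the extensions continue indefinitely:
    the ACK deadline keeps being pushed while the message never leaves
    ReadFromPubSub, which matches the observed symptom."""
    times, t = [], t_read
    while t + deadline < t_commit:
        t += deadline - margin  # renew shortly before the lease expires
        times.append(t)
    return times


# A stage committing at t=25 needs two renewals (at t=8 and t=16);
# a stage committing within the initial deadline needs none.
renewals = lease_extensions(0, 25)
```

This is why the Pub/Sub metrics show a growing backlog but no redeliveries: the leases are healthy even though the pipeline is not making progress.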
Manually deleting the worker VM (from the Compute Engine console) triggers the automatic recreation of a VM and unsticks the job. Consumption of the Pub/Sub subscription resumes and everything returns to normal.
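Until a root cause is found, one pragmatic mitigation is to automate that workaround: poll the subscription's oldest-unacked-message age (Cloud Monitoring exposes it as `pubsub.googleapis.com/subscription/oldest_unacked_message_age`) and recreate the worker when the job looks stuck. A minimal sketch of the detection heuristic, with illustrative threshold and sample-count values:

```python
def looks_stuck(ages, threshold=300, samples=3):
    """Flag the job as stuck when the last `samples` readings of the
    subscription's oldest-unacked-message age (in seconds) are all above
    `threshold` and strictly increasing, i.e. nothing is being ACKed.
    Both parameters are illustrative and should be tuned per job."""
    tail = ages[-samples:]
    return (len(tail) == samples
            and all(a > threshold for a in tail)
            and all(b > a for a, b in zip(tail, tail[1:])))


# Healthy: the age bounces around near zero as messages are ACKed.
healthy = looks_stuck([2, 40, 3, 15, 1])
# Stuck: the age climbs steadily past the threshold.
stuck = looks_stuck([200, 310, 370, 430])
```

The recovery action itself can then be the same as the manual one, e.g. a `gcloud compute instances delete` of the worker VM, which Dataflow recreates automatically.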
How can we solve this problem and ensure our jobs don't get stuck? We're out of ideas, as Dataflow produces only a few logs and our business code does not seem to be responsible for this behaviour.
(1) Dataflow making requests to extend the messages' ACK deadline.
(2) Error logs. ReadStream-process can also be MergeBuckets-process. NameOfAGroupByKeyStep is the step at which execution passes from stage 1 to stage 2.
Stuck state: workflow-msec-<NameOfAGroupByKeyStep>/ReadStream-process, reporter: 0x4566bdcd4aa8, with stack:
--- Thread (name: futex-default-SDomainT/132) stack: ---
PC: @ 0x55f23554d580 thread::(anonymous namespace)::FutexDomain::RawBlock()
@ 0x55f23554d580 thread::(anonymous namespace)::FutexDomain::RawBlock()
@ 0x55f23554cbbe thread::(anonymous namespace)::FutexDomain::BlockCurrent()
@ 0x55f2356e7aac base::scheduling::Downcalls::UserSchedule()
@ 0x55f2356e689e AbslInternalPerThreadSemWait
@ 0x55f235749e84 absl::CondVar::WaitCommon()
@ 0x55f23554c221 thread::SelectUntil()
@ 0x55f2345be1cb dist_proc::dax::workflow::(anonymous namespace)::BatchingWindmillGetDataClient::GetData()
@ 0x55f2345ac148 dist_proc::dax::workflow::StreamingRpcWindmillServiceStreamingServer::GetData()
@ 0x55f234ae9f85 dist_proc::dax::workflow::WindmillServiceStreamingServerProxy::GetData()
@ 0x55f234945ad3 dist_proc::dax::workflow::StateManager::PrefetchAll()
@ 0x55f23494521b dist_proc::dax::workflow::StateManager::ReadTag()
@ 0x55f23493c3d6 dist_proc::dax::workflow::WindmillWindowingAPIDelegate::ReadKeyedStateImplVirtual()
@ 0x55f2349420ed dist_proc::dax::workflow::WindowingAPIDelegate::ReadKeyedStateImpl()
@ 0x55f234941fd2 dist_proc::dax::workflow::WindmillCacheAccess::ReadKeyedStateImpl()
@ 0x55f2346e6ec8 dist_proc::dax::workflow::CacheAccess::ReadStateFromCache<>()::{lambda()#1}::operator()()
@ 0x55f2346e6e8e absl::functional_internal::InvokeObject<>()
@ 0x55f234942912 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x55f2349c5927 dist_proc::dax::workflow::StateObjectsCache::ReadImpl()
@ 0x55f2349c56f5 dist_proc::dax::workflow::StateObjectsCache::Read()
(3) Info-level logs preceding the outages, about networking issues:
I0128 16:07:09.289409461 166 subchannel.cc:945] subchannel 0x473cbc81a000 {address=ipv4:74.125.133.95:443, args=grpc.client_channel_factory=0x473cbfcb4690, grpc.default_authority=europe-west1-dataflowstreaming-pa.googleapis.com, grpc.dns_enable_srv_queries=1, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x473cbf494f78, grpc.internal.security_connector=0x473cbb5f0230, grpc.internal.subchannel_pool=0x473cbf766870, grpc.keepalive_permit_without_calls=1, grpc.keepalive_time_ms=60000, grpc.keepalive_timeout_ms=60000, grpc.max_metadata_size=1048576, grpc.max_receive_message_length=-1, grpc.primary_user_agent=grpc-c++/1.44.0-dev, grpc.resource_quota=0x473cbf752ca8, grpc.server_uri=dns:///europe-west1-dataflowstreaming-pa.googleapis.com}: connect failed: {"created":"@1643386029.289272376","description":"Failed to connect to remote host: FD shutdown","file":"third_party/grpc/src/core/lib/iomgr/ev_poll_posix.cc","file_line":500,"grpc_status":14,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1643386029.289234760","description":"connect() timed out","file":"third_party/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":114}],"target_address":"ipv4:74.125.133.95:443"}
(4) Dataflow correctly detecting the increasing system lag.
After contacting Google's support team, we never got a clear answer as to what the problem was, but it stopped occurring. We concluded it was an internal error that was eventually fixed by the Dataflow team.