Dataflow Pub/Sub streaming jobs get stuck and keep modifying ACK deadline of messages
We've been using Dataflow with the Python Beam SDK (2.34.0 and 2.35.0) for two different streaming jobs, both reading from Pub/Sub topic(s). One of those jobs windows input messages and groups them before writing them to Cloud Storage. The other does not apply any windowing (but makes use of timers). These jobs currently process very few messages (about 1 message per second) and have a single worker each.
In the past few days, both of these jobs have been getting stuck about once a day, causing outages in our system. The symptoms are:
Subsequent steps (after ReadFromPubSub) don't process any message, and Dataflow does not trigger any autoscaling in response to the messages stacking up in the subscription. This is probably because the CPU usage of the VMs stays low (messages are no longer being processed anyway). Also, from what I understand, messages are only ACK'ed once they've been safely committed at the end of a stage (both our jobs contain two stages). But Dataflow might actually still be reading the messages, as it keeps pushing back their ACK deadline. However, those messages never leave the ReadFromPubSub step.
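That lease-renewal behaviour can be sketched as follows. The runner keeps sending modifyAckDeadline for a message until the stage that owns it commits, at which point the message is finally ACKed; the 10 s deadline and 2 s renewal margin here are illustrative, not Dataflow's actual values:

```python
def lease_extensions(t_read, t_commit, deadline=10, margin=2):
    """Return the times at which the runner sends modifyAckDeadline so a
    message read at t_read is never redelivered before its stage commits
    at t_commit (when it is finally ACKed).  If the stage is stuck,
    t_commit never arrives and the extensions continue indefinitely:
    the ACK deadline keeps being pushed while the message never leaves
    ReadFromPubSub, which matches the observed symptom."""
    times, t = [], t_read
    while t + deadline < t_commit:
        t += deadline - margin  # renew shortly before the lease expires
        times.append(t)
    return times


# A stage committing at t=25 needs two renewals (at t=8 and t=16);
# a stage committing within the initial deadline needs none.
renewals = lease_extensions(0, 25)
```

This is why the Pub/Sub metrics show a growing backlog but no redeliveries: the leases are healthy even though the pipeline is not making progress.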
Manually deleting the worker VM (from the Compute Engine console) triggers the automatic recreation of a VM and unsticks the job. Consumption of the Pub/Sub subscription resumes and everything returns to normal.
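Until a root cause is found, one pragmatic mitigation is to automate that workaround: poll the subscription's oldest-unacked-message age (Cloud Monitoring exposes it as `pubsub.googleapis.com/subscription/oldest_unacked_message_age`) and recreate the worker when the job looks stuck. A minimal sketch of the detection heuristic, with illustrative threshold and sample-count values:

```python
def looks_stuck(ages, threshold=300, samples=3):
    """Flag the job as stuck when the last `samples` readings of the
    subscription's oldest-unacked-message age (in seconds) are all above
    `threshold` and strictly increasing, i.e. nothing is being ACKed.
    Both parameters are illustrative and should be tuned per job."""
    tail = ages[-samples:]
    return (len(tail) == samples
            and all(a > threshold for a in tail)
            and all(b > a for a, b in zip(tail, tail[1:])))


# Healthy: the age bounces around near zero as messages are ACKed.
healthy = looks_stuck([2, 40, 3, 15, 1])
# Stuck: the age climbs steadily past the threshold.
stuck = looks_stuck([200, 310, 370, 430])
```

The recovery action itself can then be the same as the manual one, e.g. a `gcloud compute instances delete` of the worker VM, which Dataflow recreates automatically.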
How can we solve this problem and ensure our jobs don't get stuck? We're out of ideas, as Dataflow produces only a few logs and our business code does not seem to be responsible for this behaviour.
(1) Dataflow making requests to extend the messages' ACK deadline.
(2) Error logs. ReadStream-process can also be MergeBuckets-process. NameOfAGroupByKeyStep is the step at which execution passes from stage 1 to stage 2.
Stuck state: workflow-msec-<NameOfAGroupByKeyStep>/ReadStream-process, reporter: 0x4566bdcd4aa8, with stack:
--- Thread (name: futex-default-SDomainT/132) stack: ---
PC: @ 0x55f23554d580 thread::(anonymous namespace)::FutexDomain::RawBlock()
@ 0x55f23554d580 thread::(anonymous namespace)::FutexDomain::RawBlock()
@ 0x55f23554cbbe thread::(anonymous namespace)::FutexDomain::BlockCurrent()
@ 0x55f2356e7aac base::scheduling::Downcalls::UserSchedule()
@ 0x55f2356e689e AbslInternalPerThreadSemWait
@ 0x55f235749e84 absl::CondVar::WaitCommon()
@ 0x55f23554c221 thread::SelectUntil()
@ 0x55f2345be1cb dist_proc::dax::workflow::(anonymous namespace)::BatchingWindmillGetDataClient::GetData()
@ 0x55f2345ac148 dist_proc::dax::workflow::StreamingRpcWindmillServiceStreamingServer::GetData()
@ 0x55f234ae9f85 dist_proc::dax::workflow::WindmillServiceStreamingServerProxy::GetData()
@ 0x55f234945ad3 dist_proc::dax::workflow::StateManager::PrefetchAll()
@ 0x55f23494521b dist_proc::dax::workflow::StateManager::ReadTag()
@ 0x55f23493c3d6 dist_proc::dax::workflow::WindmillWindowingAPIDelegate::ReadKeyedStateImplVirtual()
@ 0x55f2349420ed dist_proc::dax::workflow::WindowingAPIDelegate::ReadKeyedStateImpl()
@ 0x55f234941fd2 dist_proc::dax::workflow::WindmillCacheAccess::ReadKeyedStateImpl()
@ 0x55f2346e6ec8 dist_proc::dax::workflow::CacheAccess::ReadStateFromCache<>()::{lambda()#1}::operator()()
@ 0x55f2346e6e8e absl::functional_internal::InvokeObject<>()
@ 0x55f234942912 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x55f2349c5927 dist_proc::dax::workflow::StateObjectsCache::ReadImpl()
@ 0x55f2349c56f5 dist_proc::dax::workflow::StateObjectsCache::Read()
(3) Info-level logs preceding the outages, about networking issues:
I0128 16:07:09.289409461 166 subchannel.cc:945] subchannel 0x473cbc81a000 {address=ipv4:74.125.133.95:443, args=grpc.client_channel_factory=0x473cbfcb4690, grpc.default_authority=europe-west1-dataflowstreaming-pa.googleapis.com, grpc.dns_enable_srv_queries=1, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x473cbf494f78, grpc.internal.security_connector=0x473cbb5f0230, grpc.internal.subchannel_pool=0x473cbf766870, grpc.keepalive_permit_without_calls=1, grpc.keepalive_time_ms=60000, grpc.keepalive_timeout_ms=60000, grpc.max_metadata_size=1048576, grpc.max_receive_message_length=-1, grpc.primary_user_agent=grpc-c++/1.44.0-dev, grpc.resource_quota=0x473cbf752ca8, grpc.server_uri=dns:///europe-west1-dataflowstreaming-pa.googleapis.com}: connect failed: {"created":"@1643386029.289272376","description":"Failed to connect to remote host: FD shutdown","file":"third_party/grpc/src/core/lib/iomgr/ev_poll_posix.cc","file_line":500,"grpc_status":14,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1643386029.289234760","description":"connect() timed out","file":"third_party/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":114}],"target_address":"ipv4:74.125.133.95:443"}
(4) Dataflow correctly detecting the increasing system lag.
After contacting Google's support team, we never got a clear answer as to what the problem was, but it stopped occurring. We concluded it was an internal error that was eventually fixed by the Dataflow team.