
Dataflow Pub/Sub streaming jobs get stuck and keep modifying ACK deadline of messages

We've been using Dataflow with the Python Beam SDK (2.34.0 and 2.35.0) for two different streaming jobs, both taking Pub/Sub topics as input. One of those jobs windows input messages and groups them before writing them to Cloud Storage. The other does not apply any windowing (but makes use of timers). These jobs currently process very few messages (about 1 message per second), and each has a single worker.
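For context, here is a minimal sketch of what the first (windowing) job roughly looks like. All names (project, topic, bucket, keying logic, window size) are hypothetical placeholders for illustration, not our actual business code:

# Minimal sketch of the windowing job; every name below is a placeholder.
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyMessages" >> beam.Map(lambda msg: ("some-key", msg))
        # This GroupByKey is where execution passes from stage 1 to
        # stage 2 (see the error logs further down).
        | "NameOfAGroupByKeyStep" >> beam.GroupByKey()
        | "FormatBatch" >> beam.MapTuple(
            lambda key, msgs: b"\n".join(msgs).decode("utf-8"))
        | "WriteToGCS" >> fileio.WriteToFiles(
            path="gs://my-bucket/output/",
            sink=lambda dest: fileio.TextSink())
    )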

In the past few days, both of these jobs have been getting stuck around once a day, causing outages in our system. The symptoms are:

  • Pub/Sub messages are no longer acknowledged.
  • However, the pull request rate on the subscriptions stays the same.
  • The ACK deadline of the messages keeps getting extended. (1)
  • Subsequent steps (after ReadFromPubSub ) don't process any messages.
  • Sometimes (but not always) we get low-level error messages (2) in the logs (not related to our code).
  • The job getting stuck is preceded by info-level logs about gRPC connections timing out and other network-related errors. (3) However, those logs sometimes occur without being followed by an outage.
  • Dataflow (correctly) detects an increase in the system lag. (4)
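While the job is stuck, the backlog can be confirmed from the subscription's metrics. A rough sketch using the Cloud Monitoring client library (project and subscription IDs are placeholders), querying the oldest-unacked-message age, which grows steadily during an outage while the pull rate stays flat:

# Sketch: query the oldest unacked message age over the last hour.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    start_time={"seconds": now - 3600},
    end_time={"seconds": now},
)
results = client.list_time_series(
    request={
        "name": "projects/my-project",  # placeholder project
        "filter": (
            'metric.type = '
            '"pubsub.googleapis.com/subscription/oldest_unacked_message_age"'
            ' AND resource.labels.subscription_id = "my-subscription"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.int64_value)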

Dataflow does not trigger any autoscaling in response to the messages stacking up in the subscription. This is probably because the CPU usage of the VMs stays low (as messages are no longer being processed anyway). Also, from what I understand, messages are only ACK'ed once they've been safely committed at the end of a stage (both our jobs contain two stages). But Dataflow might actually still be reading the messages (since it keeps pushing back their ACK deadline). However, those messages never leave the ReadFromPubSub step.

Manually deleting the worker VM (from the Compute Engine console) triggers the automatic recreation of a VM and unsticks the job. Consumption of the Pub/Sub subscription resumes and everything returns to normal.
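Until the root cause is found, that workaround can be scripted as a stopgap. A sketch using the Compute Engine client library, assuming hypothetical project, zone and instance names; Dataflow's managed instance group then recreates the worker automatically:

# Sketch: delete a stuck worker VM so Dataflow recreates it.
from google.cloud import compute_v1

def recycle_worker(project: str, zone: str, instance: str) -> None:
    client = compute_v1.InstancesClient()
    # Deleting the VM is the same action as in the Compute Engine
    # console; the instance group recreates it and the job resumes.
    operation = client.delete(project=project, zone=zone, instance=instance)
    operation.result()  # block until the deletion completes

recycle_worker("my-project", "europe-west1-b", "streaming-job-worker-0")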

How can we solve this problem and ensure our jobs don't get stuck? We're out of ideas, as Dataflow produces only a few logs and our business code does not seem to be responsible for this behaviour.

(1) Dataflow making requests to extend the messages' ACK deadline.

(2) Error logs. ReadStream-process can also be MergeBuckets-process . NameOfAGroupByKeyStep is the step at which the execution passes from stage 1 to stage 2.

Stuck state: workflow-msec-<NameOfAGroupByKeyStep>/ReadStream-process, reporter: 0x4566bdcd4aa8, with stack:
--- Thread (name: futex-default-SDomainT/132) stack: ---
PC: @     0x55f23554d580  thread::(anonymous namespace)::FutexDomain::RawBlock()
@     0x55f23554d580  thread::(anonymous namespace)::FutexDomain::RawBlock()
@     0x55f23554cbbe  thread::(anonymous namespace)::FutexDomain::BlockCurrent()
@     0x55f2356e7aac  base::scheduling::Downcalls::UserSchedule()
@     0x55f2356e689e  AbslInternalPerThreadSemWait
@     0x55f235749e84  absl::CondVar::WaitCommon()
@     0x55f23554c221  thread::SelectUntil()
@     0x55f2345be1cb  dist_proc::dax::workflow::(anonymous namespace)::BatchingWindmillGetDataClient::GetData()
@     0x55f2345ac148  dist_proc::dax::workflow::StreamingRpcWindmillServiceStreamingServer::GetData()
@     0x55f234ae9f85  dist_proc::dax::workflow::WindmillServiceStreamingServerProxy::GetData()
@     0x55f234945ad3  dist_proc::dax::workflow::StateManager::PrefetchAll()
@     0x55f23494521b  dist_proc::dax::workflow::StateManager::ReadTag()
@     0x55f23493c3d6  dist_proc::dax::workflow::WindmillWindowingAPIDelegate::ReadKeyedStateImplVirtual()
@     0x55f2349420ed  dist_proc::dax::workflow::WindowingAPIDelegate::ReadKeyedStateImpl()
@     0x55f234941fd2  dist_proc::dax::workflow::WindmillCacheAccess::ReadKeyedStateImpl()
@     0x55f2346e6ec8  dist_proc::dax::workflow::CacheAccess::ReadStateFromCache<>()::{lambda()#1}::operator()()
@     0x55f2346e6e8e  absl::functional_internal::InvokeObject<>()
@     0x55f234942912  std::__u::__function::__policy_invoker<>::__call_impl<>()
@     0x55f2349c5927  dist_proc::dax::workflow::StateObjectsCache::ReadImpl()
@     0x55f2349c56f5  dist_proc::dax::workflow::StateObjectsCache::Read()

(3) Info-level logs that precede outages, about networking issues:

I0128 16:07:09.289409461     166 subchannel.cc:945]          subchannel 0x473cbc81a000 {address=ipv4:74.125.133.95:443, args=grpc.client_channel_factory=0x473cbfcb4690, grpc.default_authority=europe-west1-dataflowstreaming-pa.googleapis.com, grpc.dns_enable_srv_queries=1, grpc.http2_scheme=https, grpc.internal.channel_credentials=0x473cbf494f78, grpc.internal.security_connector=0x473cbb5f0230, grpc.internal.subchannel_pool=0x473cbf766870, grpc.keepalive_permit_without_calls=1, grpc.keepalive_time_ms=60000, grpc.keepalive_timeout_ms=60000, grpc.max_metadata_size=1048576, grpc.max_receive_message_length=-1, grpc.primary_user_agent=grpc-c++/1.44.0-dev, grpc.resource_quota=0x473cbf752ca8, grpc.server_uri=dns:///europe-west1-dataflowstreaming-pa.googleapis.com}: connect failed: {"created":"@1643386029.289272376","description":"Failed to connect to remote host: FD shutdown","file":"third_party/grpc/src/core/lib/iomgr/ev_poll_posix.cc","file_line":500,"grpc_status":14,"os_error":"Timeout occurred","referenced_errors":[{"created":"@1643386029.289234760","description":"connect() timed out","file":"third_party/grpc/src/core/lib/iomgr/tcp_client_posix.cc","file_line":114}],"target_address":"ipv4:74.125.133.95:443"}

(4) Dataflow correctly detecting the system lag increasing.

After contacting Google's support team, we never got a clear answer as to what the problem was, but it stopped occurring. We concluded that it was an internal error that was eventually fixed by the Dataflow team.
