oldest_unacked_message_age and num_undelivered_messages in Pubsub Monitoring

I read in the Pub/Sub docs that if both oldest_unacked_message_age and num_undelivered_messages are growing in tandem, it indicates that the subscribers are not keeping up with message volume. Can someone explain why, or elaborate on it?

A subscriber is an application with a subscription to one or more topics, from which it receives messages. After a message is sent to a subscriber, the subscriber must acknowledge the message.

If Pub/Sub attempts to deliver a message but the subscriber doesn't acknowledge it within the acknowledgment deadline, whether because of bugs in your code or for some other reason, Pub/Sub automatically tries to resend the message. By default, Pub/Sub tries resending the message immediately. If there are too few subscribers to handle a high volume of messages, acknowledging them can take too long; the messages are then redelivered, and the subscribers see duplicate messages. Meanwhile, the unacknowledged messages pile up, so both the size of the backlog and the age of its oldest message keep growing. That is what indicates that the subscribers are not keeping up with message volume.

You can address this situation as follows:

  1. Add more subscriber threads or processes.
  2. Add more subscriber machines or containers.
  3. Look for signs of bugs in your code that prevent it from successfully acknowledging messages or processing them in a timely fashion.

For more information, see link1 and link2.

These two metrics measure two different properties of a subscription's backlog. Let's examine how they tend to grow, starting with oldest_unacked_message_age, which gives the age of the oldest message that has not been acknowledged by subscribers. This can grow for several reasons, including:

  1. The oldest message is one that the subscriber cannot handle, so it keeps getting nacked or has its ack deadline expire, which results in redelivery. If this is the case, you will typically see oldest_unacked_message_age grow in tandem with passing time. In other words, for every minute that passes, the value of oldest_unacked_message_age increases by a minute. If only a small number of messages are being rejected, then num_undelivered_messages will reflect the number of messages that are being rejected and will likely be much smaller than the total number of messages published. A dead letter topic can help with such messages.

  2. The subscriber is not able to keep up with the load of published messages. If there is not enough subscriber capacity to keep up with load, then a backlog of messages to be delivered builds up. As this backlog grows, the age of the oldest message in the backlog likely grows as well. Therefore, in this case, oldest_unacked_message_age and num_undelivered_messages both increase (or at least, don't decrease) over time. In this case, oldest_unacked_message_age may not grow in lockstep with time; it's possible that you are able to consume some older messages, but just not able to keep up fully, so the oldest_unacked_message_age may be growing more slowly or may remain steady at a non-zero value.
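The difference between the two cases can be illustrated with a toy simulation. This is only a sketch in plain Python, not the Pub/Sub API: the function `simulate`, its parameters, and the discrete "ticks" are assumptions made up for illustration. Each sample is a pair that plays the role of (oldest_unacked_message_age, num_undelivered_messages).

```python
from collections import deque


def simulate(publish_per_tick, capacity_per_tick, ticks, poison_ids=frozenset()):
    """Return one (oldest_unacked_age_in_ticks, backlog_size) sample per tick.

    Hypothetical model: messages are (id, publish_tick) pairs. Each tick the
    subscriber attempts up to `capacity_per_tick` of the oldest messages.
    Messages whose id is in `poison_ids` are never acked and are redelivered.
    """
    backlog = deque()  # ordered oldest first
    next_id = 0
    samples = []
    for tick in range(ticks):
        # Publish this tick's messages.
        for _ in range(publish_per_tick):
            backlog.append((next_id, tick))
            next_id += 1
        # Deliver the oldest messages, up to the subscriber's capacity.
        attempted = [backlog.popleft()
                     for _ in range(min(capacity_per_tick, len(backlog)))]
        # Poison messages are not acked, so they return to the front.
        redelivered = [m for m in attempted if m[0] in poison_ids]
        backlog.extendleft(reversed(redelivered))
        oldest_age = tick - backlog[0][1] if backlog else 0
        samples.append((oldest_age, len(backlog)))
    return samples


# Case 1: plenty of capacity, but message 0 can never be processed.
stuck = simulate(publish_per_tick=5, capacity_per_tick=10, ticks=10,
                 poison_ids=frozenset({0}))
# Case 2: no bad messages, but the subscriber is slower than the publisher.
overloaded = simulate(publish_per_tick=10, capacity_per_tick=6, ticks=10)

print("stuck:     ", stuck)
print("overloaded:", overloaded)
```

In the first run the age climbs by one every tick while the backlog stays at a single message. In the second run the backlog grows every tick and the age rises too, but more slowly than time passes, because some older messages are still being consumed.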

The second case is the one to which you are referring. Subscribers may not be able to keep up for several reasons and the solution may vary depending on the reason:

  1. Downstream dependencies are too slow: If you are, for example, writing to a database from your subscriber based on messages received, and that is very slow, you may need to tune the behavior of the database to speed up the processing of messages.
  2. You don't have enough subscriber capacity: You may need to bring up more subscriber clients or increase the resources (RAM, network, CPU, or threads) on the subscriber instances you already have. Increasing the number of subscriber clients often helps, though it may be more cost-effective to tune the instances you already have. Autoscalers like the one in GCE allow you to automatically alter the number of instances based on the number of unacknowledged Pub/Sub messages.
  3. Your flow control limits are set too tightly: If your instances are not exhausting any of their resources and downstream dependencies are not the limiting factor, but processing is still too slow to keep up with the backlog, look at raising the flow control settings. The flow control settings limit the number of messages that can be outstanding to the subscriber client at a time, which limits the ability to process a backlog. You may want to increase the limits in this case in order to saturate the capacity of your subscriber clients.
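To see why a tight flow control limit caps throughput, here is a rough back-of-the-envelope sketch based on Little's law: with at most N messages outstanding and each message taking about T seconds to process and ack, a client cannot exceed roughly N / T messages per second. The function names and the numbers below are hypothetical, and the model ignores byte-based limits and real resource constraints.

```python
import math


def max_throughput_per_client(max_outstanding_messages, seconds_per_message):
    """Approximate upper bound (messages/second) for one subscriber client.

    Little's law: throughput = messages in flight / time each one is in flight.
    """
    return max_outstanding_messages / seconds_per_message


def clients_needed(publish_rate, max_outstanding_messages, seconds_per_message):
    """Smallest number of clients whose combined bound covers the publish rate."""
    per_client = max_throughput_per_client(max_outstanding_messages,
                                           seconds_per_message)
    return math.ceil(publish_rate / per_client)


# Assumed workload: 2000 messages/s published, 0.25 s to process one message.
print(clients_needed(2000, max_outstanding_messages=100, seconds_per_message=0.25))
print(clients_needed(2000, max_outstanding_messages=500, seconds_per_message=0.25))
```

Under these assumed numbers, a limit of 100 outstanding messages caps each client at 400 messages per second, so five clients are needed; raising the limit to 500 lets a single client cover the load, provided it actually has the CPU, memory, and downstream capacity to process that many messages concurrently.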

If your backlog only grows temporarily due to a brief spike in publish load, you may not need to make any changes at all. Absorbing these temporary spikes is exactly what a Pub/Sub system is designed to do. However, if the latency of processing messages is too high for your application or the backlog grows indefinitely, you may need to take some of the above steps.
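As a quick way to judge whether a spike is temporary, you can estimate how long the backlog takes to drain once the spike is over. The sketch below is simple arithmetic with a hypothetical helper, `drain_time_seconds`; the rates are assumed inputs that you would read from your own monitoring.

```python
import math


def drain_time_seconds(backlog_size, publish_rate, processing_rate):
    """Seconds until the backlog reaches zero, given steady rates in messages/s.

    Returns math.inf when processing is not faster than publishing, because
    the backlog then never shrinks.
    """
    spare_capacity = processing_rate - publish_rate
    if spare_capacity <= 0:
        return math.inf
    return backlog_size / spare_capacity


# Assumed numbers: 30,000 messages queued after a spike, 500 messages/s still
# being published, 600 messages/s of subscriber capacity.
print(drain_time_seconds(30_000, publish_rate=500, processing_rate=600))
```

If the result is longer than the processing latency your application can tolerate, or infinite, that is the signal to take some of the steps above.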
