Dataflow pipeline pauses after scaling down with low CPU utilization

In our streaming pipeline we read data from Pub/Sub, do some validations, and then group it by key in a session window with a 10 second gap. Afterwards the data is processed further and written to Bigtable and to Pub/Sub again.

We're using Apache Beam 2.28 and the Dataflow Streaming Engine. During the day we process more data than overnight, and the pipeline automatically scales up the number of workers (n2d-standard-4). It mostly scales up from 2 workers to 4 or 5 to reduce the backlog. After that it scales down again because the CPU utilization is too low for 4 or 5 workers.

At this point the CPU utilization drops to nearly 0% on all workers and the entire pipeline starts lagging behind massively. As a result, the number of workers is scaled up again and the pipeline resumes processing the data. Once the backlog has been reduced, the number of workers is gradually lowered and the same issue arises.

metrics

What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.

GroupByKey throughput

I know that a GroupByKey can suffer from hot keys, but in that case I would expect the CPU utilization of one worker to be very high while the others have nothing to do.

Does anyone know what might be causing this issue?

The issue was caused by the combination of three things: using a session window with a GroupByKey, how the watermark of an unbounded Pub/Sub source works, and when the acknowledgements are sent to Pub/Sub.

Our session window with a 10 second gap sometimes produced no output for a couple of minutes, because no early trigger was configured and messages kept arriving for the same key within the 10 second session gap. Because these steps are part of the first fused stage in the actual execution of our pipeline, this led to some messages not being acknowledged to Pub/Sub (the ack is only sent once the first fused stage has completed). The age of the oldest unacknowledged message on the subscription kept rising, which prevented the watermark from advancing.
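To make the "no output for a couple of minutes" part concrete, here is a minimal plain-Python sketch of how session windows merge. It is not Beam code, and `merge_sessions` is a hypothetical helper written only for illustration: each event opens a window `[ts, ts + gap)`, and overlapping windows for the same key are merged. With the default trigger, a merged session only emits once the watermark passes its end.

```python
from typing import List, Tuple


def merge_sessions(timestamps: List[float], gap: float) -> List[Tuple[float, float]]:
    """Merge event timestamps of one key into session windows [start, end)."""
    sessions: List[Tuple[float, float]] = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # The event falls inside the open session, so the session is extended.
            start, end = sessions[-1]
            sessions[-1] = (start, max(end, ts + gap))
        else:
            # The gap has elapsed, so a new session starts.
            sessions.append((ts, ts + gap))
    return sessions


# One key receiving a message every 8 seconds for roughly 5 minutes:
steady = [i * 8.0 for i in range(38)]      # 0, 8, ..., 296
print(merge_sessions(steady, gap=10.0))    # [(0.0, 306.0)]

# The same key with a pause longer than the gap:
print(merge_sessions([0.0, 8.0, 30.0], gap=10.0))  # [(0.0, 18.0), (30.0, 40.0)]
```

In the first case the single session stays open for about five minutes, so nothing leaves the GroupByKey during that time, which matches the output throughput dropping to 0 in the question.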

This issue became more pronounced because the acknowledgement deadline was set to 10 minutes. When the number of workers scaled down, this caused the behaviour described in the original question.

We were able to solve this by adding a Reshuffle before the creation of the session window (with the groupbykey) and decreasing the acknowledgement deadline.

https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
