Dataflow pipeline pauses after scaling down with low CPU utilization

Original post 2021-08-19 09:54:37 8 1 google-cloud-dataflow, apache-beam

In our streaming pipeline we read data from Pub/Sub, run some validations, and then group the data by key in session windows with a 10-second gap. Afterwards the data is processed further and written to Bigtable and back to Pub/Sub.

We're using Apache Beam 2.28 and the Dataflow Streaming Engine. During the day we process more data than overnight, and the pipeline automatically scales up the number of workers (n2d-standard-4). Mostly it scales up from 2 workers to 4 or 5 to reduce the backlog. After that it scales down again because the CPU utilization is too low for 4 or 5 workers.

It is at this point that the CPU utilization drops to nearly 0% on all workers and the entire pipeline starts lagging behind massively. As a result, the number of workers is scaled up again, to a higher number, and the pipeline resumes processing the data. Once the backlog is reduced, the number of workers is gradually lowered and the same issue arises.

[Image: metrics]

What we notice is that in the GroupByKey step, the input throughput stays more or less the same, but the output throughput drops to 0.

[Image: GroupByKey throughput]

I know that GroupByKey can suffer from hot keys, but then I would expect the CPU utilization of one worker to be very high while the others have nothing to do.

Does anyone know what might be causing this issue?

1 answer

The issue was caused by the combination of three things: using a session window with a GroupByKey, how the watermark for an unbounded Pub/Sub source works, and when acknowledgements are sent to Pub/Sub.

Our session window with a gap of 10 seconds sometimes didn't output any messages for a couple of minutes (no early trigger was configured, and messages kept arriving for the same key within the 10-second session gap). Because these steps are part of the first fused stage in the actual execution of our pipeline, this led to some messages not being acknowledged to Pub/Sub (the ack is only sent when the first fused stage has completed). The age of the oldest unacknowledged message on the subscription kept rising, which prevented the watermark from advancing.
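To make the session behaviour concrete, here is a small pure-Python sketch of how gap-based session windows merge. It is a simplified model and not Beam's implementation: the function name `merge_sessions` and the sample timestamps are made up for illustration. Each event opens a window `[t, t + gap)`, and overlapping windows for the same key are merged into one.

```python
def merge_sessions(timestamps, gap):
    """Merge event timestamps (in seconds) for ONE key into session windows.

    Simplified model: every event starts a window [t, t + gap); a window
    that overlaps the previous session extends that session instead.
    Returns a list of (start, end) tuples.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t < sessions[-1][1]:
            # Event falls inside the open session: push the session end out.
            sessions[-1][1] = t + gap
        else:
            # Gap exceeded (or first event): start a new session.
            sessions.append([t, t + gap])
    return [tuple(s) for s in sessions]


# Hypothetical traffic: one message every 5 s for 3 minutes, same key.
busy_key = merge_sessions(range(0, 180, 5), gap=10)

# Hypothetical traffic: a quiet period longer than the gap splits the session.
quiet_key = merge_sessions([0, 5, 30], gap=10)
```

In this model `busy_key` is a single session `(0, 185)`: without an early trigger, nothing is emitted for that key until the watermark passes 185 seconds, even though the gap is only 10 seconds. `quiet_key` splits into `(0, 15)` and `(30, 40)`.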

This issue became more pronounced because the acknowledgement deadline was set to 10 minutes. When the number of workers scaled down, this caused the issue described in the original question.

We were able to solve this by adding a Reshuffle before the session window (and its GroupByKey) and by decreasing the acknowledgement deadline.

https://cloud.google.com/blog/products/data-analytics/handling-duplicate-data-in-streaming-pipeline-using-pubsub-dataflow
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization

solution1 1 ACCEPTED 2021-08-25 13:08:55
