
How is persistent disk use determined in GCP Dataflow?

In the pricing section, Google says that there is a default amount of persistent disk (PD) per worker (it varies depending on batch vs. streaming). I am running a job, and the persistent disk usage is much higher than it should be given the number of workers I have, compared to the default PD per worker. This is consistent across multiple distinct jobs. What is causing the increased PD usage? For reference, the default is 480 GB for a streaming worker, but I am being charged for 5888 GB.

Update as of 2021

Dataflow now has Streaming Engine. Streaming Engine does NOT rely on worker persistent disks to hold state for streaming jobs; instead, it provides a 'service' that abstracts streaming state/snapshot storage.

If persistent disk billing is a concern in your streaming pipelines, consider using Streaming Engine.

For more information, see: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#streaming-engine
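
As a minimal sketch of what this looks like in code, assuming the Beam Java SDK with the Dataflow runner on the classpath and a recent SDK version that exposes the documented --enableStreamingEngine flag as a setter on DataflowPipelineOptions (project, region and bucket flags are placeholders supplied on the command line):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamingEngineExample {
  public static void main(String[] args) {
    // Parse standard flags such as --project, --region, --tempLocation
    // (placeholders here) into Dataflow runner options.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);

    // With Streaming Engine, streaming state lives in the Dataflow service
    // rather than on large per-worker persistent disks.
    // (Setter name assumed to match the documented --enableStreamingEngine flag.)
    options.setEnableStreamingEngine(true);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply streaming transforms here ...
    pipeline.run();
  }
}
```

In older SDK releases the same behaviour was typically enabled via --experiments=enable_streaming_engine; either way, workers no longer need the large streaming persistent disks, which is what removes most of the PD charge.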


This is a streaming pipeline with autoscaling enabled.

According to https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling :

Streaming pipelines are deployed with a fixed pool of persistent disks, equal in number to --maxNumWorkers

According to https://cloud.google.com/dataflow/service/dataflow-service-desc#persistent-disk-resources :

The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode.

So the expected value of "Current PD" should be around (your value of maxNumWorkers) * 400 GB, rather than 4 * 400 GB.
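
As a concrete illustration (a minimal sketch with hypothetical values, assuming the Beam Java SDK and the Dataflow runner), the disk pool for a streaming job without Streaming Engine is sized by the maxNumWorkers pipeline option, so lowering that cap is the main lever for PD cost:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PdPoolSizingExample {
  public static void main(String[] args) {
    // Hypothetical example: with --maxNumWorkers=15, the service allocates
    // 15 * 400 GB = 6000 GB of persistent disk up front, even while
    // autoscaling is currently running only a handful of workers.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);

    // PD cost tracks this cap, not the current autoscaled worker count,
    // so keep it as low as the pipeline's peak load allows.
    options.setMaxNumWorkers(5);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply streaming transforms here ...
    pipeline.run();
  }
}
```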
