
How is persistent disk use determined in GCP Dataflow?

In the pricing section, Google says that there is a default amount of persistent disk (PD) per worker (it varies depending on batch vs. streaming). I am running a job, and the persistent disk usage is much higher than it should be given the number of workers I have, compared to the default PD per worker. This is consistent across multiple distinct jobs. What is causing the increased PD usage? For reference, the default is 480 GB for a streaming worker, but I am being charged for 5888 GB.

Update as of 2021

Dataflow now has Streaming Engine. Streaming Engine does NOT rely on worker persistent disks to hold state for streaming jobs; instead, it provides a 'service' that abstracts streaming state/snapshot storage.

If persistent disk billing is a concern in your streaming pipelines, consider using Streaming Engine.

For more information, see: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#streaming-engine
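
As a minimal sketch of what this looks like in code, assuming the Beam Java SDK with the Dataflow runner on the classpath and a recent SDK version that exposes the documented --enableStreamingEngine flag as a setter on DataflowPipelineOptions (project, region and bucket flags are placeholders supplied on the command line):

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StreamingEngineExample {
  public static void main(String[] args) {
    // Parse standard flags such as --project, --region, --tempLocation
    // (placeholders here) into Dataflow runner options.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);

    // With Streaming Engine, streaming state lives in the Dataflow service
    // rather than on large per-worker persistent disks.
    // (Setter name assumed to match the documented --enableStreamingEngine flag.)
    options.setEnableStreamingEngine(true);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply streaming transforms here ...
    pipeline.run();
  }
}
```

In older SDK releases the same behaviour was typically enabled via --experiments=enable_streaming_engine; either way, workers no longer need the large streaming persistent disks, which is what removes most of the PD charge.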


This is a streaming pipeline with autoscaling enabled.

According to https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling :

Streaming pipelines are deployed with a fixed pool of persistent disks, equal in number to --maxNumWorkers

According to https://cloud.google.com/dataflow/service/dataflow-service-desc#persistent-disk-resources :

The default size of each persistent disk is 250 GB in batch mode and 400 GB in streaming mode.

So the expected value of "Current PD" should be around (your value of maxNumWorkers) * 400 GB, rather than 4 * 400 GB.
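
As a concrete illustration (a minimal sketch with hypothetical values, assuming the Beam Java SDK and the Dataflow runner), the disk pool for a streaming job without Streaming Engine is sized by the maxNumWorkers pipeline option, so lowering that cap is the main lever for PD cost:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PdPoolSizingExample {
  public static void main(String[] args) {
    // Hypothetical example: with --maxNumWorkers=15, the service allocates
    // 15 * 400 GB = 6000 GB of persistent disk up front, even while
    // autoscaling is currently running only a handful of workers.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setStreaming(true);

    // PD cost tracks this cap, not the current autoscaled worker count,
    // so keep it as low as the pipeline's peak load allows.
    options.setMaxNumWorkers(5);

    Pipeline pipeline = Pipeline.create(options);
    // ... apply streaming transforms here ...
    pipeline.run();
  }
}
```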
