
Best Practice - Writing Unbounded PCollection to GCS Bucket with restricted Service Account

Trying to make my question as broad as possible:

When writing an unbounded PCollection to a GCS bucket using TextIO, while using a least-privilege service account that does not have GCS deletion access, the following error occurs in Dataflow:

Error trying to copy gs://[Temporary beam file] to gs://[JSON We expect]: {"code":403,"errors":[{"domain":"global","message":"[Service Account] does not have storage.objects.delete access to [JSONFile]","reason":"forbidden"}],"message":"[Service Account] does not have storage.objects.delete access to [JSON File]"}

The above error makes sense: we do not allow the service account deletion access to the bucket we are using, and the Dataflow pipeline is attempting to clean up the temporary file shards it wrote.

The question, however, is: is the best practice at this point to grant deletion access to the Dataflow service account and keep using TextIO? Or would it be better to apply a DoFn to the PCollection we want to ingest and have that DoFn write each individual element into the GCS bucket incrementally via the GCS API, thus sidestepping the cleanup of the shards?
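For reference, a minimal sketch of the DoFn alternative described above, assuming the google-cloud-storage client library is available on the workers; the bucket name and per-element object naming scheme are placeholders, not anything from the original pipeline:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import org.apache.beam.sdk.transforms.DoFn;

import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class WriteElementToGcsFn extends DoFn<String, Void> {
  // The Storage client is not serializable, so build it per worker in @Setup.
  private transient Storage storage;

  @Setup
  public void setup() {
    // Uses the worker's default credentials, i.e. the Dataflow service account.
    storage = StorageOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // One object per element: only storage.objects.create is needed,
    // no delete permission, since there are no temporary shards to clean up.
    String objectName = "output/" + UUID.randomUUID() + ".json";
    BlobInfo blobInfo = BlobInfo.newBuilder("my-output-bucket", objectName).build();
    storage.create(blobInfo, c.element().getBytes(StandardCharsets.UTF_8));
  }
}
```

Note the trade-off: this produces one GCS object per element, which can mean a very large number of small objects compared with TextIO's sharded files.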

Thanks

There is a withTempDirectory method on TextIO that should allow you to point the temporary files at a bucket where the service account has higher privileges.
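A minimal sketch of that approach, assuming `windowedLines` is an unbounded PCollection<String> that already has a window/trigger applied; the bucket names are placeholders (the temp bucket is one where the service account may delete objects, the output bucket is not):

```java
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.ResourceId;

// Temporary shards go to a bucket where the service account has delete access.
ResourceId tempDir =
    FileSystems.matchNewResource("gs://my-temp-bucket/beam-temp", true /* isDirectory */);

windowedLines.apply(
    TextIO.write()
        .to("gs://my-secure-output-bucket/output")
        .withTempDirectory(tempDir)
        .withWindowedWrites()   // required when the input PCollection is unbounded
        .withNumShards(1));     // windowed writes need an explicit shard count
```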

I believe that TextIO will put the files within the tempLocation of your pipeline. You may be able to set the tempLocation of your pipeline to a bucket that is not so security-critical, and write the result to the secure bucket.
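A sketch of what that would look like when constructing the pipeline; the bucket name is a placeholder and the same thing can be done with the `--tempLocation` command-line flag:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Point the pipeline's temp location at a bucket that is not security-critical;
// the TextIO output path can still target the locked-down bucket.
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
options.setTempLocation("gs://my-less-sensitive-bucket/temp");
Pipeline pipeline = Pipeline.create(options);
```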

LMK if any of those alternatives help.
