
Apache Beam / Google Dataflow: final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.

This works fine, but since I run thousands of load jobs (one per downloaded file), I reached the quota for imports.

I've changed my code so that it lists all the files in the bucket and runs a single load job with all of them as parameters.

So basically I need the final step to run only once, when all the data has been processed. I guess I could use a GroupByKey transform to make sure all the data has been processed (see the sketch below), but I'm wondering whether there is a better / more standard approach to it.
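For illustration, here is a minimal sketch of that GroupByKey idea in Beam's Java SDK: every element gets the same constant key, so the grouping collapses all file names into one Iterable and the downstream DoFn fires exactly once in a bounded pipeline. The startSingleLoadJob helper is hypothetical - it stands in for a call to the BigQuery API with the full file list - and fileNames is assumed to be the PCollection<String> of CSV paths produced upstream.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// fileNames: PCollection<String> of the CSV paths written upstream.
PCollection<KV<Integer, Iterable<String>>> grouped =
    fileNames
        // Give every element the same key...
        .apply(WithKeys.of(0))
        // ...so GroupByKey collapses them into a single Iterable.
        .apply(GroupByKey.create());

grouped.apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Runs once per window - once in total for a bounded pipeline -
    // only after all upstream elements have been produced.
    startSingleLoadJob(c.element().getValue()); // hypothetical helper
  }
}));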

If I understood your question correctly, we had a similar problem in one of our dataflows - we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.

In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use a wildcard instead of individual file names:

i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")

This way only one Dataflow job was executed, and as a consequence all the data written to BigQuery was treated as a single load job, even though there were multiple source files.
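For context, a rough end-to-end sketch under the same assumptions - the table spec, column names, and CSV parsing below are placeholders to adapt to the real schema. In a bounded pipeline, BigQueryIO defaults to file loads, so the write typically lands in BigQuery as one load job per table:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class WildcardLoad {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadAllFiles",
            // One read transform matching every file under the folder,
            // instead of launching one pipeline per file.
            TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**"))
     .apply("ParseCsv", ParDo.of(new DoFn<String, TableRow>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Assumed two-column CSV; replace with the real parsing logic.
         String[] cols = c.element().split(",", -1);
         c.output(new TableRow().set("col_a", cols[0]).set("col_b", cols[1]));
       }
     }))
     .apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("<PROJECT>:<DATASET>.<TABLE>") // placeholder table spec
                // Batch pipelines default to file loads, so all the data
                // is written to BigQuery as a single load job per table.
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}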

Not sure if you can apply the same approach, though.
