
Apache Beam / Google Dataflow: final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on Google Cloud Storage, before running a load job on BigQuery.

This works fine, but since I run thousands of load jobs (one per downloaded file), I reached the quota for imports.

I've changed my code so that it lists all the files in the bucket and runs a single load job with all of them as parameters.

So basically I need the final step to run only once, when all the data has been processed. I guess I could use a GroupByKey transform to make sure all the data has been processed (see the sketch below), but I'm wondering whether there is a better / more standard approach to it.
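For illustration, here is a minimal sketch of that GroupByKey idea in Beam's Java SDK: every element gets the same constant key, so the grouping collapses all file names into one Iterable and the downstream DoFn fires exactly once in a bounded pipeline. The startSingleLoadJob helper is hypothetical - it stands in for a call to the BigQuery API with the full file list - and fileNames is assumed to be the PCollection<String> of CSV paths produced upstream.

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.WithKeys;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// fileNames: PCollection<String> of the CSV paths written upstream.
PCollection<KV<Integer, Iterable<String>>> grouped =
    fileNames
        // Give every element the same key...
        .apply(WithKeys.of(0))
        // ...so GroupByKey collapses them into a single Iterable.
        .apply(GroupByKey.create());

grouped.apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Runs once per window - once in total for a bounded pipeline -
    // only after all upstream elements have been produced.
    startSingleLoadJob(c.element().getValue()); // hypothetical helper
  }
}));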

If I understood your question correctly, we had a similar problem in one of our dataflows - we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered for each file in GCS separately, and we had 1000+ files in the bucket.

In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use a wildcard instead of individual file names:

i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")

This way only one Dataflow job was executed, and as a consequence all the data written to BigQuery was treated as a single load job, even though there were multiple source files.
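For context, a rough end-to-end sketch under the same assumptions - the table spec, column names, and CSV parsing below are placeholders to adapt to the real schema. In a bounded pipeline, BigQueryIO defaults to file loads, so the write typically lands in BigQuery as one load job per table:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class WildcardLoad {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadAllFiles",
            // One read transform matching every file under the folder,
            // instead of launching one pipeline per file.
            TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**"))
     .apply("ParseCsv", ParDo.of(new DoFn<String, TableRow>() {
       @ProcessElement
       public void processElement(ProcessContext c) {
         // Assumed two-column CSV; replace with the real parsing logic.
         String[] cols = c.element().split(",", -1);
         c.output(new TableRow().set("col_a", cols[0]).set("col_b", cols[1]));
       }
     }))
     .apply("WriteToBQ",
            BigQueryIO.writeTableRows()
                .to("<PROJECT>:<DATASET>.<TABLE>") // placeholder table spec
                // Batch pipelines default to file loads, so all the data
                // is written to BigQuery as a single load job per table.
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));

    p.run();
  }
}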

Not sure if you can apply the same approach, though.
