
Apache Beam / Google Dataflow: final step to run only once

I have a pipeline that downloads thousands of files, transforms them, and stores them as CSV on Google Cloud Storage before running a load job on BigQuery.

This works fine, but because I run thousands of load jobs (one per downloaded file), I hit the quota for imports.

I've changed my code so that it lists all the files in a bucket and runs a single job, passing all the files as parameters of that job.
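A rough sketch of that batching idea, in plain Java with no Beam or BigQuery client involved. `LoadJobBatcher`, `batch` and `maxUrisPerJob` are hypothetical names, and the bucket paths and the cap of 1000 are made up for illustration; the real limit on source URIs per load job should be checked against the BigQuery quota documentation. Each returned group stands for the file list of one load job:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups file URIs so each group can be submitted as one load job.
class LoadJobBatcher {

    // Splits uris into consecutive groups of at most maxUrisPerJob entries.
    static List<List<String>> batch(List<String> uris, int maxUrisPerJob) {
        if (maxUrisPerJob <= 0) {
            throw new IllegalArgumentException("maxUrisPerJob must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < uris.size(); start += maxUrisPerJob) {
            int end = Math.min(start + maxUrisPerJob, uris.size());
            // Copy the view so each batch is independent of the input list.
            batches.add(new ArrayList<>(uris.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> uris = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            uris.add("gs://my-bucket/output/file-" + i + ".csv"); // made-up paths
        }
        List<List<String>> batches = batch(uris, 1000); // 1000 is an assumed cap
        System.out.println(uris.size() + " files -> " + batches.size() + " load jobs");
    }
}
```

With the made-up numbers above, 2500 files end up in 3 groups instead of 2500 separate load jobs. The sketch only covers the grouping; it says nothing about when the listing happens, which is the part that still has to wait until all files are written.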

So basically I need the final step to run only once, after all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better or more standard approach.

If I understood your question correctly, we had a similar problem in one of our dataflows: we were hitting the 'Load jobs per table per day' BigQuery limit because the dataflow execution was triggered separately for each file in GCS, and we had 1000+ files in the bucket.

In the end, the solution to our problem was quite simple: we modified our TextIO.read transform to use wildcards instead of individual file names,

i.e. TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")

This way only one dataflow job was executed and, as a consequence, all the data written to BigQuery was treated as a single load job, even though there were multiple sources.

Not sure if you can apply the same approach, though.
