
Can I process PCollections in Apache Beam in chunks? Can I make batches of a PCollection and process each batch separately?

I have approximately 2000 files on GCS and I want to process all of them and upload the results to BigQuery. When I process them with a single pipeline, the pipeline is unable to complete: it fails or stops working after processing about 70% of the files. I decided to make batches of 100 files each, using BatchElements(), which creates lists inside the PCollection, i.e. a 2-D list. Will these batches execute separately? That is, will the first batch run end to end (from opening the files on GCS to uploading them to BigQuery) before the second batch starts working? Does this work serially or in parallel? Will my whole job fail if any single batch fails?

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

# p1 is an existing Pipeline; gz_names and the custom DoFns
# (ReadGCS, CreateXML, ...) are defined elsewhere.
xml = (
        p1
        | "Get xml name" >> beam.Create(gz_names)
        | "Batch elements" >> BatchElements(100)
        | "Open file" >> beam.ParDo(ReadGCS())
        | "Create XML" >> beam.ParDo(CreateXML())
        | "Parse Json" >> beam.ParDo(JsonParser())
        | "Remove elements" >> beam.ParDo(Filtering())
        | "Build Json" >> beam.ParDo(JsonBuilder())
        | "Write elements" >> beam.io.WriteToText(file_path_prefix="output12", file_name_suffix=".json")
)

p1.run()
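For context, BatchElements groups individual elements into Python lists, so each downstream ParDo receives a batch of file names rather than a single name, which is where the 2-D structure comes from. A toy sketch (the element values are made up for illustration):

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

with beam.Pipeline() as p:
    (
        p
        | beam.Create(["a.gz", "b.gz", "c.gz", "d.gz", "e.gz"])
        # Group single elements into lists of (up to) two.
        | BatchElements(min_batch_size=2, max_batch_size=2)
        | beam.Map(print)  # e.g. ['a.gz', 'b.gz'], ['c.gz', 'd.gz'], ['e.gz']
    )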

Beam and other big data systems try to perform optimizations to improve the execution of your pipelines. The simplest optimization is called 'fusion', in which the runner merges consecutive transforms and executes them together on the same worker, instead of redistributing the data between steps.

In this case, the issue is that your Create, BatchElements and ReadGCS transforms all execute on the same worker, and do not get a chance to be distributed to multiple workers.

My suggestion is for you to try to reshuffle your data. This will enable it to be redistributed across workers. Like so:

xml = (
        p1
        | "Get xml name" >> beam.Create(gz_names)
        | beam.Reshuffle()  # This will redistribute your data to other workers
        | "Open file" >> beam.ParDo(ReadGCS())
        | "Create XML" >> beam.ParDo(CreateXML())
        | "Parse Json" >> beam.ParDo(JsonParser())
        | "Remove elements" >> beam.ParDo(Filtering())
        | "Build Json" >> beam.ParDo(JsonBuilder())
        | "Write elements" >> beam.io.WriteToText(file_path_prefix="output12", file_name_suffix=".json")
)

p1.run()

Once you've done that, your pipeline should be able to make better progress.

Another tip that will make your pipeline more performant: consider using the transforms under apache_beam.io.fileio. They include transforms to match file names and to read files, and they are standard in Beam.
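For illustration, a minimal sketch of what that could look like, using the standard fileio transforms MatchFiles and ReadMatches (the bucket path is a placeholder, and the UTF-8 decoding is an assumption about the file contents):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    contents = (
        p
        # Match file names on GCS (placeholder path).
        | "Match files" >> fileio.MatchFiles("gs://my-bucket/*.gz")
        # Convert each match result into a readable file handle.
        | "Read matches" >> fileio.ReadMatches()
        # Read each file's contents (assumes UTF-8 text).
        | "Read files" >> beam.Map(lambda f: f.read_utf8())
    )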

If you'd like to dive deep into some runner optimizations, check out the FlumeJava paper :)
