
Can I process PCollections in Apache Beam in chunks? Can I make batches of a PCollection and process each batch separately?

I have approximately 2,000 files on GCS, and I want to process all of them and upload them to BigQuery. When I process them in a single pipeline, the pipeline is unable to complete; it fails or stops working after processing about 70% of the files. I decided to make batches of 100 files each, using BatchElements(), which produces a list of lists, i.e. a 2-D list. Will these batches execute separately, meaning the PCollection of the first batch is processed end to end (from opening a file on GCS to uploading it to BigQuery) before the second batch starts? Does this run serially or in parallel? Will my whole job fail if any one batch fails?

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

p1 = beam.Pipeline()

# gz_names and the DoFns (ReadGCS, CreateXML, JsonParser, Filtering,
# JsonBuilder) are defined elsewhere in my code.
xml = (
        p1
        | "Get xml name" >> beam.Create(gz_names)
        # Group the file names into lists of exactly 100.
        | "Batch elements" >> BatchElements(min_batch_size=100, max_batch_size=100)
        | "Open file" >> beam.ParDo(ReadGCS())
        | "Create XML" >> beam.ParDo(CreateXML())
        | "Parse Json" >> beam.ParDo(JsonParser())
        | "Remove elements" >> beam.ParDo(Filtering())
        | "Build Json" >> beam.ParDo(JsonBuilder())
        | "Write elements" >> beam.io.WriteToText(file_path_prefix="output12", file_name_suffix=".json")
)

p1.run()
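For context, here is a minimal standalone sketch (the element values and batch size are purely illustrative) showing what BatchElements emits: downstream DoFns receive a whole list per call rather than a single element, which is the "2-D list" behaviour described above.

import apache_beam as beam
from apache_beam.transforms.util import BatchElements

# Illustrative only: batch ten integers into lists of four elements,
# runnable locally with the DirectRunner.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(list(range(10)))
        | BatchElements(min_batch_size=4, max_batch_size=4)
        | beam.Map(print)  # prints lists such as [0, 1, 2, 3]
    )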

Beam and other big data systems try to perform optimizations to improve the execution of your pipelines. The simplest optimization is called 'fusion': consecutive transforms are fused into a single stage and run together on the same worker, so elements pass from one transform to the next without ever being redistributed across the cluster.

In this case, the issue is that your Create, BatchElements, and ReadGCS transforms are all fused together and execute on the same worker, so the work never gets a chance to be distributed to multiple workers.

My suggestion is to reshuffle your data, which breaks the fusion and allows the elements to be redistributed across workers. Like so:

xml = (
        p1
        | "Get xml name" >> beam.Create(gz_names)
        # Reshuffle breaks fusion and redistributes the data to other workers.
        | "Reshuffle" >> beam.Reshuffle()
        | "Open file" >> beam.ParDo(ReadGCS())
        | "Create XML" >> beam.ParDo(CreateXML())
        | "Parse Json" >> beam.ParDo(JsonParser())
        | "Remove elements" >> beam.ParDo(Filtering())
        | "Build Json" >> beam.ParDo(JsonBuilder())
        | "Write elements" >> beam.io.WriteToText(file_path_prefix="output12", file_name_suffix=".json")
)

p1.run()

Once you've done that, your pipeline should be able to make better progress.

Another tip to make your pipeline more performant: consider using the transforms under apache_beam.io.fileio, such as MatchFiles and ReadMatches. They match file names and read the matched files, and they are standard in Beam.
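As a rough sketch of that approach (the bucket path is a placeholder, and read_utf8 assumes the files decompress to text; adjust for your gzipped input):

import apache_beam as beam
from apache_beam.io import fileio

# Match the files on GCS, open each match, and read its contents.
with beam.Pipeline() as p:
    contents = (
        p
        | "Match files" >> fileio.MatchFiles("gs://my-bucket/*.gz")
        | "Read matches" >> fileio.ReadMatches()
        | "Read contents" >> beam.Map(lambda f: f.read_utf8())
    )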

If you'd like to dive deeper into these runner optimizations, check out the FlumeJava paper.
