Apache Beam - ReadFromText safely (pass over errors)

Question

I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files.

import apache_beam as beam

p1 = beam.Pipeline()

(p1
        | 'read' >> beam.io.ReadFromText('bad_file.bz2')
        | 'write' >> beam.io.WriteToText('file_out.txt')
 )

p1.run()

The problem arises when the pipeline encounters a bad file. Most of my bad files are malformed, not actually in bz2 format, or simply empty, which confuses the decompressor and causes an OSError: Invalid data stream.

How can I tell ReadFromText to skip these files?

Answer 1

You may want to filter your files first and then use apache_beam.io.textio.ReadAllFromText.

For example:

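import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    lines = (p
     | fileio.MatchFiles("/path/to/*.bz2")          # emits FileMetadata objects
     | beam.Filter(lambda m: is_valid_bz2(m.path))  # is_valid_bz2 is user-defined
     | beam.Map(lambda m: m.path)                   # ReadAllFromText expects path strings
     | beam.io.ReadAllFromText())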

is_valid_bz2 is a function you write yourself. It should probably use Beam's filesystems utilities (apache_beam.io.filesystems) so that it can read from all supported filesystems, not just local disk.
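
The answer leaves is_valid_bz2 undefined, so here is one possible sketch of it. It uses only the standard-library bz2 module and decompresses the whole file to see whether an error is raised. The open_fn and chunk_size parameters are assumptions added for illustration and are not part of any Beam API. By default the function only opens local files.

```python
import bz2


def is_valid_bz2(path, open_fn=None, chunk_size=1 << 20):
    """Return True if the file at `path` can be fully decompressed as bz2.

    `open_fn` is a hypothetical hook: any callable that takes a path and
    returns a binary file object yielding the raw (still compressed) bytes.
    It defaults to the built-in open(), which only handles local files.
    """
    if open_fn is None:
        open_fn = lambda p: open(p, "rb")
    try:
        with open_fn(path) as raw, bz2.BZ2File(raw) as stream:
            # Decompress chunk by chunk and discard the output;
            # only the presence or absence of an error matters.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError):
        # OSError: invalid bz2 data (or the file could not be opened).
        # EOFError: empty or truncated file.
        return False
    return True
```

This reads every file twice (once to validate, once in ReadAllFromText), which may be costly for large inputs. To validate files on non-local filesystems, open_fn could presumably wrap apache_beam.io.filesystems.FileSystems.open with compression_type=CompressionTypes.UNCOMPRESSED so that Beam returns the raw bytes; that combination is untested here.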
