Apache Beam - ReadFromText safely (pass over errors)

I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files.

import apache_beam as beam

p1 = beam.Pipeline()

(p1
        | 'read' >> beam.io.ReadFromText('bad_file.bz2')
        | 'write' >> beam.io.WriteToText('file_out.txt')
 )

p1.run()

The problem arises when the pipeline encounters a bad file. Most of my bad files are malformed, not in bz2 format, or simply empty, which confuses the decompressor and raises OSError: Invalid data stream.

How can I tell ReadFromText to skip these files?

You may want to filter your files first and then read the survivors with apache_beam.io.textio.ReadAllFromText.

For example:

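import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    lines = (p
     | fileio.MatchFiles("/path/to/*.bz2")
     | beam.Filter(lambda m: is_valid_bz2(m.path))
     # ReadAllFromText expects file paths, not FileMetadata objects.
     | beam.Map(lambda m: m.path)
     | beam.io.ReadAllFromText())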

Your is_valid_bz2 should probably open files through Beam's filesystem utilities (apache_beam.io.filesystems.FileSystems) so that it can read from all supported filesystems.
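
The answer leaves is_valid_bz2 undefined, so here is one possible sketch of it using only the standard library. It assumes "valid" means the file contains at least one complete bz2 stream and nothing else; the opener parameter and the chunk size are my own additions for illustration and are not part of Beam.

```python
import bz2


def is_valid_bz2(path, opener=None, chunk_size=64 * 1024):
    """Return True if `path` holds one or more complete bz2 streams.

    `opener` is a hypothetical hook: any callable that takes a path and
    returns a binary file object yielding the raw (still compressed) bytes.
    It defaults to the built-in open() for local files.
    """
    if opener is None:
        opener = lambda p: open(p, "rb")
    try:
        with opener(path) as raw:
            decompressor = bz2.BZ2Decompressor()
            streams_done = 0      # complete bz2 streams seen so far
            stream_open = False   # True while partway through a stream
            while True:
                chunk = raw.read(chunk_size)
                if not chunk:
                    break
                while chunk:
                    # Raises OSError on data that is not valid bz2.
                    decompressor.decompress(chunk)
                    if decompressor.eof:
                        # Stream finished; any leftover bytes must start
                        # a new stream, so feed them to a fresh decompressor.
                        streams_done += 1
                        stream_open = False
                        chunk = decompressor.unused_data
                        decompressor = bz2.BZ2Decompressor()
                    else:
                        stream_open = True
                        chunk = b""
            # Empty files have no streams; truncated files end mid-stream.
            return streams_done > 0 and not stream_open
    except (OSError, ValueError, EOFError):
        return False
```

To read from non-local filesystems, one option is to pass an opener built on FileSystems.open with compression_type=CompressionTypes.UNCOMPRESSED, so that the function sees the raw bytes instead of Beam's own decompressed view; I have not verified that every filesystem's file object works as a context manager. Note that this check decompresses each file in full, so every good file ends up being read twice.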
