I have a simple Apache Beam pipeline which reads compressed bz2 files and writes them out to text files.
import apache_beam as beam

p1 = beam.Pipeline()
(p1
 | 'read' >> beam.io.ReadFromText('bad_file.bz2')
 | 'write' >> beam.io.WriteToText('file_out.txt')
)
p1.run()
The problem arises when the pipeline encounters a bad file (example). Most of my bad files are malformed, not in bz2 format, or simply empty, which confuses the decompressor and causes an OSError: Invalid data stream.
How can I tell ReadFromText to pass on these?
You may want to filter your files first and then use apache_beam.io.textio.ReadAllFromText.
For example:
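import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    lines = (p
             | fileio.MatchFiles("/path/to/*.bz2")
             | beam.Filter(lambda m: is_valid_bz2(m.path))  # drop unreadable files
             | beam.Map(lambda m: m.path)  # ReadAllFromText takes file paths
             | beam.io.textio.ReadAllFromText())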
Here is_valid_bz2 is a function you write yourself. It may want to use the filesystems utilities (apache_beam.io.filesystems) so that it can read from all supported filesystems.
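One possible shape for is_valid_bz2 is sketched below. It is not part of Beam; it is a minimal helper that uses only the standard-library bz2 module and decompresses the whole file once to see whether that succeeds. The open_raw parameter is a hypothetical hook added for this sketch: it defaults to the local open(), and is there so a different opener can be swapped in for non-local paths.

```python
import bz2


def is_valid_bz2(path, open_raw=None, chunk_size=1 << 20):
    """Return True if `path` can be fully decompressed as bz2.

    `open_raw` is a hypothetical hook: a callable that takes a path and
    returns a binary file object with the raw (still compressed) bytes.
    It defaults to the local filesystem.
    """
    opener = open_raw or (lambda p: open(p, "rb"))
    try:
        with opener(path) as raw, bz2.BZ2File(raw) as stream:
            # Read in chunks so large files are never held in memory at once.
            while stream.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        # OSError: data is not bz2 ("Invalid data stream") or cannot be read.
        # EOFError: file is empty or ends before the end-of-stream marker.
        return False
```

To read from any filesystem Beam supports, open_raw could be something like lambda p: FileSystems.open(p, compression_type=CompressionTypes.UNCOMPRESSED), with FileSystems imported from apache_beam.io.filesystems and CompressionTypes from apache_beam.io.filesystem. UNCOMPRESSED matters there, because the default AUTO would already decompress a .bz2 path before the bytes reach bz2.BZ2File. Note that this check reads every file twice (once to validate, once in ReadAllFromText), which is the cost of filtering up front.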