
How to read files as byte[] in Apache Beam?

We are currently working on a proof-of-concept Apache Beam pipeline on Cloud Dataflow. We put some files (not text; a custom binary format) into Google Cloud Storage buckets and would like to read these files as byte[] and deserialize them in the pipeline. However, we cannot find a Beam source that is able to read non-text files. The only idea we have is to extend the FileBasedSource class, but we believe there should be an easier solution, since this sounds like a pretty straightforward task.

Thanks guys for your help.

This is actually a generally useful feature, currently under review in pull request #3717.

I will answer generally anyhow, just to spread information.

The main purpose of FileBasedSource, and Beam's source abstractions in general, is to provide flexible splitting of a collection of files, viewed as one huge data set with one record per line.

If you have just one record per file, then you can read the files in a ParDo(DoFn) that maps file names to byte[]. You still get the main benefit of splitting, since splitting between elements is supported for any PCollection.
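A minimal sketch of that approach: the only non-trivial piece is draining an InputStream into a byte[], which is plain JDK code. The surrounding Beam DoFn (shown in the comment, since it needs the Beam SDK on the classpath) is a hypothetical illustration using Beam's FileSystems API; the class and method names here are my own, not from the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileBytes {

  // Drain an InputStream into a byte[]; this is the only non-Beam piece
  // the DoFn sketched below needs.
  public static byte[] readAllBytes(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }

  /*
   * Hypothetical Beam DoFn using the helper (requires the Beam Java SDK,
   * so it is shown as a sketch only):
   *
   * class ReadFileFn extends DoFn<String, byte[]> {
   *   @ProcessElement
   *   public void process(ProcessContext c) throws IOException {
   *     // Resolve the file name to metadata, open it as a channel,
   *     // and emit the whole file as one byte[] element.
   *     MatchResult.Metadata md = FileSystems.matchSingleFileSpec(c.element());
   *     try (InputStream in =
   *         Channels.newInputStream(FileSystems.open(md.resourceId()))) {
   *       c.output(readAllBytes(in));
   *     }
   *   }
   * }
   */
}
```

Since each file becomes exactly one element, your custom deserializer can then run in a downstream ParDo over the byte[] values.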

Because of how Dataflow optimizes, you may want a Reshuffle transform before your ParDo. This ensures that the parallelism of reading all the files is decoupled from the parallelism of whatever upstream transform injects the file names into the PCollection.
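The wiring could look like the following. This is a hypothetical pipeline fragment (it needs the Beam Java SDK and a runner, so it is not runnable as-is); the bucket paths and the ReadFileFn stub are placeholder names for a DoFn<String, byte[]> that opens each file via FileSystems and emits its bytes.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReadBinaryFilesPipeline {

  // Placeholder DoFn: open the named file with Beam's FileSystems API
  // and output its contents as a single byte[] element.
  static class ReadFileFn extends DoFn<String, byte[]> {
    @ProcessElement
    public void process(ProcessContext c) {
      // ... open via FileSystems.open(...) and c.output(bytes) ...
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical file names; in practice these might come from
    // FileIO.match or any upstream transform.
    PCollection<String> fileNames =
        pipeline.apply(Create.of("gs://my-bucket/a.bin", "gs://my-bucket/b.bin"));

    fileNames
        .apply(Reshuffle.viaRandomKey())    // break fusion so reads parallelize
        .apply(ParDo.of(new ReadFileFn())); // file name -> byte[]

    pipeline.run();
  }
}
```

The Reshuffle acts as a fusion break: without it, Dataflow may fuse the file-name-producing step with the reading step, so the reads inherit the (possibly low) parallelism of the upstream step.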
