
How to read files as byte[] in Apache Beam?

We are currently working on a proof-of-concept Apache Beam pipeline on Cloud Dataflow. We put some files (not text; a custom binary format) into Google Cloud Storage buckets and would like to read these files as byte[] and deserialize them in the pipeline. However, we cannot find a Beam source that is able to read non-text files. The only idea we have is to extend the FileBasedSource class, but we believe there should be an easier solution, since this sounds like a pretty straightforward task.

Thanks guys for your help.

This is actually a generally useful feature, currently under review in pull request #3717.

I will answer generally anyhow, just to spread information.

The main purpose of FileBasedSource, and Beam's source abstractions in general, is to provide flexible splitting of a collection of files, viewed as one huge data set with one record per line.

If you have just one record per file, then you can read the files in a ParDo(DoFn) that maps file names to byte[]. You still get the main benefit of splitting, since splitting between elements is supported for any PCollection.
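A minimal sketch of that approach: the only non-trivial piece is draining an InputStream into a byte[], which is plain JDK code. The surrounding Beam DoFn (shown in the comment, since it needs the Beam SDK on the classpath) is a hypothetical illustration using Beam's FileSystems API; the class and method names here are my own, not from the original answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FileBytes {

  // Drain an InputStream into a byte[]; this is the only non-Beam piece
  // the DoFn sketched below needs.
  public static byte[] readAllBytes(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }

  /*
   * Hypothetical Beam DoFn using the helper (requires the Beam Java SDK,
   * so it is shown as a sketch only):
   *
   * class ReadFileFn extends DoFn<String, byte[]> {
   *   @ProcessElement
   *   public void process(ProcessContext c) throws IOException {
   *     // Resolve the file name to metadata, open it as a channel,
   *     // and emit the whole file as one byte[] element.
   *     MatchResult.Metadata md = FileSystems.matchSingleFileSpec(c.element());
   *     try (InputStream in =
   *         Channels.newInputStream(FileSystems.open(md.resourceId()))) {
   *       c.output(readAllBytes(in));
   *     }
   *   }
   * }
   */
}
```

Since each file becomes exactly one element, your custom deserializer can then run in a downstream ParDo over the byte[] values.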

Because of how Dataflow optimizes, you may want a Reshuffle transform before your ParDo. This ensures that the parallelism of reading all the files is decoupled from the parallelism of whatever upstream transform injects the file names into the PCollection.
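The wiring could look like the following. This is a hypothetical pipeline fragment (it needs the Beam Java SDK and a runner, so it is not runnable as-is); the bucket paths and the ReadFileFn stub are placeholder names for a DoFn<String, byte[]> that opens each file via FileSystems and emits its bytes.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReadBinaryFilesPipeline {

  // Placeholder DoFn: open the named file with Beam's FileSystems API
  // and output its contents as a single byte[] element.
  static class ReadFileFn extends DoFn<String, byte[]> {
    @ProcessElement
    public void process(ProcessContext c) {
      // ... open via FileSystems.open(...) and c.output(bytes) ...
    }
  }

  public static void main(String[] args) {
    Pipeline pipeline =
        Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical file names; in practice these might come from
    // FileIO.match or any upstream transform.
    PCollection<String> fileNames =
        pipeline.apply(Create.of("gs://my-bucket/a.bin", "gs://my-bucket/b.bin"));

    fileNames
        .apply(Reshuffle.viaRandomKey())    // break fusion so reads parallelize
        .apply(ParDo.of(new ReadFileFn())); // file name -> byte[]

    pipeline.run();
  }
}
```

The Reshuffle acts as a fusion break: without it, Dataflow may fuse the file-name-producing step with the reading step, so the reads inherit the (possibly low) parallelism of the upstream step.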
