How to read files as byte[] in Apache Beam?

We are currently working on a proof-of-concept Apache Beam pipeline on Cloud Dataflow. We put some files (not text; a custom binary format) into Google Cloud Storage buckets and would like to read these files as byte[] and deserialize them in the flow. However, we cannot find a Beam source that is able to read non-text files. The only idea we have is to extend the FileBasedSource class, but we believe there should be an easier solution, since this sounds like a pretty straightforward task.

Thanks guys for your help.

This is actually a generally useful feature, currently under review in pull request #3717.

I will answer in general terms anyhow, just to spread the information.

The main purpose of FileBasedSource, and of Beam's source abstractions in general, is to provide flexible splitting of a collection of files, viewed as one huge data set with one record per line.

If you have just one record per file, then you can read the files in a ParDo(DoFn) that maps file names to byte[]. You already get the maximum benefit of splitting this way, since splitting between elements is supported for any PCollection.
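The per-element work of such a DoFn is just draining one file into a byte[]. Below is a minimal sketch of that step using only the JDK, assuming the file is available as a `ReadableByteChannel` (which is what Beam's `FileSystems.open` returns). The class name `ChannelBytes` and the method `readFully` are hypothetical names chosen for illustration, not part of Beam.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

// Hypothetical helper: the name and shape are assumptions, not Beam API.
final class ChannelBytes {

    /** Reads everything remaining in a blocking channel into a byte[] and closes the channel. */
    static byte[] readFully(ReadableByteChannel channel) throws IOException {
        try (ReadableByteChannel in = channel) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
            // read() returns -1 once the end of the channel is reached.
            while (in.read(buffer) != -1) {
                buffer.flip();
                out.write(buffer.array(), 0, buffer.limit());
                buffer.clear();
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a binary file: an in-memory channel over a few raw bytes.
        byte[] original = {0x00, (byte) 0xFF, 0x10, 0x7F};
        ReadableByteChannel channel =
                Channels.newChannel(new ByteArrayInputStream(original));
        byte[] copy = readFully(channel);
        System.out.println("read " + copy.length + " bytes");
    }
}
```

Inside a DoFn's `@ProcessElement` method, the channel would presumably come from something like `FileSystems.open(FileSystems.matchNewResource(fileName, false))`, with the resulting byte[] emitted as the output element. Since the whole file is held in memory, this only suits files that comfortably fit in a worker's heap.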

Because of how Dataflow optimizes, you may want a Reshuffle transform before your ParDo. This will ensure that the parallelism of reading all the files is decoupled from the parallelism of whatever upstream transform injects their names into the PCollection.
