
How to read files as byte[] in Apache Beam?

Original post: 2017-08-16 07:14:02 8 1 java/ google-cloud-platform/ google-cloud-dataflow/ apache-beam/ apache-beam-io

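We are currently working on a proof of concept for an Apache Beam pipeline on Cloud Dataflow. We put some files (not text; a custom binary format) into Google Cloud Storage buckets, and we would like to read each file as a byte[] and deserialize it in the pipeline. However, we cannot find a Beam source that can read non-text files. Our only idea so far is to extend the FileBasedSource class, but we believe there should be an easier solution, since this sounds like a fairly simple task.

Thanks for your help.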

1 Answer

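This is actually a generally useful feature, and it is currently under review in pull request #3717.

I'll answer in general terms anyway, just to spread the information.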

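The main purpose of FileBasedSource, and of Beam's source abstractions in general, is to provide flexible splitting of a collection of files, treating the collection as one huge data set with one record per line.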

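If you have just one record per file, you can read the files in a ParDo(DoFn) that maps a file name to a byte[]. Since splitting between elements is supported for any PCollection, you already get the maximum benefit of splitting.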
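The per-file work inside such a DoFn comes down to draining a byte channel into a byte[]. Below is a minimal sketch of that step using only the JDK. The class name `ChannelBytes` and the method `readFully` are hypothetical helpers, not part of Beam. In a real pipeline the channel would presumably come from Beam's `FileSystems.open(...)` rather than from an in-memory stream; that wiring is an assumption and is not spelled out in the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

/** Hypothetical helper (not a Beam class): drains a channel into a byte[]. */
class ChannelBytes {

    /** Reads the channel until end of stream and returns everything that was read. */
    static byte[] readFully(ReadableByteChannel channel) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteBuffer buffer = ByteBuffer.allocate(8192);
        // read() returns -1 once the end of the stream is reached.
        while (channel.read(buffer) != -1) {
            buffer.flip();                                 // switch from filling to draining
            out.write(buffer.array(), 0, buffer.limit());  // copy the bytes just read
            buffer.clear();                                // make the buffer writable again
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a binary file; a DoFn would open the real file instead.
        byte[] payload = {0x00, 0x01, (byte) 0xFF, 0x7F};
        try (ReadableByteChannel ch = Channels.newChannel(new ByteArrayInputStream(payload))) {
            byte[] copy = readFully(ch);
            System.out.println(copy.length + " bytes read");
        }
    }
}
```

Because the whole file ends up in memory as one element, this only suits files that comfortably fit in a worker's heap.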

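Because of how Dataflow optimizes pipelines, you may want a Reshuffle transform before your ParDo. This ensures that the parallelism of reading all the files is decoupled from the parallelism of whatever upstream transform injects their names into the PCollection.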
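Putting the pieces together, the pipeline might look roughly like the sketch below. It is one way to wire this up, not the code from the pull request. The class names `ReadFilesAsBytes` and `ReadFileFn` and the `gs://example-bucket/...` paths are hypothetical. It assumes the Beam Java SDK is on the classpath, plus the Google Cloud Platform IO module for `gs://` paths, and a Beam version that provides `Reshuffle.viaRandomKey()`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class ReadFilesAsBytes {

    /** Hypothetical DoFn: maps one file name to that file's full contents. */
    static class ReadFileFn extends DoFn<String, byte[]> {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            String filename = c.element();
            // Open the file through Beam's filesystem abstraction.
            try (InputStream in = Channels.newInputStream(
                    FileSystems.open(FileSystems.matchNewResource(filename, false)))) {
                c.output(in.readAllBytes()); // readAllBytes() requires Java 9+
            }
        }
    }

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        PCollection<byte[]> contents = p
            // Hypothetical file names; any upstream transform could produce them.
            .apply("FileNames", Create.of(
                "gs://example-bucket/data/file-0001.bin",
                "gs://example-bucket/data/file-0002.bin"))
            // Redistribute the names so reading is not tied to the upstream parallelism.
            .apply("Redistribute", Reshuffle.<String>viaRandomKey())
            .apply("ReadWholeFile", ParDo.of(new ReadFileFn()));

        // Deserialization of the custom binary format would follow here.
        p.run().waitUntilFinish();
    }
}
```

Placing the Reshuffle between the name-producing step and the reading ParDo is the design choice the answer describes: without it, the runner may fuse both steps and read every file with the parallelism of the step that produced the names.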


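Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.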
