
Apache Beam - Dataflow - Serialization & state sharing

In one of my pipeline's DoFns, I'm downloading binary files that need to be processed by another DoFn. Right now, once a binary file is downloaded, I also store it in GCS and output the file's location to my downstream DoFn. However, the upload to GCS takes quite a long time, and I'm not even sure I need it.

Is there a way to make my binary buffer available to the downstream DoFn without any serialization? I'd basically like to have the workers on the same machine and share data through RAM. Is that possible?

If not, am I wrong to use GCS for sharing data between DoFns? Can we use the file system directly?

The best practice here is to pass the data directly as a byte array value. The framework should correctly handle passing the buffer in memory between fused stages that do not contain an intervening GroupByKey.
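As a rough illustration of that advice, here is a minimal sketch using the Beam Python SDK in which the download step emits the bytes themselves rather than a GCS path. The `download` helper and the URLs are hypothetical placeholders, not part of the original pipeline, and whether the two steps actually end up fused is decided by the runner.

```python
import hashlib


def download(url):
    """Hypothetical stand-in for the real download; returns the file as bytes."""
    # Assumption: replace this with an actual HTTP or GCS client call.
    return ("contents of " + url).encode("utf-8")


def fetch(url):
    # Emit the raw buffer itself (paired with its URL), not a GCS location.
    yield url, download(url)


def process(element):
    url, payload = element  # payload arrives as a plain bytes object
    return url, len(payload), hashlib.sha256(payload).hexdigest()


def run(urls):
    # Imported lazily so the helpers above can be exercised without Beam.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Urls" >> beam.Create(urls)
            | "Download" >> beam.FlatMap(fetch)
            # No GroupByKey in between, so a runner is free to fuse both steps.
            | "Process" >> beam.Map(process)
            | "Print" >> beam.Map(print)
        )
```

Calling `run(["gs://example-bucket/a.bin"])` would execute this on the default local runner. The same shape works with explicit `beam.DoFn` subclasses and `beam.ParDo`; the point is only that the element passed downstream is the byte buffer, with no intermediate upload.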

