
Collecting Apache Beam PCollection objects into the driver's memory

Is it possible to collect the objects within a PCollection in Apache Beam into the driver's memory? Something like:

PCollection<String> distributedWords = ...
List<String> localWords = distributedWords.collect();

I borrowed the collect() method here from Apache Spark, but I was wondering whether Apache Beam has similar functionality.

Not directly. The pipeline can write its output to a sink (e.g. a GCS bucket or a BigQuery table) and, if needed, signal progress to the driver program via something like Pub/Sub. The driver program then reads the saved data back from that shared location. This approach works with all Beam runners.
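As a rough illustration of the driver-side half of this approach, here is a minimal sketch that assumes the pipeline wrote plain-text shards to a local directory (for example with TextIO.write(), which by default names files like prefix-00000-of-00003) and that the driver has already waited for completion with pipeline.run().waitUntilFinish(). The class name ShardCollector, the method collect and the glob pattern are made up for this example. For a GCS bucket or a BigQuery table you would use the corresponding client library instead of java.nio.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical driver-side helper that gathers sharded text output into memory. */
final class ShardCollector {
    private ShardCollector() {}

    /**
     * Reads every file in outputDir whose name matches the glob and returns all lines.
     * Assumption: the pipeline has already finished writing before this is called.
     */
    static List<String> collect(Path outputDir, String glob) throws IOException {
        List<Path> shards = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, glob)) {
            for (Path shard : stream) {
                shards.add(shard);
            }
        }
        // Directory listing order is unspecified, so sort the paths for a stable result.
        Collections.sort(shards);

        List<String> lines = new ArrayList<>();
        for (Path shard : shards) {
            lines.addAll(Files.readAllLines(shard, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

The driver would then call something like ShardCollector.collect(Paths.get("/tmp/out"), "words-*-of-*.txt"). Keep in mind that a PCollection has no inherent element order, so the order of the returned list only reflects how the elements happened to be sharded.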

There may be other workarounds for specific cases. For example, DirectRunner is a local, in-memory execution engine that runs your pipeline in-process, in a sequential manner. It is used mostly for testing, but if it fits your use case you can leverage it, e.g. by storing the processed data in shared in-memory storage that both the driver program and the pipeline execution logic can access (see TestTable for an example). This won't work with other runners.
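To make the shared in-memory storage idea concrete, here is a minimal sketch. It is not how TestTable is implemented; the class InMemoryResults and its methods are hypothetical. The store is static and keyed by a string because a DoFn instance is serialized before execution, so an ordinary instance field would not be shared with the driver. It relies entirely on the assumption that the pipeline and the driver run in the same JVM, which is only the case with DirectRunner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical JVM-wide store; only meaningful when pipeline and driver share one process. */
final class InMemoryResults {
    // Static so that pipeline code and the driver see the same map within one JVM.
    private static final Map<String, ConcurrentLinkedQueue<String>> STORE =
            new ConcurrentHashMap<>();

    private InMemoryResults() {}

    /** Intended to be called from pipeline code, e.g. inside a DoFn's @ProcessElement method. */
    static void add(String key, String element) {
        STORE.computeIfAbsent(key, k -> new ConcurrentLinkedQueue<>()).add(element);
    }

    /** Intended to be called from the driver once the pipeline has finished; returns a copy. */
    static List<String> snapshot(String key) {
        ConcurrentLinkedQueue<String> queue = STORE.get(key);
        return queue == null ? new ArrayList<>() : new ArrayList<>(queue);
    }

    /** Removes the results for one key so repeated runs do not accumulate data. */
    static void clear(String key) {
        STORE.remove(key);
    }
}
```

In a pipeline, a DoFn would call InMemoryResults.add("run-1", element) for each element, and the driver would call InMemoryResults.snapshot("run-1") after pipeline.run().waitUntilFinish(). On a distributed runner the workers are separate processes, so the driver's copy of the map would simply stay empty.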

In general, pipeline execution can happen in parallel, and the specifics of how it happens are controlled by the runner (e.g. Flink, Dataflow or Spark). A Beam pipeline is just a definition of the transformations you're applying to your data, plus the data sources and sinks. Your driver program doesn't read or collect data itself and doesn't communicate with the execution nodes directly. It basically only sends the pipeline definition to the runner, which then decides how to execute it, potentially spreading it across a fleet of machines (or using other execution primitives to run it). Each execution node can then independently process the data by extracting it from the input source, transforming it and writing it to the output. The node in general doesn't know about the driver program; it only knows how to execute the pipeline definition. Execution environments / runners can be very different, and there's no requirement at the moment for runners to implement such a collection mechanism. See https://beam.apache.org/documentation/execution-model/
