
Collecting Apache Beam PCollection objects into the driver's memory

Is it possible to collect the objects in an Apache Beam PCollection into the driver's memory? Something like:

PCollection<String> distributedWords = ...
List<String> localWords = distributedWords.collect();

I borrowed the collect() method here from Apache Spark. Does Apache Beam offer similar functionality?

Not directly. The pipeline can write its output to a sink (e.g. a GCS bucket or a BigQuery table) and, if needed, signal progress to the driver program through something like Pub/Sub. The driver program then reads the saved data back from that shared location. This approach works with all Beam runners.
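
As a rough illustration of the driver side of that approach, here is a minimal sketch. It assumes the pipeline wrote plain-text shards to a directory on a local filesystem (for example through a text sink such as TextIO) and that the shard files are named like words-00000-of-00002.txt. The class name ShardCollector, the method collect and the file prefix are hypothetical names chosen for this example, and reading from GCS or BigQuery would need the corresponding client library instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Driver-side helper (hypothetical name): loads sharded text output into memory. */
final class ShardCollector {

    /**
     * Reads every file in outputDir whose name starts with filePrefix + "-"
     * and returns all of their lines as one list, shard by shard in
     * file-name order.
     */
    static List<String> collect(Path outputDir, String filePrefix) throws IOException {
        List<Path> shards = new ArrayList<>();
        // Assumed naming: <prefix>-<shard>-of-<total><suffix>, e.g. words-00000-of-00002.txt
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(outputDir, filePrefix + "-*")) {
            for (Path shard : stream) {
                shards.add(shard);
            }
        }
        Collections.sort(shards); // directory listing order is unspecified

        List<String> lines = new ArrayList<>();
        for (Path shard : shards) {
            lines.addAll(Files.readAllLines(shard, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Run this only after the pipeline has finished writing, e.g. once
        // pipeline.run().waitUntilFinish() has returned in the driver.
        List<String> localWords = collect(Paths.get(args[0]), args[1]);
        System.out.println(localWords.size() + " words collected");
    }
}
```

The result only fits in memory if the output is small, which is the same caveat that applies to collect() in Spark.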

There may be other workarounds for specific cases. For example, DirectRunner is a local in-memory execution engine that runs your pipeline in-process on your own machine. It is used mostly for testing, and if it fits your use case you can leverage it, e.g. by storing the processed data in shared in-memory storage that both the driver program and the pipeline execution logic can access; see TestTable for an example. This won't work with other runners.
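
The shared-storage idea can be sketched in plain Java, without any Beam classes. Runners generally serialize user code before executing it, so the pipeline works on a copy of any object the driver created, and an ordinary instance field would not be shared. A static map keyed by an ID is not part of the serialized state, so within a single JVM both the original and the copy see the same data. The class InMemorySink and its methods are hypothetical names for this example. This is a sketch of the general pattern, not a description of how TestTable is implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Hypothetical in-process sink: a serializable handle onto a static, per-JVM store. */
final class InMemorySink implements Serializable {
    private static final long serialVersionUID = 1L;

    // Static state is never serialized; it exists once per JVM.
    private static final Map<String, Queue<String>> STORE = new ConcurrentHashMap<>();

    // The ID is the only state that travels with a serialized handle.
    private final String id;

    InMemorySink(String id) {
        this.id = id;
    }

    /** Meant to be called from pipeline code, e.g. from a DoFn processing an element. */
    void add(String element) {
        STORE.computeIfAbsent(id, k -> new ConcurrentLinkedQueue<>()).add(element);
    }

    /** Meant to be called from the driver once the pipeline has finished. */
    List<String> snapshot() {
        Queue<String> queue = STORE.get(id);
        return queue == null ? new ArrayList<>() : new ArrayList<>(queue);
    }

    /** Stands in for the serialization a runner applies to user code. */
    static InMemorySink roundTrip(InMemorySink sink) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(sink);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (InMemorySink) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        InMemorySink driverSide = new InMemorySink("words");
        InMemorySink pipelineSide = roundTrip(driverSide); // a distinct object
        pipelineSide.add("alpha");
        pipelineSide.add("beta");
        System.out.println(driverSide.snapshot()); // prints [alpha, beta]
    }
}
```

On a distributed runner the copy would be deserialized in a different JVM with its own empty static map, which is why this trick is limited to in-process execution.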

In general, pipeline execution can happen in parallel, and the specifics of how it happens are controlled by the runner (e.g. Flink, Dataflow or Spark). A Beam pipeline is just a definition of the transformations you are applying to your data, plus the data sources and sinks. Your driver program doesn't read or collect data itself and doesn't communicate with the execution nodes directly. It basically only sends the pipeline definition to the runner, which then decides how to execute it, potentially spreading the work across a fleet of machines (or using other execution primitives to run it).

Each execution node can then process the data independently by extracting it from the input source, transforming it and writing it to the output. The node in general doesn't know about the driver program; it only knows how to execute the pipeline definition. Execution environments and runners can be very different, and at the moment runners are not required to implement such a collection mechanism. See https://beam.apache.org/documentation/execution-model/

