How to write a Beam condition based on the size of a PCollection

I have a PCollection that contains a number of MyResult objects.

   PCollection<MyResult> myResultCollection = ....

I would like to check this PCollection and, if it is empty, insert a dummy MyResult object into it.

I know Count.globally() can be used to count the elements of this PCollection. It returns a PCollection containing a single Long value.

However, I have no idea how to extract the Long value from that PCollection (it is probably not allowed) so that I can do something like this:

 // Pseudo-code
 
 PCollection<MyResult> myResultCollection = ....
 PCollection<Long> sizeCollection = myResultCollection.apply(Count.globally());
 
 Long size = sizeCollection.getValue(); // I know this method does not exist

 if(size == 0) {
     myResultCollection.add(new MyResult());
 }

 return myResultCollection;

EDIT:

I tried to implement the idea @Louis suggested, as below:

public class MyDummyGeneration extends SimpleFunction<Long, MyClass> { 
    public MyClass apply(final Long resultCount) {
       if(resultCount == 0) {
            return MyUtils.createDummyMyClass();
       } else {
            return null;    // This caused exception
       }
    }
}


public class MyClassPostProcessingTransform extends PTransform<PCollection<MyClass>, PCollection<MyClass>> {
     public PCollection<MyClass> expand(final PCollection<MyClass> input) {
         var count = input.apply(Count.globally());
         var dummyPCollection = count.apply(MapElements.via(new MyDummyGeneration()));
         var collections = PCollectionList.of(input).and(dummyPCollection);
         return collections.apply(Flatten.pCollections());
     }    
}

The return null; caused an exception, because returning null is not allowed. I don't know how to express the logic that, if the count is not zero, I don't want that PCollection to contain any element.

One big-picture thing I want to clarify: when you write a Beam pipeline, all computation is deferred. This is why sizeCollection.getValue() does not exist: it would imply synchronization between the main program that launches the pipeline and the running pipeline.
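Since the condition cannot be evaluated in the launching program, it has to live inside the pipeline. The following is a minimal sketch of one way to do that, not a definitive implementation. It assumes the default global window (where Count.globally() emits a single 0 for an empty input) and uses String elements in place of MyResult; the names AddDummyIfEmpty and EmitDummyIfZero are hypothetical. Unlike a function passed to MapElements, which must return exactly one value, a DoFn may emit nothing, which avoids returning null.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

/** Hypothetical composite transform: adds one dummy element when the input is empty. */
public class AddDummyIfEmpty extends PTransform<PCollection<String>, PCollection<String>> {

  /** Emits the dummy only when the global count is zero; otherwise emits nothing. */
  static class EmitDummyIfZero extends DoFn<Long, String> {
    private final String dummy;

    EmitDummyIfZero(String dummy) {
      this.dummy = dummy;
    }

    @ProcessElement
    public void processElement(@Element Long count, OutputReceiver<String> out) {
      if (count == 0) {
        out.output(dummy);
      }
      // Non-zero count: output nothing, so the resulting PCollection stays empty.
    }
  }

  private final String dummy;

  public AddDummyIfEmpty(String dummy) {
    this.dummy = dummy;
  }

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Assumption: global window, so Count.globally() yields exactly one Long (0 if empty).
    PCollection<String> maybeDummy =
        input
            .apply("CountAll", Count.<String>globally())
            .apply("DummyIfZero", ParDo.of(new EmitDummyIfZero(dummy)))
            .setCoder(input.getCoder());

    // Merge the original elements with the zero-or-one dummy element.
    return PCollectionList.of(input).and(maybeDummy).apply(Flatten.pCollections());
  }
}
```

With a non-global windowing strategy this sketch would need rethinking, because a global combine with defaults is only supported in the global window.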

The second thing is that we should start from your end-to-end need in order to understand how best to do this. Where does the data in the possibly empty PCollection come from? What are you going to do downstream of it?

A couple of examples:

  • if you are going to do an aggregation downstream, you can unconditionally insert a dummy element that is ignored in any nonempty aggregation
  • if you already have an aggregation upstream, you could view the result as a side input with a default value
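The second option can be sketched as follows. This is only an illustration under stated assumptions: the upstream aggregate is a PCollection<Long> holding at most one element in the global window, the main input is a PCollection<String>, and the names SideInputWithDefault, TagWithAggregate and tag are hypothetical. View.asSingleton().withDefaultValue(...) supplies the fallback when the aggregate is empty.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

/** Hypothetical helper: reads an aggregate as a singleton side input with a default. */
public class SideInputWithDefault {

  /** Appends the side-input value (or the default, if the aggregate was empty) to each element. */
  public static class TagWithAggregate extends DoFn<String, String> {
    private final PCollectionView<Long> aggregateView;

    public TagWithAggregate(PCollectionView<Long> aggregateView) {
      this.aggregateView = aggregateView;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      Long aggregate = c.sideInput(aggregateView);
      c.output(c.element() + ":" + aggregate);
    }
  }

  public static PCollection<String> tag(
      PCollection<String> main, PCollection<Long> aggregate, long defaultValue) {
    // An empty aggregate resolves to defaultValue instead of failing at read time.
    PCollectionView<Long> view =
        aggregate.apply(View.<Long>asSingleton().withDefaultValue(defaultValue));

    return main.apply(ParDo.of(new TagWithAggregate(view)).withSideInputs(view));
  }
}
```

Here the emptiness check disappears entirely: downstream code always sees a value, and the default plays the role of the dummy.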
