How to write a Beam condition based on the size of a PCollection
I have a PCollection that contains a number of MyResult objects.
PCollection<MyResult> myResultCollection = ....
I would like to check this PCollection and, if it is empty, insert a dummy MyResult object into it.
I know Count.globally() can be used to count the elements of this PCollection. It returns a PCollection containing a single Long value.
However, I have no idea how to extract the long value from that PCollection (probably not allowed) so that I can do something like this:
// Pseudo-code
PCollection<MyResult> myResultCollection = ....
PCollection<Long> sizeCollection = myResultCollection.apply(Count.globally());
Long size = sizeCollection.getValue(); // I know this method does not exist
if (size == 0) {
    myResultCollection.add(new MyResult());
}
return myResultCollection;
EDIT:
I tried to implement the idea @Louis suggested, as below:
public class MyDummyGeneration extends SimpleFunction<Long, MyClass> {
    @Override
    public MyClass apply(final Long resultCount) {
        if (resultCount == 0) {
            return MyUtils.createDummyMyClass();
        } else {
            return null; // This caused an exception
        }
    }
}
public class MyClassPostProcessingTransform extends PTransform<PCollection<MyClass>, PCollection<MyClass>> {
    @Override
    public PCollection<MyClass> expand(final PCollection<MyClass> input) {
        var count = input.apply(Count.globally());
        var dummyPCollection = count.apply(MapElements.via(new MyDummyGeneration()));
        var collections = PCollectionList.of(input).and(dummyPCollection);
        return collections.apply(Flatten.pCollections());
    }
}
The return null; statement caused an exception because returning null from the mapping function is not allowed. I don't know how to express the logic that, if the count is not zero, the dummy PCollection should not contain any element.
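One way to express "emit nothing" is to replace MapElements, which must return exactly one value per input, with a DoFn that outputs either zero or one element. The following is a minimal sketch under stated assumptions, not a definitive implementation: the MyClass stub, the "dummy" label, and the names AddDummyIfEmpty, needsDummy, and EmitDummyIfZero are hypothetical stand-ins for the classes in the question. It also assumes the input is in the global window, where Count.globally() emits 0 for an empty input.

```java
import java.io.Serializable;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class AddDummyIfEmpty
    extends PTransform<PCollection<AddDummyIfEmpty.MyClass>, PCollection<AddDummyIfEmpty.MyClass>> {

  /** Hypothetical stand-in for the question's MyClass. */
  public static class MyClass implements Serializable {
    public final String label;

    public MyClass(String label) {
      this.label = label;
    }
  }

  /** Pure decision logic: a dummy is needed only when the count is zero. */
  public static boolean needsDummy(long count) {
    return count == 0L;
  }

  /** Emits one dummy element when the count is zero, and nothing otherwise. */
  static class EmitDummyIfZero extends DoFn<Long, MyClass> {
    @ProcessElement
    public void processElement(@Element Long count, OutputReceiver<MyClass> out) {
      if (needsDummy(count)) {
        out.output(new MyClass("dummy")); // assumed dummy value
      }
      // Not calling out.output(...) leaves the output empty; no null is needed.
    }
  }

  @Override
  public PCollection<MyClass> expand(PCollection<MyClass> input) {
    // Holds zero or one element, depending on the global count.
    PCollection<MyClass> dummy =
        input
            .apply("CountAll", Count.<MyClass>globally())
            .apply("DummyIfZero", ParDo.of(new EmitDummyIfZero()))
            .setCoder(input.getCoder());

    // Merge the original elements with the (possibly empty) dummy collection.
    return PCollectionList.of(input).and(dummy).apply("Merge", Flatten.pCollections());
  }
}
```

Applied as myCollection.apply(new AddDummyIfEmpty()), the result contains the original elements, plus the dummy only when the input was empty. For inputs that are not in the global window, the count step would likely need different handling.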
One big-picture thing I want to clarify: when you write a Beam pipeline, all computation is deferred. This is why sizeCollection.getValue() does not exist: it would imply synchronization between the main program launching the pipeline and the running pipeline.
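To make the deferral concrete, here is a small sketch (the class name DeferredCountSketch, the method addCountStep, and the sample strings are assumptions for illustration). Applying Count.globally() only adds a step to the pipeline graph; the returned PCollection is a handle to a future result, and elements are processed only once the pipeline is run.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class DeferredCountSketch {

  /** Adds a count step to the pipeline graph; no elements are processed here. */
  public static PCollection<Long> addCountStep(Pipeline pipeline) {
    PCollection<String> words = pipeline.apply("MakeWords", Create.of("a", "b", "c"));
    return words.apply("CountWords", Count.<String>globally());
  }

  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();

    // `size` describes a result that does not exist yet, so there is
    // no value to read from it in the launching program.
    PCollection<Long> size = addCountStep(pipeline);

    // Only now does a runner execute the graph (requires a runner on the classpath).
    pipeline.run().waitUntilFinish();
  }
}
```

Any logic that depends on the count therefore has to be written as another transform applied to the counted PCollection, rather than as an if statement in the launching program.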
The second thing is that we should start from your end-to-end need in order to understand how best to do this. Where does the data in the possibly empty PCollection come from? What are you going to do downstream of it?
A couple of examples: