I have a PCollection that contains a number of MyResult objects.
PCollection<MyResult> myResultCollection = ....
I would like to check this PCollection and, if it is empty, insert a dummy MyResult object into it.
I know Count.globally() can be used to count the elements in this PCollection. It returns a PCollection containing a single Long value.
However, I have no idea how to extract the long value from the PCollection (probably not allowed) so that I can do something like this:
// Pseudo-code
PCollection<MyResult> myResultCollection = ....
PCollection<Long> sizeCollection = myResultCollection.apply(Count.globally());
Long size = sizeCollection.getValue(); // I know this method does not exist
if (size == 0) {
    myResultCollection.add(new MyResult());
}
return myResultCollection;
EDIT:
I tried to implement the idea @Louis suggested as below:
public class MyDummyGeneration extends SimpleFunction<Long, MyClass> {
    public MyClass apply(final Long resultCount) {
        if (resultCount == 0) {
            return MyUtils.createDummyMyClass();
        } else {
            return null; // This caused an exception
        }
    }
}
public class MyClassPostProcessingTransform extends PTransform<PCollection<MyClass>, PCollection<MyClass>> {
    public PCollection<MyClass> expand(final PCollection<MyClass> input) {
        var count = input.apply(Count.globally());
        var dummyPCollection = count.apply(MapElements.via(new MyDummyGeneration()));
        var collections = PCollectionList.of(input).and(dummyPCollection);
        return collections.apply(Flatten.pCollections());
    }
}
The return null; caused an exception because it is not allowed. I don't know how to express the logic that, when the count is not zero, the resulting PCollection should not contain any element.
One big-picture thing I want to clarify: when you write a Beam pipeline, all computation is deferred. This is why sizeCollection.getValue() does not exist: it would imply synchronization between the main program launching the pipeline and the running pipeline.
The second thing is that we should start at your end-to-end need in order to understand how to do it best. Where does the data come from in the PCollection that may or may not be empty? What are you going to do downstream of it?
A couple examples:
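One possibility, as a minimal sketch rather than a definitive answer: keep the emptiness check inside the pipeline. Count the elements, turn the count into either one dummy element or no elements, and flatten that with the original input. A DoFn is allowed to emit zero elements, which sidesteps the return null problem in MapElements. The names AddDummyIfEmpty, EmitDummyIfZero and dummiesFor are hypothetical, and String stands in for MyResult so that the sketch is self-contained. It also assumes the input is in the default global window, where Count.globally() emits 0 for an empty PCollection.

```java
import java.util.Collections;
import java.util.List;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Hypothetical transform: String is a stand-in for MyResult.
class AddDummyIfEmpty extends PTransform<PCollection<String>, PCollection<String>> {
  private final String dummy;

  AddDummyIfEmpty(String dummy) {
    this.dummy = dummy;
  }

  // Decision logic: one dummy when the count is zero, otherwise nothing.
  static List<String> dummiesFor(long count, String dummy) {
    return count == 0
        ? Collections.singletonList(dummy)
        : Collections.<String>emptyList();
  }

  // A DoFn may emit zero elements, so no null ever has to be returned.
  static class EmitDummyIfZero extends DoFn<Long, String> {
    private final String dummy;

    EmitDummyIfZero(String dummy) {
      this.dummy = dummy;
    }

    @ProcessElement
    public void processElement(@Element Long count, OutputReceiver<String> out) {
      for (String d : dummiesFor(count, dummy)) {
        out.output(d);
      }
    }
  }

  @Override
  public PCollection<String> expand(PCollection<String> input) {
    // Assumes the default global window, where the count of an empty input is 0.
    PCollection<String> dummyIfEmpty =
        input
            .apply("CountElements", Count.<String>globally())
            .apply("EmitDummyIfZero", ParDo.of(new EmitDummyIfZero(dummy)));
    // The result is the original elements plus the dummy, if one was emitted.
    return PCollectionList.of(input).and(dummyIfEmpty).apply(Flatten.pCollections());
  }
}
```

It would be applied like any other transform, for example myCollection.apply(new AddDummyIfEmpty("dummy")). Whether this is the best fit still depends on where the data comes from and what consumes it downstream.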