
How to write a Beam condition based on the size of a PCollection

I have a PCollection that contains a number of MyResult objects.

   PCollection<MyResult> myResultCollection = ....

I would like to check this PCollection and, if it is empty, insert a dummy MyResult object into it.

I know Count.globally() can be used to count the size of this PCollection. It returns a PCollection containing a single Long value.

However, I have no idea how to extract the long value from the PCollection (probably not allowed) so that I can do something like this:

 // Pseudo-code
 
 PCollection<MyResult> myResultCollection = ....
 PCollection<Long> sizeCollection = myResultCollection.apply(Count.globally());
 
 Long size = sizeCollection.getValue(); // I know this method does not exist

 if(size == 0) {
     myResultCollection.add(new MyResult());
 }

 return myResultCollection;

EDIT:

I tried to implement the idea @Louis suggested as below:

public class MyDummyGeneration extends SimpleFunction<Long, MyClass> { 
    public MyClass apply(final Long resultCount) {
       if(resultCount == 0) {
            return MyUtils.createDummyMyClass();
       } else {
            return null;    // This caused exception
       }
    }
}


public class MyClassPostProcessingTransform extends PTransform<PCollection<MyClass>, PCollection<MyClass>> {
     public PCollection<MyClass> expand(final PCollection<MyClass> input) {
         var count = input.apply(Count.globally());
         var dummyPCollection = count.apply(MapElements.via(new MyDummyGeneration()));
         var collections = PCollectionList.of(input).and(dummyPCollection);
         return collections.apply(Flatten.pCollections());
     }    
}

The return null; caused an exception because a null element is not allowed. I don't know how to express that, when the count is not zero, the dummy PCollection should contain no elements at all.

One big-picture thing I want to clarify: when you write a Beam pipeline, all computation is deferred. The main program only builds a description of the pipeline, and the elements exist only once a runner executes it. This is why sizeCollection.getValue() does not exist: it would imply synchronization between the main program launching the pipeline and the running pipeline.
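To make the deferral concrete, here is a minimal sketch, assuming the Beam Java SDK and the direct runner are on the classpath. The class name DeferredCountSketch, the method describeSize, and the EXECUTED counter are hypothetical names introduced only for illustration. The counter is a demo device that works only because the direct runner executes in the same JVM as the launching program; it would not be visible from a distributed runner.

```java
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.PCollection;

public class DeferredCountSketch {
    // Demo-only counter (hypothetical): observable here only because the
    // direct runner runs in the same JVM as the launching program.
    static final AtomicInteger EXECUTED = new AtomicInteger();

    // Adds two steps to the pipeline graph. Nothing is counted when this
    // method returns; the PCollection<Long> is a handle, not a value.
    static PCollection<String> describeSize(PCollection<String> input) {
        PCollection<Long> size = input.apply(Count.<String>globally());
        return size.apply(MapElements.via(new SimpleFunction<Long, String>() {
            @Override
            public String apply(Long n) {
                EXECUTED.incrementAndGet(); // runs only during pipeline execution
                return "size=" + n;
            }
        }));
    }

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        describeSize(pipeline.apply(Create.of("a", "b", "c")));
        System.out.println("calls before run: " + EXECUTED.get());
        pipeline.run().waitUntilFinish(); // the count is computed here
        System.out.println("calls after run: " + EXECUTED.get());
    }
}
```

Any logic that depends on the count therefore has to live inside a transform applied to the counting PCollection, rather than in an if statement in the launching program.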

The second thing is that we should start from your end-to-end need in order to understand how best to do it. Where does the data in the possibly empty PCollection come from? What are you going to do downstream of it?

A couple of examples:

  • If you are going to do an aggregation downstream, you can unconditionally insert a dummy element that is ignored in any nonempty aggregation.
  • If you already have an aggregation upstream, you could view the result as a side input with a default value.
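If neither restructuring fits, the attempt from the question can probably be repaired by emitting zero or one element from a DoFn instead of returning null from a SimpleFunction. The following is a sketch under stated assumptions, not a definitive implementation: String stands in for MyResult/MyClass, the names DefaultIfEmpty and EmitIfZero are hypothetical, and the input is assumed to be bounded and in the global window, where Count.globally() outputs 0 for an empty input.

```java
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Hypothetical transform: passes the input through and adds one dummy
// element when the (globally windowed) input turns out to be empty.
public class DefaultIfEmpty extends PTransform<PCollection<String>, PCollection<String>> {
    private final String dummy;

    public DefaultIfEmpty(String dummy) {
        this.dummy = dummy;
    }

    @Override
    public PCollection<String> expand(PCollection<String> input) {
        // Contains exactly one dummy if the count is 0, otherwise nothing.
        PCollection<String> dummyIfEmpty = input
            .apply("CountAll", Count.<String>globally())
            .apply("EmitDummyIfZero", ParDo.of(new EmitIfZero(dummy)))
            .setCoder(input.getCoder());
        return PCollectionList.of(input).and(dummyIfEmpty)
            .apply("MergeWithDummy", Flatten.pCollections());
    }

    // A DoFn may emit no output for an element, which MapElements cannot do.
    private static class EmitIfZero extends DoFn<Long, String> {
        private final String dummy;

        EmitIfZero(String dummy) {
            this.dummy = dummy;
        }

        @ProcessElement
        public void processElement(@Element Long count, OutputReceiver<String> out) {
            if (count == 0L) {
                out.output(dummy);
            }
        }
    }
}
```

It would be applied as myResultCollection.apply(new DefaultIfEmpty(dummy)). With non-global windowing or an unbounded source, Count.globally() behaves differently on empty input, so this sketch should not be carried over to those cases unchanged.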
