Apache Beam - Stream Join by Key on two unbounded PCollections

Question

am having two Unbounded( KafkaIO ) PCollections for which am applying tag based CoGroupByKey with a fixed window of 1 min, however at the time of joining most of the time the collection seem to miss one of the tagged data for some test data having same keys. Please find the below snippet.

    KafkaIO.Read<Integer, String> event1 = ... ;


    KafkaIO.Read<Integer, String> event2 = ...;

    PCollection<KV<String,String>> event1Data = p.apply(event1.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

    PCollection<KV<String,String>> event2Data = p.apply(event2.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

   final TupleTag<String> event1Tag = new TupleTag<>();
   final TupleTag<String> event2Tag = new TupleTag<>();

   PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
            .of(event1Tag, event1Data)
            .and(event2Tag, event2Data)
            .apply(CoGroupByKey.<String>create());

   PCollection<String> finalResultCollection =
            kvpCollection.apply("Join", ParDo.of(
                    new DoFn<KV<String, CoGbkResult>, String>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws IOException {
                            KV<String, CoGbkResult> e = c.element();
                            Iterable<String> event1Values = e.getValue().getAll(event1Tag);
                            Iterable<String> event2Values = e.getValue().getAll(event2Tag);
                            if( event1.iterator().hasNext() && event2.iterator().hasNext() ){
                               // Process event1 and event2 data and write to c.output
                            }else {
                                System.out.println("Unable to join event1 and event2");
                            }
                        }
                    }));

For the above code when I start pumping data with a common key for the two kafka topics, its never getting joined ie Unable to join event1 and event2 , kindly let me know if am doing anything wrong or is there a better way to join two unbounded PCollection on a common key.

Answer 1

I had similar issue recently. As per beam documentation, to use CoGroupByKey transfrom on unbounded PCollections (key-value PCollection, specifically), all the PCollection should have same windowing and trigger strategy. So you will have to use Trigger to fire and emit window output after certain interval based on your Triggering strategy since you are working with streaming/unbounded collections. This trigger should fire contineously since you are dealing with streaming data here ie use your Trigger repeatedly forever. You also need to apply accumulating/discarding option on your windowed PCollection to tell beam what should be done after trigger is fired ie to accumulate the result of discard the window pane. After using this windowing, trigger and accumulating strategy you should use CoGroupByKey transform to group multiple unbounded PCollection using a common key.

Something like this :

PCollection<KV<String, Employee>> windowedCollection1
                    = collection1.apply(Window.<KV<String, DeliveryTimeWindow>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());


PCollection<KV<String, Department>> windowedCollection2
                    = collection2.apply(Window.<KV<String, DeliveryTimeWindow>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());

Then use CoGroupByKey :

final TupleTag<Employee> t1 = new TupleTag<>();
final TupleTag<Department> t2 = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> groupByKeyResult =
                    KeyedPCollectionTuple.of(t1,windowedCollection1)
.and(t2,windowedCollection2) 
                            .apply("Join Streams", CoGroupByKey.create());

now you can process your grouped PCollection in ParDo transform.

Hope this helps!

Answer 2

I guess I somewhat figured out the issue, the default trigger was getting triggered for the two Unbounded sources at CoGroupByKey hence as and when there was a new event arriving at the two sources it was trying to apply join operation immediately, as there were no Data Driven Triggers configured for my steam join pipeline. I configured the required triggering() discardingFiredPanes() withAllowedLateness() properties to my Window function which solved my stream join usecase.

Apache Beam - Stream Join by Key on two unbounded PCollections

Question

2 answers

solution1
1 2020-04-30 14:39:11

solution2
0 2017-10-07 17:21:58

Apache Beam - Stream Join by Key on two unbounded PCollections

Question

2 answers

solution1 1 2020-04-30 14:39:11

solution2 0 2017-10-07 17:21:58

solution1
1 2020-04-30 14:39:11

solution2
0 2017-10-07 17:21:58