
Apache Beam - Stream Join by Key on two unbounded PCollections

I have two unbounded (KafkaIO) PCollections to which I am applying a tag-based CoGroupByKey with a fixed window of 1 minute. However, most of the time, at join time the grouped result seems to be missing the data for one of the tags, even for test data sharing the same keys. Please see the snippet below.

    KafkaIO.Read<Integer, String> event1 = ... ;


    KafkaIO.Read<Integer, String> event2 = ...;

    PCollection<KV<String,String>> event1Data = p.apply(event1.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    // ... some processing that builds `record` from `input` ...
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

    PCollection<KV<String,String>> event2Data = p.apply(event2.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    // ... some processing that builds `record` from `input` ...
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

   final TupleTag<String> event1Tag = new TupleTag<>();
   final TupleTag<String> event2Tag = new TupleTag<>();

   PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
            .of(event1Tag, event1Data)
            .and(event2Tag, event2Data)
            .apply(CoGroupByKey.<String>create());

   PCollection<String> finalResultCollection =
            kvpCollection.apply("Join", ParDo.of(
                    new DoFn<KV<String, CoGbkResult>, String>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws IOException {
                            KV<String, CoGbkResult> e = c.element();
                            Iterable<String> event1Values = e.getValue().getAll(event1Tag);
                            Iterable<String> event2Values = e.getValue().getAll(event2Tag);
                            if (event1Values.iterator().hasNext() && event2Values.iterator().hasNext()) {
                               // Process event1 and event2 data and write to c.output
                            } else {
                                System.out.println("Unable to join event1 and event2");
                            }
                        }
                    }));

With the above code, when I start pumping data with a common key into the two Kafka topics, the streams never get joined, i.e. I always see "Unable to join event1 and event2". Kindly let me know if I am doing anything wrong, or whether there is a better way to join two unbounded PCollections on a common key.

I had a similar issue recently. As per the Beam documentation, to use the CoGroupByKey transform on unbounded PCollections (key-value PCollections, specifically), all of the PCollections must have the same windowing and trigger strategy. Since you are working with streaming/unbounded collections, you have to use a trigger to fire and emit the window output at certain intervals, and because you are dealing with streaming data the trigger should fire repeatedly, forever. You also need to apply an accumulating/discarding option to your windowed PCollections to tell Beam what should happen after the trigger fires, i.e. whether to accumulate the results or discard the window pane. After applying this windowing, trigger and accumulation strategy, use the CoGroupByKey transform to group the unbounded PCollections by a common key.

Something like this:

PCollection<KV<String, Employee>> windowedCollection1
                    = collection1.apply(Window.<KV<String, Employee>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());


PCollection<KV<String, Department>> windowedCollection2
                    = collection2.apply(Window.<KV<String, Department>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());

Then use CoGroupByKey:

final TupleTag<Employee> t1 = new TupleTag<>();
final TupleTag<Department> t2 = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> groupByKeyResult =
                    KeyedPCollectionTuple.of(t1, windowedCollection1)
                            .and(t2, windowedCollection2)
                            .apply("Join Streams", CoGroupByKey.create());

Now you can process your grouped PCollection in a ParDo transform.
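For completeness, a minimal sketch of that ParDo, assuming the t1/t2 tags and groupByKeyResult from the snippets above (Employee and Department are the same placeholder types; the per-key inner-join semantics and comma-separated string output are illustrative assumptions, not the only option):

PCollection<String> joined = groupByKeyResult.apply("Process Join",
        ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                KV<String, CoGbkResult> e = c.element();
                // All values seen under each tag for this key in this pane
                Iterable<Employee> employees = e.getValue().getAll(t1);
                Iterable<Department> departments = e.getValue().getAll(t2);
                // Emit one joined record per Employee/Department pair (inner join)
                for (Employee emp : employees) {
                    for (Department dept : departments) {
                        c.output(e.getKey() + "," + emp + "," + dept);
                    }
                }
            }
        }));

Note that with accumulatingFiredPanes() each trigger firing re-emits previously seen values for the window, so downstream consumers should be prepared to handle duplicate joined records.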

Hope this helps!

I guess I somewhat figured out the issue: the default trigger was firing for the two unbounded sources at the CoGroupByKey, so whenever a new event arrived at either source it tried to apply the join operation immediately, since no data-driven triggers were configured for my stream join pipeline. I configured the required triggering(), discardingFiredPanes() and withAllowedLateness() properties on my Window transform, which solved my stream join use case.
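For reference, applied to the 1-minute windows from the question, such a configuration might look like the sketch below (the element-count trigger and zero allowed lateness are assumptions; tune both for your latency and lateness requirements, and apply the identical configuration to event2Data, since CoGroupByKey requires the same strategy on all of its inputs):

PCollection<KV<String, String>> windowedEvent1Data = event1Data.apply(
        Window.<KV<String, String>>into(FixedWindows.of(Duration.standardMinutes(1)))
                // Re-fire for every new element instead of waiting for the watermark
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.ZERO)
                // Emit each pane once and discard its contents afterwards
                .discardingFiredPanes());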
