Apache Beam - Stream join by key on two unbounded PCollections

Question

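I have two unbounded (KafkaIO) PCollections, and I am applying a tag-based CoGroupByKey to them with a fixed window of 1 minute. Most of the time, however, the join seems to miss one of the tagged collections for test data that shares the same key. Please find the snippet below.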

    KafkaIO.Read<Integer, String> event1 = ... ;


    KafkaIO.Read<Integer, String> event2 = ...;

    PCollection<KV<String,String>> event1Data = p.apply(event1.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

    PCollection<KV<String,String>> event2Data = p.apply(event2.withoutMetadata())
            .apply(Values.<String>create())
            .apply(MapElements.via(new SimpleFunction<String, KV<String, String>>() {
                @Override public KV<String, String> apply(String input) {
                    log.info("Extracting Data");
                    . . . .//Some processing
                    return KV.of(record.get("myKey"), record.get("myValue"));
                }
            }))
            .apply(Window.<KV<String,String>>into(
                    FixedWindows.of(Duration.standardMinutes(1))));

   final TupleTag<String> event1Tag = new TupleTag<>();
   final TupleTag<String> event2Tag = new TupleTag<>();

   PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
            .of(event1Tag, event1Data)
            .and(event2Tag, event2Data)
            .apply(CoGroupByKey.<String>create());

   PCollection<String> finalResultCollection =
            kvpCollection.apply("Join", ParDo.of(
                    new DoFn<KV<String, CoGbkResult>, String>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws IOException {
                            KV<String, CoGbkResult> e = c.element();
                            Iterable<String> event1Values = e.getValue().getAll(event1Tag);
                            Iterable<String> event2Values = e.getValue().getAll(event2Tag);
                            // Check the grouped values, not the KafkaIO.Read objects event1/event2
                            if (event1Values.iterator().hasNext() && event2Values.iterator().hasNext()) {
                               // Process event1 and event2 data and write to c.output
                            }else {
                                System.out.println("Unable to join event1 and event2");
                            }
                        }
                    }));

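With the code above, when I start pumping data with a common key into the two Kafka topics, it never joins; I always get "Unable to join event1 and event2". Please let me know if I am doing something wrong, or whether there is a better way to join two unbounded PCollections on a common key.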
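One thing worth checking for the snippet above: CoGroupByKey only groups elements that share both a key and a window, so two records with the same key that land in different one-minute windows never meet. The plain-Java sketch below does not use Beam at all; the class and method names are made up for illustration. It only mirrors how a fixed window with zero offset maps a non-negative timestamp to its window start.

```java
class FixedWindowSketch {
    // Start (in millis) of the zero-offset fixed window containing the timestamp.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Two timestamps can only be joined if they fall into the same window.
    static boolean sameWindow(long tsA, long tsB, long sizeMillis) {
        return windowStart(tsA, sizeMillis) == windowStart(tsB, sizeMillis);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // 59.5 s and 60.5 s are one second apart but straddle a window boundary.
        System.out.println(sameWindow(59_500L, 60_500L, oneMinute));  // false
        // 60.5 s and 119.999 s both belong to the window starting at 60 s.
        System.out.println(sameWindow(60_500L, 119_999L, oneMinute)); // true
    }
}
```

If the two topics are written at slightly different times, matching keys can end up on either side of a boundary like this, so it may be worth logging the element timestamps on both sides when debugging a missed join.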

Answer 1

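I had a similar problem recently. According to the Beam documentation, to use the CoGroupByKey transform on unbounded PCollections (key-value PCollections specifically), all of the PCollections must have the same windowing and trigger strategy. Because you are working with streaming/unbounded collections, you have to use a Trigger to fire and emit the window output after a certain interval, according to your triggering strategy. The trigger should fire continuously, since you are dealing with streaming data; that is, wrap it so it repeats forever. You also need to apply the accumulating/discarding option to your windowed PCollection to tell Beam what to do after the trigger fires, i.e. whether to accumulate the results or discard the fired pane. Once this windowing, triggering, and accumulation strategy is in place, you can use the CoGroupByKey transform to group multiple unbounded PCollections by a common key.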

Something like this:

// The type argument of Window.into must match the element type of each collection.
PCollection<KV<String, Employee>> windowedCollection1
                    = collection1.apply(Window.<KV<String, Employee>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());

PCollection<KV<String, Department>> windowedCollection2
                    = collection2.apply(Window.<KV<String, Department>>into(FixedWindows.of(Duration.standardMinutes(5)))
                    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                    .withAllowedLateness(Duration.ZERO).accumulatingFiredPanes());

Then use CoGroupByKey:

final TupleTag<Employee> t1 = new TupleTag<>();
final TupleTag<Department> t2 = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> groupByKeyResult =
                    KeyedPCollectionTuple.of(t1, windowedCollection1)
                            .and(t2, windowedCollection2)
                            .apply("Join Streams", CoGroupByKey.create());

Now you can process the grouped PCollection in a ParDo transform.
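As a rough illustration of that processing step, the plain-Java sketch below pairs every value from one side with every value from the other for a single key. It deliberately avoids the Beam API: in a real DoFn the two iterables would come from CoGbkResult.getAll(tag), as in the question's code, and the class name, method name, and output format here are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class JoinSketch {
    // Inner join for one key: every left value is paired with every right value.
    // If either side is empty (it has not arrived in this pane), nothing is emitted.
    static List<String> join(String key, Iterable<String> left, Iterable<String> right) {
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + ":" + l + "," + r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> none = Collections.emptyList();
        System.out.println(join("k", Arrays.asList("e1"), Arrays.asList("d1", "d2"))); // [k:e1,d1, k:e1,d2]
        System.out.println(join("k", Arrays.asList("e1"), none));                      // []
    }
}
```

With a trigger that fires repeatedly, the same key can reach this step several times, sometimes with only one side populated, so treating an empty side as "nothing to emit yet" rather than as an error is probably the safer default.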

Hope this helps!

Answer 2

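I think I have more or less figured out the issue. The default trigger was being used for the two unbounded sources at the CoGroupByKey, so whenever a new event arrived at the two sources it tried to apply the join immediately, because no data-driven trigger was configured for my stream join pipeline. I configured the required triggering(), discardingFiredPanes(), and withAllowedLateness() properties on my Window function, which solved my stream join use case.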
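The two answers choose different pane modes: accumulatingFiredPanes() in Answer 1 and discardingFiredPanes() here. The sketch below is a plain-Java simulation, not Beam code, of a single window whose trigger fires after every element; the class and method names are invented for illustration. It shows what each mode would hand downstream on each firing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PaneModeSketch {
    // Simulates a trigger that fires after every element within one window.
    // accumulating = true  -> each pane repeats everything seen so far
    // accumulating = false -> each pane holds only elements since the last firing
    static List<List<String>> panes(List<String> elements, boolean accumulating) {
        List<List<String>> fired = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String element : elements) {
            buffer.add(element);
            fired.add(new ArrayList<>(buffer)); // trigger fires: emit a pane
            if (!accumulating) {
                buffer.clear(); // discarding mode drops buffered state after firing
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(panes(input, true));  // [[a], [a, b], [a, b, c]]
        System.out.println(panes(input, false)); // [[a], [b], [c]]
    }
}
```

For a join this matters: in discarding mode, a record from one topic that has already been emitted in an earlier pane is not repeated when its partner arrives later, whereas accumulating mode repeats earlier records and can therefore produce duplicate joined output. Which trade-off is acceptable depends on the trigger and on how the downstream step handles repeats.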

Answer 1  1  2020-04-30 14:39:11

Answer 2  0  2017-10-07 17:21:58
