Beam Java SDK 2.10.0 with Kafka source and Dataflow runner: windowed Count.perElement never fires data out

Question

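I am having an issue with a Beam SDK 2.10.0 job on Google Dataflow.

The flow is simple: I use Kafka as the source, then apply a fixed window, then count elements per key. But it looks like data never leaves the count stage until the job is drained. The output collection Count.PerElement/Combine.perKey(Count)/Combine.GroupedValues.out0 always shows zero elements. Elements are emitted only after the Dataflow job is drained.

Here is the code: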

public KafkaProcessingJob(BaseOptions options) {

    // Read raw Kafka messages and decode them into Avro records.
    PCollection<GenericRecord> genericRecordPCollection = Pipeline.create(options)
        .apply("Read binary Kafka messages", KafkaIO.<String, byte[]>read()
            .withBootstrapServers(options.getBootstrapServers())
            .updateConsumerProperties(configureConsumerProperties())
            .withCreateTime(Duration.standardMinutes(1L))
            .withTopics(inputTopics)
            .withReadCommitted()
            .commitOffsetsInFinalize()
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(ByteArrayDeserializer.class))
        .apply("Map binary message to Avro GenericRecord", new DecodeBinaryKafkaMessage());

    // Continue from the decoded collection; the statement above already ended with ';'.
    genericRecordPCollection
        // The explicit type argument is needed because .triggering(...) is chained on into(...).
        .apply("Apply windowing to records", Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5)))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
            .discardingFiredPanes()
            .withAllowedLateness(Duration.standardMinutes(5)))
        .apply("Extract key", MapElements.into(TypeDescriptors.strings()).via(rec -> getKey(rec)))
        .apply("Count per key", Count.<String>perElement())
        .apply("Write aggregated data to BigQuery",
            new WriteWindowedToBigQuery<>(
                project,
                dataset,
                table,
                configureWindowedTableWrite()));
}

private Map<String, Object> configureConsumerProperties() {
    Map<String, Object> configUpdates = Maps.newHashMap();
    configUpdates.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    return configUpdates;
}

private static String getKey(GenericRecord record) {
    //extract key
}

It looks like the flow never gets past .apply(Count.<String>perElement()).

Can anyone help?

Answer 1

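I found the cause.

It is related to the TimestampPolicy used here (.withCreateTime(Duration.standardMinutes(1L))).

Because our Kafka topic has empty partitions, the topic watermark was never advanced by the default TimestampPolicy. I had to implement a custom policy to fix the problem.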
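To make the reasoning concrete, here is a minimal plain-Java sketch of why one empty partition can hold back every window. It is a simplified model, not Beam code: the class WatermarkSketch and all of its methods are hypothetical names invented for illustration, and it assumes that the source watermark is the minimum of the per-partition watermarks and that a window may fire once the watermark reaches its end. In a real pipeline the equivalent logic would live in a custom TimestampPolicy handed to KafkaIO through withTimestampPolicyFactory; the exact idle-handling rule below is an assumption, not the policy the answer actually used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of per-partition watermarks. These are NOT Beam classes.
final class WatermarkSketch {
    // Marker for a partition from which no record has ever been read.
    static final long NO_RECORDS = Long.MIN_VALUE;

    // Record-driven rule: a partition's watermark is the timestamp of its last record,
    // so an empty partition stays at NO_RECORDS forever.
    static long recordDrivenWatermark(long lastEventMillis) {
        return lastEventMillis;
    }

    // Idle-aware rule (assumed for this sketch): an empty partition reports now - maxDelay.
    static long idleAwareWatermark(long lastEventMillis, long nowMillis, long maxDelayMillis) {
        return lastEventMillis == NO_RECORDS ? nowMillis - maxDelayMillis : lastEventMillis;
    }

    // The source watermark is modelled as the minimum over all partitions.
    static long sourceWatermark(List<Long> partitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : partitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    static long recordDrivenSource(List<Long> lastEvents) {
        List<Long> marks = new ArrayList<>();
        for (long e : lastEvents) {
            marks.add(recordDrivenWatermark(e));
        }
        return sourceWatermark(marks);
    }

    static long idleAwareSource(List<Long> lastEvents, long nowMillis, long maxDelayMillis) {
        List<Long> marks = new ArrayList<>();
        for (long e : lastEvents) {
            marks.add(idleAwareWatermark(e, nowMillis, maxDelayMillis));
        }
        return sourceWatermark(marks);
    }

    // Simplified firing condition: the watermark has reached the end of the window.
    static boolean windowCanFire(long watermarkMillis, long windowEndMillis) {
        return watermarkMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long windowEnd = 5 * minute;   // end of the first 5-minute fixed window
        long now = 7 * minute;         // current processing time in this model
        long maxDelay = 1 * minute;    // mirrors the 1-minute delay passed to withCreateTime
        // Two partitions have records at minute 6; the third partition is empty.
        List<Long> lastEvents = Arrays.asList(6 * minute, 6 * minute, NO_RECORDS);

        long stuck = recordDrivenSource(lastEvents);
        long moving = idleAwareSource(lastEvents, now, maxDelay);
        System.out.println("record-driven: fires=" + windowCanFire(stuck, windowEnd));
        System.out.println("idle-aware: fires=" + windowCanFire(moving, windowEnd));
    }
}
```

In this model the record-driven rule leaves the source watermark pinned at the empty partition's value, so the first window never fires, while the idle-aware rule lets it advance past the window end. That matches the symptom in the question: results appear only when the job is drained.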

Answer 1: accepted, 0 votes, 2019-03-01 12:06:13
