
Beam Java SDK 2.10.0 with Kafka source and Dataflow runner: windowed Count.perElement never emits data

I have an issue running a Beam SDK 2.10.0 job on Google Dataflow.

The flow is simple: I use Kafka as a source, apply fixed windows, then count elements by key. But it looks like data never leaves the counting stage until the job is drained. The element count of the output collection Count.PerElement/Combine.perKey(Count)/Combine.GroupedValues.out0 is always zero. Elements are emitted only after the Dataflow job is drained.

Here is the code:

public KafkaProcessingJob(BaseOptions options) {

    PCollection<GenericRecord> genericRecordPCollection = Pipeline.create(options)
                     .apply("Read binary Kafka messages", KafkaIO.<String, byte[]>read()
                           .withBootstrapServers(options.getBootstrapServers())
                           .updateConsumerProperties(configureConsumerProperties())
                           .withCreateTime(Duration.standardMinutes(1L))
                           .withTopics(inputTopics)
                           .withReadCommitted()
                           .commitOffsetsInFinalize()
                           .withKeyDeserializer(StringDeserializer.class)
                           .withValueDeserializer(ByteArrayDeserializer.class))

                    .apply("Map binary message to Avro GenericRecord", new DecodeBinaryKafkaMessage());

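    // The statement above ends with ';', so the chain continues from the decoded collection.
    genericRecordPCollection
                    .apply("Apply windowing to records", Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5)))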
                                       .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
                                       .discardingFiredPanes()
                                       .withAllowedLateness(Duration.standardMinutes(5)))

                    .apply("Write aggregated data to BigQuery", MapElements.into(TypeDescriptors.strings()).via(rec -> getKey(rec)))
                            .apply(Count.<String>perElement())
                            .apply(
                                new WriteWindowedToBigQuery<>(
                                    project,
                                    dataset,
                                    table,
                                    configureWindowedTableWrite()));   
}

private Map<String, Object> configureConsumerProperties() {
    Map<String, Object> configUpdates = Maps.newHashMap();
    configUpdates.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    return configUpdates;
}

private static String getKey(GenericRecord record) {
    //extract key
}

It looks like the flow never leaves the .apply(Count.<String>perElement()) stage.

Can somebody help?

I have found the cause.

It is related to the TimestampPolicy used here ( .withCreateTime(Duration.standardMinutes(1L)) ).

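Some partitions in our Kafka topics were empty, so the topic watermark never advanced under the built-in policy that withCreateTime installs. Without an advancing watermark the windows never closed, which is why Count emitted nothing until the drain forced them to close. I needed to implement a custom TimestampPolicy to solve the issue.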
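To illustrate the reasoning, here is a minimal plain-Java sketch of why one empty partition can hold back every window, and of the kind of rule a custom policy might apply. It is not the code I deployed and it uses no Beam or Kafka classes. The class name PartitionWatermarkSketch, its method names, and the sample numbers are all hypothetical. It assumes the source watermark is the minimum of the per-partition watermarks, and that an idle partition can safely be advanced to "now minus the allowed delay".

```java
// Hypothetical model of per-partition watermarks; plain Java, no Beam or Kafka types.
final class PartitionWatermarkSketch {

    /** Stand-in for "no record has been read from this partition yet". */
    static final long NO_RECORD = Long.MIN_VALUE;

    /** Create-time style rule: the watermark trails the newest record timestamp by maxDelay. */
    static long createTimeWatermark(long maxEventTimeMillis, long maxDelayMillis) {
        if (maxEventTimeMillis == NO_RECORD) {
            return NO_RECORD; // an empty partition never moves under this rule
        }
        return maxEventTimeMillis - maxDelayMillis;
    }

    /** Idle-aware rule: with an empty backlog, fall back to wall-clock time minus maxDelay. */
    static long idleAwareWatermark(long maxEventTimeMillis, long backlog,
                                   long nowMillis, long maxDelayMillis) {
        long fromRecords = createTimeWatermark(maxEventTimeMillis, maxDelayMillis);
        if (backlog == 0) {
            return Math.max(fromRecords, nowMillis - maxDelayMillis);
        }
        return fromRecords;
    }

    /** Assumption: the source watermark is the minimum over all partitions. */
    static long sourceWatermark(long[] partitionWatermarks) {
        long min = Long.MAX_VALUE;
        for (long w : partitionWatermarks) {
            min = Math.min(min, w);
        }
        return min;
    }

    /** A window can fire once the watermark reaches its (exclusive) end. */
    static boolean windowCanFire(long watermarkMillis, long windowEndMillis) {
        return watermarkMillis >= windowEndMillis;
    }

    public static void main(String[] args) {
        long maxDelay = 60_000L;   // 1 minute, as in withCreateTime(Duration.standardMinutes(1L))
        long windowEnd = 300_000L; // end of the first 5-minute window
        long now = 430_000L;       // sample wall-clock time

        long busy = createTimeWatermark(420_000L, maxDelay);          // partition with data
        long emptyStuck = createTimeWatermark(NO_RECORD, maxDelay);   // empty partition
        long emptyIdle = idleAwareWatermark(NO_RECORD, 0L, now, maxDelay);

        // Prints false: the empty partition pins the minimum.
        System.out.println(windowCanFire(sourceWatermark(new long[] {busy, emptyStuck}), windowEnd));
        // Prints true: the idle partition no longer holds the minimum back.
        System.out.println(windowCanFire(sourceWatermark(new long[] {busy, emptyIdle}), windowEnd));
    }
}
```

In Beam itself, a rule like idleAwareWatermark would go in the getWatermark(PartitionContext) method of a TimestampPolicy subclass, registered on the reader with withTimestampPolicyFactory(...) in place of withCreateTime(...). The trade-off is that a record arriving later in the idle partition with a timestamp older than "now minus maxDelay" is treated as late.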
