
Spark Structured Streaming Kafka Offset Management

I'm looking into storing Kafka offsets inside of Kafka for Spark Structured Streaming, the way it works for DStreams with stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges). Is this supported for Structured Streaming? If yes, how can I achieve it?
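For reference, this is roughly the DStreams pattern I mean, as a sketch following the Spark Streaming + Kafka integration guide (the class and method names are illustrative, and the stream parameter is assumed to come from KafkaUtils.createDirectStream):

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DStreamCommitSketch {

    // 'stream' is assumed to be the result of KafkaUtils.createDirectStream(...).
    static void processAndCommit(JavaInputDStream<ConsumerRecord<String, String>> stream) {
        stream.foreachRDD(rdd -> {
            // Capture the offset ranges consumed by this batch.
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();

            // ... process the records of this batch ...

            // Asynchronously commit the consumed ranges back to Kafka.
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsetRanges);
        });
    }
}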

I know about HDFS checkpointing using .option("checkpointLocation", checkpointLocation), but I'm interested specifically in built-in offset management.

I'm expecting Kafka alone to store the offsets, without a Spark HDFS checkpoint.

I am using this piece of code, found somewhere:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class OffsetManager {

    private final String storagePrefix;

    public OffsetManager(String storagePrefix) {
        this.storagePrefix = storagePrefix;
    }

    /**
     * Overwrite the offset for the topic in an external store.
     *
     * @param topic     - Topic name.
     * @param partition - Partition of the topic.
     * @param offset    - Offset to be stored.
     */
    void saveOffsetInExternalStore(String topic, int partition, long offset) {
        // Truncate and rewrite the file so it only ever holds the latest offset.
        try (BufferedWriter bufferedWriter =
                     new BufferedWriter(new FileWriter(storageName(topic, partition), false))) {
            bufferedWriter.write(Long.toString(offset));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    /**
     * @return the last stored offset + 1 for the provided topic and partition,
     *         or 0 if no offset has been stored yet.
     */
    long readOffsetFromExternalStore(String topic, int partition) {
        try (Stream<String> lines = Files.lines(Paths.get(storageName(topic, partition)))) {
            // The file contains a single line: the last successfully processed offset.
            return Long.parseLong(lines.findFirst().get()) + 1;
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    private String storageName(String topic, int partition) {
        // One file per topic/partition under a Windows-style "Offsets" directory.
        return "Offsets\\" + storagePrefix + "-" + topic + "-" + partition;
    }

}

saveOffsetInExternalStore is called after record processing succeeds; otherwise no offset is stored. Since I am using Kafka topics as the source, I specify startingOffsets as the offsets retrieved from readOffsetFromExternalStore.
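A minimal sketch of that wiring, assuming a hypothetical single-partition topic named events (the topic name, bootstrap servers, and storage prefix below are placeholders, not from my actual job). startingOffsets takes the JSON format documented in the Structured Streaming + Kafka integration guide:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StartFromStoredOffsets {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("offset-demo")
                .getOrCreate();

        // Resume from the offset persisted by OffsetManager (last stored + 1).
        OffsetManager offsetManager = new OffsetManager("myJob");
        long startOffset = offsetManager.readOffsetFromExternalStore("events", 0);

        // JSON map of topic -> partition -> offset; in this format,
        // -2 means "earliest" and -1 means "latest".
        String startingOffsets = "{\"events\":{\"0\":" + startOffset + "}}";

        Dataset<Row> df = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .option("startingOffsets", startingOffsets)
                .load();
    }
}

Note that startingOffsets only applies when a query starts without existing checkpoint state, which is exactly the situation here.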

"Is it supporting for structured streaming?" “它支持结构化流媒体吗?”

No, Structured Streaming does not support committing offsets back to Kafka the way Spark Streaming (DStreams) does. The Spark Structured Streaming + Kafka Integration Guide is very precise about this in its section on Kafka-specific configurations:

"Kafka source doesn't commit any offset." “Kafka 源没有提交任何偏移量。”

I have written a more comprehensive answer about this in How to manually set groupId and commit Kafka offsets in Spark Structured Streaming.
