
Is checkpointing mandatory when using a Kafka sink in Spark Structured Streaming?

I'm trying to use Spark Structured Streaming to write aggregated data to Kafka. Here's my code:

dataset
    .writeStream()
    .queryName(queryName)
    .outputMode(OutputMode.Append())
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaBootstrapServers)
    .option("topic", "topic")
    .trigger(Trigger.ProcessingTime("15 seconds"))
    // .option("checkpointLocation", checkpointLocation)
    .start();

If I comment out checkpointLocation, I get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: checkpointLocation must be specified either through option("checkpointLocation", ...) or SparkSession.conf.set("spark.sql.streaming.checkpointLocation", ...);
    at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:210)
    at org.apache.spark.sql.streaming.StreamingQueryManager$$anonfun$3.apply(StreamingQueryManager.scala:205)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:204)
    at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:282)
    at <myClass>)

Is checkpointing mandatory when using a Kafka sink? I could not find an answer in the documentation.

Checkpointing is needed to keep track of exactly what was processed and written to the sink.

Let's assume you have a bunch of files in an input folder. When you start the stream, Spark begins processing the files from the source. To make sure each file is processed and written to the sink only once, Spark uses checkpointing, where all progress information is stored.

In other words, checkpointing is needed not for the sink itself but for the entire stream, to make sure that the same input data won't be processed over and over again.
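
For reference, here is a minimal, self-contained sketch of this kind of query with a checkpoint location supplied. The bootstrap servers, topic, checkpoint path, query name, and the rate-source placeholder standing in for the aggregated dataset are illustrative assumptions, not part of the original question. As the exception message notes, a default location can instead be set once on the SparkSession via spark.sql.streaming.checkpointLocation, shown here as a commented-out alternative.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.OutputMode;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaSinkWithCheckpoint {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("KafkaSinkWithCheckpoint")
                // Alternative: set a default checkpoint root for every query in this session.
                // .config("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")
                .getOrCreate();

        // Placeholder streaming source; replace with the real aggregated dataset.
        // The Kafka sink expects a string or binary "value" column (and optionally "key").
        Dataset<Row> dataset = spark.readStream()
                .format("rate")
                .load()
                .selectExpr("CAST(value AS STRING) AS value");

        StreamingQuery query = dataset
                .writeStream()
                .queryName("aggregations-to-kafka")
                .outputMode(OutputMode.Append())
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("topic", "topic")
                .trigger(Trigger.ProcessingTime("15 seconds"))
                // Per-query checkpoint directory; Spark tracks source offsets and
                // commit logs here so the same input is not reprocessed after a restart.
                .option("checkpointLocation", "/tmp/checkpoints/aggregations-to-kafka")
                .start();

        query.awaitTermination();
    }
}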
