
How to use kafka.group.id and checkpoints in Spark 3.0 Structured Streaming to continue to read from Kafka where it left off after restart?

Based on the introduction for Spark 3.0 at https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html, it should be possible to set "kafka.group.id" to track the offsets. For our use case, I want to avoid potential data loss if the streaming Spark job fails and is restarted. Based on my previous questions, I have a feeling that kafka.group.id in Spark 3.0 is something that will help:

How to specify the group id of kafka consumer for spark structured streaming?

How to ensure no data loss for kafka data ingestion through Spark Structured Streaming?

However, I tried the settings below in Spark 3.0.

package com.example

/**
 * @author ${user.name}
 */
import scala.math.random

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, BooleanType, LongType}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SaveMode
import org.apache.spark.SparkFiles
import java.util.Properties
import org.postgresql.Driver
import org.apache.spark.sql.streaming.Trigger
import java.time.Instant
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI
import java.sql.Connection
import java.sql.DriverManager
import java.sql.ResultSet
import java.sql.SQLException
import java.sql.Statement


//import org.apache.spark.sql.hive.HiveContext

import scala.io.Source

import java.nio.charset.StandardCharsets

import com.amazonaws.services.kms.{AWSKMS, AWSKMSClientBuilder}
import com.amazonaws.services.kms.model.DecryptRequest
import java.nio.ByteBuffer
import com.google.common.io.BaseEncoding


object App {
    
    def main(args: Array[String]): Unit = {
      
      val spark: SparkSession = SparkSession.builder()
        .appName("MY-APP")
        .getOrCreate()

      import spark.sqlContext.implicits._

      spark.catalog.clearCache()
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
      spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

      spark.sparkContext.setLogLevel("ERROR")
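      // Note: setCheckpointDir only sets the SparkContext/RDD checkpoint directory;
      // it is not a Structured Streaming query checkpointLocation (see the resolution below).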
      spark.sparkContext.setCheckpointDir("/home/ec2-user/environment/spark/spark-local/checkpoint")
      
      System.gc()
      
      val df = spark.readStream
        .format("kafka")
          .option("kafka.bootstrap.servers", "mybroker.io:6667")
          .option("subscribe", "mytopic")
          .option("kafka.security.protocol", "SASL_SSL")
          .option("kafka.ssl.truststore.location", "/home/ec2-user/environment/spark/spark-local/creds/cacerts")
          .option("kafka.ssl.truststore.password", "changeit")
          .option("kafka.ssl.truststore.type", "JKS")
          .option("kafka.sasl.kerberos.service.name", "kafka")
          .option("kafka.sasl.mechanism", "GSSAPI")
          .option("kafka.group.id","MYID")
          .load()

      df.printSchema()

      
      val schema = new StructType()
        .add("id", StringType)
        .add("x", StringType)
        .add("eventtime", StringType)

      val idservice = df.selectExpr("CAST(value AS STRING)")
        .select(from_json(col("value"), schema).as("data"))
        .select("data.*")

       
      val monitoring_df = idservice
                .selectExpr("cast(id as string) id", 
                            "cast(x as string) x",
                            "cast(eventtime as string) eventtime")              

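      // Note: no "checkpointLocation" option is set on this query; per the resolution
      // below, this is why the job restarts from the latest offsets.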
      val monitoring_stream = monitoring_df.writeStream
                              .trigger(Trigger.ProcessingTime("120 seconds"))
                              .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
                                if(!batchDF.isEmpty) 
                                {
                                    batchDF.persist()
                                    printf("At %d, the %dth microbatch has %d records and %d partitions \n", Instant.now.getEpochSecond, batchId, batchDF.count(), batchDF.rdd.partitions.size)                                    
                                    batchDF.show()

                                    batchDF.write.mode(SaveMode.Overwrite).option("path", "/home/ec2-user/environment/spark/spark-local/tmp").saveAsTable("mytable")
                                    spark.catalog.refreshTable("mytable")
                                    
                                    batchDF.unpersist()
                                    spark.catalog.clearCache()
                                }
                            }
                            .start()
                            .awaitTermination()
    }
   
}

The Spark job was tested in standalone mode using the spark-submit command below, but the same problem exists when I deploy it in cluster mode on AWS EMR.

spark-submit --master local[1] \
  --files /home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf,/home/ec2-user/environment/spark/spark-local/creds/cacerts,/home/ec2-user/environment/spark/spark-local/creds/krb5.conf,/home/ec2-user/environment/spark/spark-local/creds/my.keytab \
  --driver-java-options "-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" \
  --conf spark.dynamicAllocation.enabled=false \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/ec2-user/environment/spark/spark-local/creds/client_jaas.conf -Djava.security.krb5.conf=/home/ec2-user/environment/spark/spark-local/creds/krb5.conf" \
  --conf spark.yarn.maxAppAttempts=1000 \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 \
  --class com.example.App \
  ./target/sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar

Then, I started the streaming job to read the streaming data from the Kafka topic. After some time, I killed the Spark job. Then, I waited for 1 hour before starting the job again. If I understand correctly, the new streaming data should start from the offset at which I killed the Spark job. However, it still starts from the latest offset, which caused data loss during the time the job was stopped.

Do I need to configure more options to avoid data loss? Or do I have some misunderstanding of Spark 3.0? Thanks!

Problem solved

The key issue here is that the checkpoint must be added to the query specifically; adding a checkpoint directory to the SparkContext alone is not enough. After adding the checkpoint location to the query, it works. In the checkpoint folder, Spark creates an offsets subfolder that contains offset files named 0, 1, 2, 3, .... Each file shows the offset information for the different partitions, for example:

{"8":109904920,"2":109905750,"5":109905789,"4":109905621,"7":109905330,"1":109905746,"9":109905750,"3":109905936,"6":109905531,"0":109905583}
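As a hedged illustration (the checkpoint path is hypothetical, and spark is the SparkSession from the job above), the most recent offset file can be inspected with the Hadoop FileSystem API that is already imported in the job:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Hypothetical checkpoint root; use the checkpointLocation configured on the query.
val checkpointRoot = "/home/ec2-user/environment/spark/spark-local/checkpoint-query"
val fs = FileSystem.get(new URI(checkpointRoot), spark.sparkContext.hadoopConfiguration)
val offsetsDir = new Path(checkpointRoot, "offsets")

if (fs.exists(offsetsDir)) {
  // Offset files are named 0, 1, 2, ...; the highest-numbered file is the latest batch.
  val batchIds = fs.listStatus(offsetsDir)
    .map(_.getPath.getName)
    .filter(_.forall(_.isDigit))
    .map(_.toLong)
  if (batchIds.nonEmpty) {
    val in = fs.open(new Path(offsetsDir, batchIds.max.toString))
    try println(Source.fromInputStream(in).mkString)
    finally in.close()
  }
}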

One suggestion is to put the checkpoint in external storage, such as S3. This helps recover the offsets even when you need to rebuild the EMR cluster itself, as in the sketch below.
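A minimal sketch of the fix, assuming a hypothetical S3 bucket (my-bucket) and keeping the rest of the query from the job above:

val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  // Query-level checkpoint: this is what lets the query resume from its last offsets.
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my-app")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) {
      batchDF.write
        .mode(SaveMode.Overwrite)
        .option("path", "/home/ec2-user/environment/spark/spark-local/tmp")
        .saveAsTable("mytable")
    }
  }
  .start()

monitoring_stream.awaitTermination()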

According to the Spark Structured Streaming Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all the necessary information about the offsets is stored in Spark's checkpointing files.

Even if you set the consumer group name with kafka.group.id, your application will still not commit the offsets back to Kafka. The information about the next offset to read is only available in the checkpointing files of your Spark application.

If you stop and restart your application without a re-deployment, and ensure that you do not delete old checkpoint files, your application will continue reading from where it left off.

In the Spark Structured Streaming documentation on Recovering from Failures with Checkpointing it is written that:

"In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) [...]"

This can be achieved by setting the following option in your writeStream query (it is not sufficient to set the checkpoint directory in your SparkContext configuration):

.option("checkpointLocation", "path/to/HDFS/dir")

In the docs it is also noted that "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

In addition, the fault tolerance capabilities of Spark Structured Streaming also depend on your output sink, as described in the section Output Sinks.

As you are currently using the ForeachBatch sink, you might not have restart capabilities in your application. One way to make the foreachBatch output safer to replay is sketched below.
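A hedged sketch (one illustrative approach, not prescribed by the documentation; the S3 paths are hypothetical): keying the output on batchId lets a replayed micro-batch overwrite its own output after a restart instead of appending duplicates:

val restart_safe_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my-app")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    if (!batchDF.isEmpty) {
      // Overwrite a batch-specific directory so re-processing this batchId is idempotent.
      batchDF.write
        .mode(SaveMode.Overwrite)
        .parquet(s"s3://my-bucket/output/batch_id=$batchId")
    }
  }
  .start()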
