Restarting Spark Structured Streaming Job consumes Millions of Kafka messages and dies

We have a Spark Streaming application running on Spark 2.3.3.

Basically, it opens a Kafka stream:

  kafka_stream = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "mykafka:9092") \
  .option("subscribe", "mytopic") \
  .load()

The kafka topic has 2 partitions. After that, there are some basic filtering operations, some Python UDFs and an explode() on a column, like:

   stream = apply_operations(kafka_stream)
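
For illustration, a hypothetical apply_operations could look roughly like this (the column names, the filter condition and the UDF body are made-up placeholders, not the real logic of the job):

   from pyspark.sql import functions as F
   from pyspark.sql.types import ArrayType, StringType

   # placeholder UDF: split the message value into tokens
   tokenize = F.udf(lambda s: (s or "").split(","), ArrayType(StringType()))

   def apply_operations(kafka_stream):
       return kafka_stream \
           .selectExpr("CAST(value AS STRING) AS value") \
           .filter(F.col("value").isNotNull()) \
           .withColumn("tokens", tokenize("value")) \
           .withColumn("token", F.explode("tokens"))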

apply_operations does all the work on the data. In the end, we would like to write the stream to a sink, i.e.:

   stream.writeStream \
   .format("our.java.sink.Class") \
   .option("some-option", "value") \
   .trigger(processingTime='15 seconds') \
   .start()

To let this stream operation run forever, we apply:

   spark.streams.awaitAnyTermination()

at the end.

So far, so good. Everything runs for days. But due to a network problem, the job died for a few days, and there are now millions of messages in the kafka stream waiting to be caught up on.

When we restart the streaming job using spark-submit, the first batch will be too large and will take ages to complete. We thought there might be a way to limit the size of the first batch with some parameter, but we did not find anything that helped.

We tried:

  • spark.streaming.backpressure.enabled=true along with spark.streaming.backpressure.initialRate=2000, spark.streaming.kafka.maxRatePerPartition=1000 and spark.streaming.receiver.maxrate=2000

  • setting spark.streaming.backpressure.pid.minrate to a lower value did not have an effect either

  • setting the option("maxOffsetsPerTrigger", 10000) had no effect either (see the sketch after this list for roughly how these settings were applied)
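
For reference, the spark.streaming.* settings are ordinary Spark configs, while maxOffsetsPerTrigger is an extra .option() on the Kafka reader. A sketch of how such configs can be set (the app name is a placeholder; passing them via --conf on spark-submit is equivalent):

   from pyspark.sql import SparkSession

   # rate-limit / backpressure settings we tried, set as ordinary Spark configs
   spark = SparkSession.builder \
       .appName("our-streaming-job") \
       .config("spark.streaming.backpressure.enabled", "true") \
       .config("spark.streaming.backpressure.initialRate", "2000") \
       .config("spark.streaming.kafka.maxRatePerPartition", "1000") \
       .config("spark.streaming.receiver.maxRate", "2000") \
       .getOrCreate()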

Now, after we restart the pipeline, sooner or later the whole Spark job will crash again. We cannot simply increase the memory or the cores used for the Spark job.

Is there anything we missed that would let us control the number of events processed in one stream batch?

You wrote in the comments that you are using spark-streaming-kafka-0-8_2.11, and that API version is not able to handle maxOffsetsPerTrigger (or, as far as I know, any other mechanism to reduce the number of consumed messages), as it was only implemented for the newer API spark-streaming-kafka-0-10_2.11. According to the documentation, this newer API also works with your Kafka version 0.10.2.2.
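
For illustration, once the job is submitted with the newer Kafka integration on the classpath (for the Structured Streaming reader the corresponding artifact is org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.3, e.g. passed to spark-submit via --packages; the exact coordinates for your build are an assumption), the documented maxOffsetsPerTrigger option limits how many offsets each micro-batch reads, which keeps the first batch after a restart bounded:

   # same reader as before, but with maxOffsetsPerTrigger honoured by the
   # 0-10 Kafka source: at most 10000 offsets are read per trigger,
   # split across the topic's partitions
   kafka_stream = spark \
       .readStream \
       .format("kafka") \
       .option("kafka.bootstrap.servers", "mykafka:9092") \
       .option("subscribe", "mytopic") \
       .option("maxOffsetsPerTrigger", 10000) \
       .load()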
