
Spark structured streaming: Current batch is falling behind

This seems to be a very straightforward implementation, but it looks like there are some issues.

This job reads UI event data from a Kafka topic, does some aggregation, and writes the result to an Aerospike database.

Under high traffic I start seeing an issue where the job keeps running fine but no new data is inserted. Looking at the logs, I see this WARNING message:

Current batch is falling behind. The trigger interval is 30000 milliseconds, but spent 43491 milliseconds

A few times the job resumes writing data, but I can see the counts are low, which indicates that there is some data loss.

Here is the code:

Dataset<Row> stream = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", kafkaBootstrapServersString)
        .option("subscribe", newTopic)
        .option("startingOffsets", "latest")
        .option("enable.auto.commit", false)
        .option("failOnDataLoss", false)
        .load();

StreamingQuery query = stream
        .writeStream()
        .option("startingOffsets", "earliest")
        .outputMode(OutputMode.Append())
        .foreach(sink)
        .trigger(Trigger.ProcessingTime(triggerInterval))
        .queryName(queryName)
        .start();

You may need to set maxOffsetsPerTrigger to limit the total number of input records per batch. Otherwise, lag in your application may pull more records into a batch, which slows down the next batch and in turn causes more lag in the following batches.
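One rough way to pick a starting value for the cap is to estimate how many records the job can process within one trigger interval, based on a batch that ran long. The sketch below is only a rule of thumb and does not come from the Spark documentation. The class name, method name, the 0.8 headroom factor, and the count of 1,000,000 records are all assumptions for illustration; only the 30000 ms trigger interval and 43491 ms batch duration are taken from the warning in the question.

```java
public class OffsetCapSketch {

    /**
     * Estimates a per-batch record cap from one slow batch (a heuristic, not a Spark API).
     *
     * @param recordsInSlowBatch records the slow batch processed (hypothetical input)
     * @param batchDurationMs    how long that batch took, in milliseconds
     * @param triggerIntervalMs  configured trigger interval, in milliseconds
     * @param headroom           fraction of the interval to fill, in (0, 1]
     */
    public static long suggestMaxOffsetsPerTrigger(long recordsInSlowBatch,
                                                   long batchDurationMs,
                                                   long triggerIntervalMs,
                                                   double headroom) {
        if (recordsInSlowBatch <= 0 || batchDurationMs <= 0 || triggerIntervalMs <= 0) {
            throw new IllegalArgumentException("counts and durations must be positive");
        }
        if (headroom <= 0.0 || headroom > 1.0) {
            throw new IllegalArgumentException("headroom must be in (0, 1]");
        }
        // Records the job could process in one trigger interval at the observed throughput.
        double recordsPerInterval =
                (double) recordsInSlowBatch * triggerIntervalMs / batchDurationMs;
        // Leave some slack so a batch normally finishes before the next trigger fires.
        return (long) Math.floor(recordsPerInterval * headroom);
    }

    public static void main(String[] args) {
        // 1,000,000 is an assumed record count; the two durations come from the warning.
        long cap = suggestMaxOffsetsPerTrigger(1_000_000L, 43_491L, 30_000L, 0.8);
        System.out.println("maxOffsetsPerTrigger = " + cap);
        // The value would then go on the Kafka source (the reader, not the writer), e.g.
        //   sparkSession.readStream().format("kafka") ... .option("maxOffsetsPerTrigger", cap)
    }
}
```

With these assumed inputs the suggested cap is 551838 records per trigger. The real record count per batch should be read from the query's progress reports, and the cap adjusted after observing a few batches.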

Please refer to the link below for more details on Kafka configuration.

https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html
