
Spark Streaming 1.6 + Kafka: Too many batches in “queued” status

I'm using Spark Streaming to consume messages from a Kafka topic which has 10 partitions. I'm using the direct approach to consume from Kafka, and the code can be found below:

import kafka.serializer.StringDecoder
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Conf, SparkContextBuilder, HiveSQLContext, SchemaBuilder and createDomainObject
// are helpers from my own project.
def createStreamingContext(conf: Conf): StreamingContext = {
    val dateFormat = conf.dateFormat.apply
    val hiveTable = conf.tableName.apply

    val sparkConf = new SparkConf()

    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.set("spark.driver.allowMultipleContexts", "true")

    val sc = SparkContextBuilder.build(Some(sparkConf))
    val ssc = new StreamingContext(sc, Seconds(conf.batchInterval.apply))

    val kafkaParams = Map[String, String](
      "bootstrap.servers" -> conf.kafkaBrokers.apply,
      "key.deserializer" -> classOf[StringDeserializer].getName,
      "value.deserializer" -> classOf[StringDeserializer].getName,
      "auto.offset.reset" -> "smallest",
      "enable.auto.commit" -> "false"
    )

    // Direct (receiver-less) stream: one RDD partition per Kafka partition
    val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      kafkaParams,
      conf.topics.apply().split(",").toSet[String]
    )

    // Window the stream so that several micro-batches are written together
    val windowedKafkaStream = directKafkaStream.window(Seconds(conf.windowDuration.apply))
    ssc.checkpoint(conf.sparkCheckpointDir.apply)

    // Parse each comma-separated message value into a Row
    val eirRDD: DStream[Row] = windowedKafkaStream.map { kv =>
      val fields: Array[String] = kv._2.split(",")
      createDomainObject(fields, dateFormat)
    }

    eirRDD.foreachRDD { rdd =>
      val schema = SchemaBuilder.build()
      val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
      val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)

      // Append the windowed data into the partitioned Hive table
      eirDF
        .select(schema.map(c => col(c.name)): _*)
        .write
        .mode(SaveMode.Append)
        .partitionBy("year", "month", "day")
        .insertInto(hiveTable)
    }
    ssc
}

As can be seen from the code, I used window to achieve this (please correct me if I'm wrong): since there's an action that inserts into a Hive table, I want to avoid writing to HDFS too often, so what I want is to hold enough data in memory and only then write to the filesystem. I thought that using window would be the right way to achieve it.
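For reference, DStream.window also accepts a slide interval; with the single-argument form used above, the window slides every batch interval, so foreachRDD still fires once per batch over the last windowDuration seconds of data. A minimal sketch of a tumbling window, reusing the names from the code above (illustrative only, not necessarily the fix):

    // Sketch: window length == slide interval, so foreachRDD runs once per window
    // instead of once per batch interval (the slide must be a multiple of the batch interval)
    val windowedKafkaStream = directKafkaStream.window(
      Seconds(conf.windowDuration.apply),  // window length
      Seconds(conf.windowDuration.apply)   // slide interval
    )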

Now, in the image below, you can see that there are many batches queued, while the single batch being processed takes forever to complete.

[Image: only one batch is in the processing state, while the others stay queued forever]

I'm also providing the details of the single batch being processed:

[Image: the insert generates thousands of tasks. Why?]

Why are there so many tasks for the insert action, when there aren't many events in the batch? Sometimes having 0 events also generates thousands of tasks that take forever to complete.

Is the way I process microbatches with Spark wrong?

Thanks for your help!

Some extra details:

YARN containers have a maximum of 2 GB. In this YARN queue, the maximum number of containers is 10. When I look at the details of the queue where this Spark application is being executed, the number of containers is extremely large: around 15k pending containers.
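A small diagnostic I could drop into foreachRDD to see where the tasks come from (sketch only, reusing the names from the code above; the count() itself triggers an extra job):

    eirRDD.foreachRDD { rdd =>
      // Diagnostic: roughly one write task is launched per RDD partition in this micro-batch
      println(s"partitions = ${rdd.partitions.length}, events = ${rdd.count()}")
    }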

Well, I finally figured it out. Apparently Spark Streaming does not get along with empty events, so inside the foreachRDD portion of the code I added the following:

eirRDD.foreachRDD { rdd =>
  // Only run the action when the micro-batch actually contains data
  if (rdd.take(1).length != 0) {
    // do action
  }
}

That way we skip empty micro-batches. The isEmpty() method does not work.
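For completeness, a sketch of what that guard looks like combined with the write from the question (same helper names as above, illustration only):

    eirRDD.foreachRDD { rdd =>
      // Skip the whole Hive write when the windowed micro-batch has no records
      if (rdd.take(1).nonEmpty) {
        val schema = SchemaBuilder.build()
        val sqlContext: HiveContext = HiveSQLContext.getInstance(Some(rdd.context))
        val eirDF: DataFrame = sqlContext.createDataFrame(rdd, schema)

        eirDF
          .select(schema.map(c => col(c.name)): _*)
          .write
          .mode(SaveMode.Append)
          .partitionBy("year", "month", "day")
          .insertInto(hiveTable)
      }
    }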

Hope this helps somebody else! ;)

You can check with rdd.partitions.size. If the size is smaller than 1, it means the RDD is empty. But if you use the take() method, you may get an error when the RDD is empty.
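A minimal sketch of that partition-based check, reusing the foreachRDD from the question (illustration only):

    eirRDD.foreachRDD { rdd =>
      // Guard on the partition count instead of calling take() on the RDD
      if (rdd.partitions.size >= 1) {
        // do action (the Hive write from the question)
      }
    }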
