Unable to write results to kafka topic using spark

My end goal is to write the aggregated data out to a new Kafka topic and read it back, batch by batch as it gets processed. I followed the official documentation and a couple of other posts, but no luck. I first read the topic, perform the aggregation, save the results to another Kafka topic, then read that topic again and print it to the console. Below is my code:

package com.sparkKafka
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
object SparkKafkaTopic3 {
  def main(ar: Array[String]) {
    val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo5")
      .option("startingOffsets", "earliest")
      .load()

    import spark.implicits._
    df.printSchema()
    val newDf = df.select($"value".cast("string"), $"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
    val windowedCount = newDf
      .withWatermark("timestamp", "40000 milliseconds")
      .groupBy(
        window(col("timestamp"), "20 seconds"), col("songName"))
      .agg(count(col("songName")).alias("numberOfTimes"))


    val outputTopic = windowedCount
      .select(struct("*").cast("string").as("value")) // Added this line.
      .writeStream
      .format("kafka")
      .option("topic", "songDemo6")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/spark_ss/")
      .start()

    val finalOutput = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo6").option("startingOffsets", "earliest")
      .load()
      .writeStream.format("console")
      .outputMode("append").start()

    spark.streams.awaitAnyTermination()

  }
}

When I run this, the following exception initially appears in the console:

java.lang.IllegalStateException: Cannot find earliest offsets of Set(songDemo4-0). Some data may have been missed. 
Some data may have been lost because they are not available in Kafka any more; either the
 data was aged out by Kafka or the topic may have been deleted before all the data in the
 topic was processed. If you don't want your streaming query to fail on such cases, set the
 source option "failOnDataLoss" to "false".
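
For reference, the failOnDataLoss option the message refers to is set on the Kafka source. A minimal sketch of the reader above with that option added (this only stops the query from failing; it does not recover data that Kafka has already aged out):

    // Same Kafka source as above, but told not to fail when the earliest
    // offsets are no longer available in the topic.
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo5")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false") // suppresses the IllegalStateException above
      .load()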

Also, if I run this code without the part that writes to the new topic and reads it back, everything works fine.

I tried to read the topic from the shell using the console consumer command, but no records are displayed. Is there anything I am missing here?
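
One way to double-check what actually landed in the output topic from Spark itself is a one-off batch read (a sketch, assuming the same broker and the songDemo6 topic used above):

    // Batch-read songDemo6 once and print the payloads as strings,
    // just to verify whether any records were written at all.
    val check = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo6")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
    check.selectExpr("CAST(value AS STRING)").show(false)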

Below is my dataset:

>sid,Believer
>sid,Thunder
>sid,Stairway to heaven
>sid,Heaven
>sid,Heaven
>sid,thunder
>sid,Believer    

When I ran @Srinivas's code and read the new topic, I got data like the below:

[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Stairway to heaven, 1]
[[2020-06-07 18:40:40, 2020-06-07 18:41:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Thunder, 1]

Here you can see that for Believer the window frame is the same, yet the entries are still separate. Why is that? It should be a single entry with count 2, since the window frame is the same.

Check the below code.

Added windowedCount.select(struct("*").cast("string").as("value")): before you write anything to Kafka, you have to convert all the columns into a single column of type string, and the alias of that column has to be value.

 val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "songDemo")
    .option("startingOffsets", "earliest")
    .load()

  import spark.implicits._
  df.printSchema()
  val newDf = df.select($"value".cast("string"),$"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
  val windowedCount = newDf
    .withWatermark("timestamp", "40000 milliseconds")
    .groupBy(
      window(col("timestamp"), "20 seconds"), col("songName"))
    .agg(count(col("songName")).alias("numberOfTimes"))


  val outputTopic = windowedCount
    .select(struct("*").cast("string").as("value")) // Added this line.
    .writeStream
    .format("kafka")
    .option("topic", "songDemoA")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("checkpointLocation", "/tmp/spark_ss/")
    .start()


  val finalOutput = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "songDemoA").option("startingOffsets", "earliest")
    .load()
    .writeStream.format("console")
    .outputMode("append").start()

  spark.streams.awaitAnyTermination()
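
As a side note, casting the whole struct to a string is what produces the [[start, end], song, count] style rows shown above. If a keyed payload is preferred, one alternative (a sketch, not part of the original answer) is to serialize each row as JSON instead:

    // Write the aggregated row as a JSON string rather than a struct cast to string,
    // producing values like {"window":{"start":...,"end":...},"songName":...,"numberOfTimes":...}.
    val outputTopicJson = windowedCount
      .select(to_json(struct("*")).as("value"))
      .writeStream
      .format("kafka")
      .option("topic", "songDemoA")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/spark_ss_json/") // hypothetical separate checkpoint dir
      .start()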

Updated - Ordering Output


val windowedCount = newDf
    .withWatermark("timestamp", "40000 milliseconds")
    .groupBy(
      window(col("timestamp"), "20 seconds"), col("songName"))
    .agg(count(col("songName")).alias("numberOfTimes"))
    .orderBy($"window.start".asc) // Add this line if you want order.

Ordering or sorting the result works only if the output mode is complete; for any other value it will throw an error.

For example, check the below code.

val outputTopic = windowedCount
    .writeStream
    .format("console")
    .option("truncate","false")
    .outputMode("complete")
    .start()
