
Unable to write results to kafka topic using spark

My end goal is to write the aggregated data out, in batches, to a new Kafka topic and read it back. I followed the official documentation and a few other posts, but no luck. I first read the source topic, perform the aggregation, save the result into another Kafka topic, then read that topic again and print it to the console. Below is my code:

package com.sparkKafka
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming._
import scala.concurrent.duration._
object SparkKafkaTopic3 {
  def main(ar: Array[String]) {
    val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo5")
      .option("startingOffsets", "earliest")
      .load()

    import spark.implicits._
    df.printSchema()
    val newDf = df.select($"value".cast("string"), $"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
    val windowedCount = newDf
      .withWatermark("timestamp", "40000 milliseconds")
      .groupBy(
        window(col("timestamp"), "20 seconds"), col("songName"))
      .agg(count(col("songName")).alias("numberOfTimes"))


    val outputTopic = windowedCount
      .select(struct("*").cast("string").as("value")) // Added this line.
      .writeStream
      .format("kafka")
      .option("topic", "songDemo6")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("checkpointLocation", "/tmp/spark_ss/")
      .start()

    val finalOutput = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo6").option("startingOffsets", "earliest")
      .load()
      .writeStream.format("console")
      .outputMode("append").start()

    spark.streams.awaitAnyTermination()

  }
}

When I run it, I initially get the following exception in the console:

java.lang.IllegalStateException: Cannot find earliest offsets of Set(songDemo4-0). Some data may have been missed. 
Some data may have been lost because they are not available in Kafka any more; either the
 data was aged out by Kafka or the topic may have been deleted before all the data in the
 topic was processed. If you don't want your streaming query to fail on such cases, set the
 source option "failOnDataLoss" to "false".
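
As the message suggests, the query can be told to tolerate the missing offsets instead of failing. A minimal sketch of the source with that option set (same broker and topic as in the code above; only the extra option is new):

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo5")
      .option("startingOffsets", "earliest")
      .option("failOnDataLoss", "false") // keep running when offsets were aged out or the topic was recreated
      .load()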

Also, if I run this code without the part that writes to the new topic and reads it back, everything works fine.

I tried reading the topic from the shell with the consumer command, but no records show up. Am I missing something here?
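
For completeness, the same check can be done from Spark itself with a one-off batch read of the output topic (a sketch, assuming the same SparkSession spark and the songDemo6 topic from the code above):

    val check = spark.read // batch read, not readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "songDemo6")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()
      .selectExpr("CAST(value AS STRING)")

    check.show(false)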

Below is my dataset:

>sid,Believer
>sid,Thunder
>sid,Stairway to heaven
>sid,Heaven
>sid,Heaven
>sid,thunder
>sid,Believer    

After running @Srinivas's code and reading the new topic, I get data like this:

[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:18:40, 2020-06-07 18:19:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Believer, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Stairway to heaven, 1]
[[2020-06-07 18:40:40, 2020-06-07 18:41:00], Heaven, 1]
[[2020-06-07 18:17:00, 2020-06-07 18:17:20], Thunder, 1]

Here you can see that for Believer the window frame is the same, yet the entries are separate. Why is that? It should be a single entry with a count of 2, since the window frame is the same.

Check the code below.

Add windowedCount.select(struct("*").cast("string").as("value")): before writing anything to Kafka, you have to convert all the columns into a single string column aliased as value.

 val spark = SparkSession.builder().appName("SparkKafka").master("local[*]").getOrCreate()
  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "songDemo")
    .option("startingOffsets", "earliest")
    .load()

  import spark.implicits._
  df.printSchema()
  val newDf = df.select($"value".cast("string"),$"timestamp").select(split(col("value"), ",")(0).as("userName"), split(col("value"), ",")(1).as("songName"), col("timestamp"))
  val windowedCount = newDf
    .withWatermark("timestamp", "40000 milliseconds")
    .groupBy(
      window(col("timestamp"), "20 seconds"), col("songName"))
    .agg(count(col("songName")).alias("numberOfTimes"))


  val outputTopic = windowedCount
    .select(struct("*").cast("string").as("value")) // Added this line.
    .writeStream
    .format("kafka")
    .option("topic", "songDemoA")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("checkpointLocation", "/tmp/spark_ss/")
    .start()


  val finalOutput = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "songDemoA").option("startingOffsets", "earliest")
    .load()
    .writeStream.format("console")
    .outputMode("append").start()

  spark.streams.awaitAnyTermination()

Update - ordering the output


val windowedCount = newDf
    .withWatermark("timestamp", "40000 milliseconds")
    .groupBy(
      window(col("timestamp"), "20 seconds"), col("songName"))
    .agg(count(col("songName")).alias("numberOfTimes"))
    .orderBy($"window.start".asc) // Add this line if you want order.

Sorting or ordering the results only works with output mode complete; any other value throws an error.

For example, check the code below.

val outputTopic = windowedCount
    .writeStream
    .format("console")
    .option("truncate","false")
    .outputMode("complete")
    .start()
