How to apply a Spark schema to a query based on the Kafka topic name in Spark Structured Streaming?

Question

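I have a Spark Structured Streaming job that streams data from multiple Kafka topics using subscribePattern, and I have one Spark schema per Kafka topic. While streaming the data from Kafka, I want to apply the matching Spark schema to each Kafka message based on its topic name.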

Consider that I have two topics: cust and customers.

Streaming data from Kafka based on subscribePattern (a Java regex string):

var df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  // Java regex matched against the whole topic name; ".*" is needed to also match "customers"
  .option("subscribePattern", "cust.*")
  .option("startingOffsets", "earliest") 
  .load()
  .withColumn("value", $"value".cast("string"))
  .filter($"value".isNotNull)

The streaming query above reads data from both topics.

Suppose I have two Spark schemas, one for each topic:

var cust: StructType = new StructType()
    .add("name", StringType)
    .add("age", IntegerType)

var customers: StructType = new StructType()
    .add("id", IntegerType)
    .add("first_name", StringType)
    .add("last_name", StringType)
    .add("email", StringType)
    .add("address", StringType)

Now I want to apply the Spark schema based on the topic name. To do that, I wrote a udf that takes the topic name and returns the schema in DDL format:

val schema = udf((table: String) => (table) match {
    case ("cust")      => cust.toDDL
    case ("customers") => customers.toDDL
    case _             => new StructType().toDDL
  })

Then I use the udf inside the from_json method (I understand that the udf is applied on every column), as shown below:

val query = df
    .withColumn("topic", $"topic".cast("string"))
    .withColumn("data", from_json($"value", schema($"topic")))
    .select($"key", $"topic", $"data.*")
    .writeStream.outputMode("append")
    .format("console")
    .start()
    .awaitTermination()

This gives me the following exception, which is correct, because from_json expects the schema either as a string in DDL format or as a StructType.

org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of UDF(topic);

I would like to know how to achieve this.

Any help will be appreciated!

Answer 1

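What you are trying to do is not possible. A single DataFrame has exactly one schema, so your query df cannot carry 2 different schemas.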

I can think of two ways to do it:

Split your df by topic, then apply your 2 schemas to the 2 resulting dfs (cust and customers).
Merge the 2 schemas into 1 schema and apply it to all topics.
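
The two options above could be sketched roughly as follows. This is only an illustration, assuming Spark with the spark-sql-kafka connector on the classpath; PerTopicSchemas, parseTopic, parseMerged and the query names are hypothetical names introduced here, and the pattern "cust.*" is an assumption chosen so that the regex matches both topic names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object PerTopicSchemas {
  // The two schemas from the question, keyed by Kafka topic name.
  val schemas: Map[String, StructType] = Map(
    "cust" -> new StructType()
      .add("name", StringType)
      .add("age", IntegerType),
    "customers" -> new StructType()
      .add("id", IntegerType)
      .add("first_name", StringType)
      .add("last_name", StringType)
      .add("email", StringType)
      .add("address", StringType)
  )

  // Option 1 (hypothetical helper): keep only the rows of one topic and
  // parse them with that topic's schema, which is fixed at analysis time.
  def parseTopic(df: DataFrame, topic: String, schema: StructType): DataFrame =
    df.filter(col("topic") === topic)
      .withColumn("data", from_json(col("value").cast("string"), schema))
      .select(col("key"), col("topic"), col("data.*"))

  // Option 2 (hypothetical helper): parse every row with the union of all
  // fields. Fields absent from a message come back as null. This assumes
  // that no field name is shared between topics with conflicting types.
  def parseMerged(df: DataFrame): DataFrame = {
    val merged = StructType(schemas.values.flatMap(_.fields).toSeq)
    df.withColumn("data", from_json(col("value").cast("string"), merged))
      .select(col("key"), col("topic"), col("data.*"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-topic-schemas").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribePattern", "cust.*") // assumption: matches both topics
      .option("startingOffsets", "earliest")
      .load()

    // Option 1: start one streaming query per topic.
    schemas.foreach { case (topic, schema) =>
      parseTopic(df, topic, schema)
        .writeStream
        .outputMode("append")
        .format("console")
        .queryName(s"parsed_$topic")
        .start()
    }

    // Block until any of the started queries stops.
    spark.streams.awaitAnyTermination()
  }
}
```

With option 1, each started query reads from Kafka independently and then filters by topic, so the source is consumed once per query. If that overhead matters, parseMerged(df) keeps everything in a single query at the cost of a wide output with many null columns.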
