
Parse JSON for Spark Structured Streaming

I have implemented Spark Structured Streaming, and for my use case I have to specify the starting offsets.

Also, I have the offset values in the form of an Array[String]:

{"topic":"test","partition":0,"starting_offset":123}
{"topic":"test","partition":1,"starting_offset":456}

I want to programmatically convert this into the following so that I can pass it to Spark:

{"test":{"0":123,"1":456}}

Note: this is just an example. I keep getting different offset ranges, so I cannot hard-code them.
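For reference, this nested shape ({"topic":{"partitionId":offset}}) is what Spark's Kafka source accepts in its startingOffsets option. A minimal sketch of passing such a string, assuming a SparkSession named spark and a placeholder broker address:

scala> val df = spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker address
         .option("subscribe", "test")
         .option("startingOffsets", """{"test":{"0":123,"1":456}}""")
         .load()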

If array is the variable holding the list you described, then in Python:

>>> import json
>>> parsed = [json.loads(s) for s in array]  # each element is a JSON string, so parse it first
>>> [{d['topic']: [d['partition'], d['starting_offset']]} for d in parsed]
[{'test': [0, 123]}, {'test': [1, 456]}]

Note that this produces one dict per partition rather than the nested shape in the question; the Scala approach below groups the entries by topic.
In Scala you can do the same with json4s:

scala> import org.json4s._
scala> import org.json4s.jackson.JsonMethods._
scala> implicit val formats: Formats = DefaultFormats

scala> val topicAsRawStr: Array[String] = Array(
          """{"topic":"test","partition":0,"starting_offset":123}""",
          """{"topic":"test","partition":1,"starting_offset":456}""")

scala> val topicAsJSONs = topicAsRawStr.map { rawText =>
         val json = parse(rawText)
         val topicName = (json \ "topic").extract[String]        // extract topic value
         val partitionId = (json \ "partition").extract[Int]     // extract partition id
         val offset = (json \ "starting_offset").extract[Long]   // extract starting_offset
         (topicName, partitionId, offset)
       }

scala> // Aggregate offsets for each topic
scala> val offsetsByTopic = topicAsJSONs
         .groupBy { case (topic, _, _) => topic }
         .map { case (topic, rows) =>
           topic -> rows.map { case (_, p, o) => p.toString -> o }.toMap
         }
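From there, a short way to render the grouped map as the exact string from the question is json4s's jackson Serialization (a sketch reusing offsetsByTopic and the implicit formats defined above):

scala> import org.json4s.jackson.Serialization
scala> val startingOffsets = Serialization.write(offsetsByTopic)
scala> // startingOffsets: String = {"test":{"0":123,"1":456}}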

You can also use the spark.sparkContext.parallelize API:

scala> case class KafkaTopic(topic: String, partition: Int, starting_offset: Long)  // field names match the JSON keys

scala> import org.apache.spark.sql.SparkSession
scala> val spark: SparkSession = ???
scala> import spark.implicits._

scala> val topicAsRawStr: Array[String] = Array(
          """{"topic":"test","partition":0,"starting_offset":123}""",
          """{"topic":"test","partition":1,"starting_offset":456}""")

scala> val topicAsJSONs = topicAsRawStr.map(line => parse(line).extract[KafkaTopic])

scala> val kafkaTopicDS = spark.sparkContext.parallelize(topicAsJSONs).toDS()

scala> val aggregatedOffsetsByTopic = kafkaTopicDS
         .groupByKey(_.topic)                        // group records by topic name
         .mapGroups { case (topicName, kafkaTopics) =>
           val offsets = kafkaTopics.map(_.starting_offset).toSeq
           topicName -> offsets
         }
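To feed this into the startingOffsets option you still need a single JSON string on the driver, so one sketch is to collect the (small) parsed records and serialize the partition-to-offset map per topic, again with json4s:

scala> import org.json4s.jackson.Serialization
scala> val startingOffsets = Serialization.write(
         kafkaTopicDS.collect()
           .groupBy(_.topic)
           .map { case (t, ks) => t -> ks.map(k => k.partition.toString -> k.starting_offset).toMap }
       )
scala> // startingOffsets: String = {"test":{"0":123,"1":456}}

For a handful of offsets, the round trip through parallelize and collect is unnecessary; grouping the parsed array directly on the driver, as in the first snippet, is simpler.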

