
How to include kafka timestamp value as columns in spark structured streaming?

I am looking for a way to add Kafka's timestamp value to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a DataFrame from it. My issue is that I also need the timestamp field (from Kafka) along with the other columns.

Here is my current code:

val kafkaDatademostr = spark
  .readStream 
  .format("kafka")
  .option("kafka.bootstrap.servers","zzzz.xxx.xxx.xxx.com:9002")
  .option("subscribe","csvstream")
  .load

val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv")
  .select("csv.*")

val xmlData = interval.selectExpr("split(value,',')[0] as ddd" ,
    "split(value,',')[1] as DFW",
    "split(value,',')[2] as DTG",
    "split(value,',')[3] as CDF",
    "split(value,',')[4] as DFO",
    "split(value,',')[5] as SAD",
    "split(value,',')[6] as DER",
    "split(value,',')[7] as time_for",
    "split(value,',')[8] as fort")

How can I get the timestamp from Kafka and add it as a column along with the other columns?

The timestamp is included in the source schema. Just add timestamp to your select to get it, like below.

val interval = kafkaDatademostr
  .select(col("value").cast("string").alias("value"), col("timestamp"))
  .alias("csv").select("csv.*")
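
As a quick sanity check, interval.printSchema() should now show both columns: the string-typed value column and the Kafka timestamp column, so the downstream split expressions keep working unchanged.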

On the Apache Spark official web page you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).

There you can find information about the schema of the DataFrame that is loaded from Kafka.

Each row from the Kafka source has the following columns:

  • key - message key
  • value - message value
  • topic - message topic name
  • partition - partition the message came from
  • offset - offset of the message
  • timestamp - message timestamp
  • timestampType - timestamp type

All of the above columns are available to query. In your example you use only value, so to get the timestamp you just need to add timestamp to your select statement:

  val allFields = kafkaDatademostr.selectExpr(
    "CAST(value AS STRING) AS csv",
    "CAST(key AS STRING) AS key",
    "topic",
    "partition",
    "offset",
    "timestamp",
    "timestampType"
  )
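
For the question's specific CSV layout, a sketch that carries the Kafka timestamp through the parsing step could look like the following (the column names ddd, DFW, etc. are reused from the question; this is an illustration, not code from the original answer):

// Sketch: keep the Kafka timestamp alongside the parsed CSV fields.
val withTimestamp = kafkaDatademostr
  .selectExpr("CAST(value AS STRING) AS csv", "timestamp")
  .selectExpr(
    "split(csv, ',')[0] AS ddd",
    "split(csv, ',')[1] AS DFW",
    "split(csv, ',')[2] AS DTG",
    "split(csv, ',')[3] AS CDF",
    "split(csv, ',')[4] AS DFO",
    "split(csv, ',')[5] AS SAD",
    "split(csv, ',')[6] AS DER",
    "split(csv, ',')[7] AS time_for",
    "split(csv, ',')[8] AS fort",
    "timestamp")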

In my case, Kafka delivered the values in JSON format, containing the actual data along with the original event time rather than the Kafka timestamp. Below is the schema.

import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("time", LongType),
  StructField("close", DoubleType)
))
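
For illustration, a hypothetical message payload matching this schema (the values here are made up) might look like:

{"time": 1629453600, "close": 4441.25}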

In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field into the timestamp format.

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._

val df1 = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", mySchema).as("data"))
  .select(col("data.time").cast("timestamp").alias("time"), col("data.close"))

Now you can use the time field for window operations as well as for watermarking.

import org.apache.spark.sql.functions.window

val windowedData = df1.withWatermark("time", "1 minute")
  .groupBy(
    window(col("time"), "1 minute", "30 seconds"),
    $"close"
  ).count()
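
To actually run the aggregation you still need to start a streaming query with a sink. A minimal sketch using the console sink (the trigger interval here is an arbitrary choice for illustration) might be:

import org.apache.spark.sql.streaming.Trigger

// Sketch: print each windowed count to the console as the watermark advances.
val query = windowedData.writeStream
  .outputMode("append")                          // valid with watermark + window aggregation
  .format("console")
  .option("truncate", "false")
  .trigger(Trigger.ProcessingTime("30 seconds")) // arbitrary trigger interval
  .start()

query.awaitTermination()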

I hope this answer clarifies things.
