[英]How to get the lag of a column in a Spark streaming dataframe?
我已经以这种格式将数据流式传输到我的Spark Scala应用程序中
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
我将其读入id,mark1,mark2,mark3和time列。 时间也将转换为日期时间格式。 我想按ID分组,并获得mark1的滞后时间,该滞后给出前一行的mark1值。 像这样:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
将数据框视为markDF。 我努力了:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))`
这表示非时间窗口不能应用于流式/追加式数据集/帧。
我也尝试过:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
为了获得一个窗口,行不起作用。 流窗口之类的window("timestamp", "10 minutes")
: window("timestamp", "10 minutes")
不能用于发送滞后。 我对如何做到这一点非常困惑。 任何帮助都是极好的!!
我建议您将time
列更改为String
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
之后,执行以下操作
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
这将给你输出为
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.