How to get the lag of a column in a Spark streaming dataframe?
I have data streaming into my Spark Scala application in this format:
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I have read it into the columns id, mark1, mark2, mark3 and time, and the time is converted to a datetime format as well. I want to group this by id and get the lag of mark1, that is, the previous row's mark1 value. Something like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Consider the dataframe to be markDF. I have tried:
val window = Window.partitionBy("id").orderBy("time")
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails with an error saying that non-time-based windows cannot be applied on streaming DataFrames/Datasets.
I have also tried:
val window = Window.partitionBy("id").orderBy("time").rowsBetween(-10, 10)
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
to get a window over a few rows, which did not work either. A streaming window such as window("time", "10 minutes") cannot be passed to over() for the lag. I am super confused about how to do this. Any help would be awesome!!
I would advise you to change the time column to String, as follows:
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
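If the time column currently holds a timestamp, one possible way to get it as a string is a plain cast. This is only a sketch: the object name TimeToStringSketch, the helper timeAsString and the local SparkSession are assumptions made for illustration. Note that a cast produces strings like 2017-08-08 14:06:02 rather than the Tue Aug 8 ... form shown above.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original post.
object TimeToStringSketch {
  // Replaces the "time" column with its string representation.
  def timeAsString(df: DataFrame): DataFrame =
    df.withColumn("time", col("time").cast("string"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("time-to-string-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small static sample with a real timestamp column (assumed input shape).
    val df = Seq(
      ("uuid1", 100, 200, 300, Timestamp.valueOf("2017-08-08 14:06:02")),
      ("uuid1", 100, 200, 300, Timestamp.valueOf("2017-08-08 14:06:22"))
    ).toDF("id", "mark1", "mark2", "mark3", "time")

    // The schema should now report time as a string column.
    timeAsString(df).printSchema()
    spark.stop()
  }
}
```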
After that, the following should work:
df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
This will give you the following output:
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
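For reference, here is a self-contained sketch of the same call with the imports it needs. It runs against a static (batch) DataFrame built from the sample rows, so it does not by itself show that the window is accepted on a streaming DataFrame. The object name LagSketch, the helper withPrevMark and the local SparkSession are assumptions for illustration. Also keep in mind that ordering by a string column is lexicographic; it matches chronological order for these sample rows, but that is not guaranteed for arbitrary dates in this format.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

// Hypothetical helper object, not part of the original post.
object LagSketch {
  // Adds prev_mark: the previous row's mark1 within each id, ordered by time.
  def withPrevMark(df: DataFrame): DataFrame = {
    val byIdOrderedByTime = Window.partitionBy("id").orderBy("time")
    df.withColumn("prev_mark", lag("mark1", 1).over(byIdOrderedByTime))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the app name is arbitrary.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lag-sketch")
      .getOrCreate()
    import spark.implicits._

    // Static sample mirroring the rows above, with time kept as a string.
    val df = Seq(
      ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:02 PDT 2017"),
      ("uuid1", 100, 200, 300, "Tue Aug 8 14:06:22 PDT 2017"),
      ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:32 PDT 2017"),
      ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:52 PDT 2017"),
      ("uuid2", 150, 250, 350, "Tue Aug 8 14:06:58 PDT 2017")
    ).toDF("id", "mark1", "mark2", "mark3", "time")

    withPrevMark(df).show(false)
    spark.stop()
  }
}
```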