How to get the lag of a column in a Spark streaming dataframe?

I have data streaming into my Spark Scala application in this format:

id    mark1 mark2 mark3 time
uuid1 100   200   300   Tue Aug  8 14:06:02 PDT 2017
uuid1 100   200   300   Tue Aug  8 14:06:22 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:32 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:52 PDT 2017
uuid2 150   250   350   Tue Aug  8 14:06:58 PDT 2017

I have it read into columns id, mark1, mark2, mark3, and time; the time is converted to datetime format as well. I want to group this by id and get the lag of mark1, which gives the previous row's mark1 value. Something like this:

id    mark1 mark2 mark3 prev_mark time
uuid1 100   200   300   null      Tue Aug  8 14:06:02 PDT 2017
uuid1 100   200   300   100       Tue Aug  8 14:06:22 PDT 2017
uuid2 150   250   350   null      Tue Aug  8 14:06:32 PDT 2017
uuid2 150   250   350   150       Tue Aug  8 14:06:52 PDT 2017
uuid2 150   250   350   150       Tue Aug  8 14:06:58 PDT 2017

Consider the dataframe to be markDF. I have tried:

val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))`

which fails with the error that non-time-based windows are not supported on streaming DataFrames/Datasets.

I have also tried:

val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = newDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))

This was meant to get a window spanning a few rows, but it did not work either. A streaming time window such as window("timestamp", "10 minutes") cannot be used to compute the lag. I am super confused about how to do this. Any help would be awesome!
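For reference, the window form that streaming does accept is a time-bucketed aggregation, sketched below (streamDF stands in for the streaming DataFrame; an assumption for illustration). It groups rows into 10-minute buckets per id but never exposes the previous row, so it cannot express lag:

import org.apache.spark.sql.functions.{avg, col, window}

// Time-based window aggregation, the form streaming accepts: rows are
// bucketed into 10-minute event-time intervals per id and aggregated
// within each bucket, with no access to the previous row
val bucketed = streamDF
  .groupBy(col("id"), window(col("time"), "10 minutes"))
  .agg(avg("mark1").as("avg_mark1"))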

I would advise you to change the time column into String, as in:

+-----+-----+-----+-----+----------------------------+
|id   |mark1|mark2|mark3|time                        |
+-----+-----+-----+-----+----------------------------+
|uuid1|100  |200  |300  |Tue Aug  8 14:06:02 PDT 2017|
|uuid1|100  |200  |300  |Tue Aug  8 14:06:22 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:32 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:52 PDT 2017|
|uuid2|150  |250  |350  |Tue Aug  8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+

root
 |-- id: string (nullable = true)
 |-- mark1: integer (nullable = false)
 |-- mark2: integer (nullable = false)
 |-- mark3: integer (nullable = false)
 |-- time: string (nullable = true)
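If the time column has already been converted to a timestamp, a minimal sketch of turning it back into a String (the cast produces Spark's default yyyy-MM-dd HH:mm:ss text, which also sorts chronologically):

import org.apache.spark.sql.functions.col

// Cast an existing timestamp column back to String; the default
// "yyyy-MM-dd HH:mm:ss" text sorts in chronological order
val strDF = df.withColumn("time", col("time").cast("string"))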

After that, doing the following should work. (Ordering by the String time works here because every row falls on the same date, so the lexical order of the strings matches the chronological order; with mixed dates you would use a sortable format or a proper timestamp.)

df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))

which will give you output as:

+-----+-----+-----+-----+----------------------------+---------+
|id   |mark1|mark2|mark3|time                        |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100  |200  |300  |Tue Aug  8 14:06:02 PDT 2017|null     |
|uuid1|100  |200  |300  |Tue Aug  8 14:06:22 PDT 2017|100      |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:32 PDT 2017|null     |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:52 PDT 2017|150      |
|uuid2|150  |250  |350  |Tue Aug  8 14:06:58 PDT 2017|150      |
+-----+-----+-----+-----+----------------------------+---------+
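For completeness, a minimal self-contained batch sketch of the above (the SparkSession setup and the toDF call are assumptions for illustration; the column names and sample rows come from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val spark = SparkSession.builder().master("local[*]").appName("lag-example").getOrCreate()
import spark.implicits._

// Sample rows from above, with time kept as a plain String
val df = Seq(
  ("uuid1", 100, 200, 300, "Tue Aug  8 14:06:02 PDT 2017"),
  ("uuid1", 100, 200, 300, "Tue Aug  8 14:06:22 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug  8 14:06:32 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug  8 14:06:52 PDT 2017"),
  ("uuid2", 150, 250, 350, "Tue Aug  8 14:06:58 PDT 2017")
).toDF("id", "mark1", "mark2", "mark3", "time")

// lag("mark1", 1) yields the previous row's mark1 within each id partition,
// or null for the first row of a partition
val result = df.withColumn("prev_mark",
  lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))

result.show(false)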
