
Using Spark window function lead to create column in dataframe

I would like to create a new column holding the value from the previous date (the latest date earlier than the current row's date) within each group of ids, for the dataframe below:

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
|  a|2015-04-11|  300|
|  a|2015-04-12|  400|
|  a|2015-04-12|  200|
|  a|2015-04-12|  100|
|  a|2015-04-11|  700|
|  b|2015-04-02|  100|
|  b|2015-04-12|  100|
|  c|2015-04-12|  400|
+---+----------+-----+

I have tried the lead window function.

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.{col, lead}

 val df1 = Seq(
   ("a","2015-04-11",300), ("a","2015-04-12",400), ("a","2015-04-12",200),
   ("a","2015-04-12",100), ("a","2015-04-11",700), ("b","2015-04-02",100),
   ("b","2015-04-12",100), ("c","2015-04-12",400)
 ).toDF("id","date","value")

 val w1 = Window.partitionBy("id").orderBy(col("date").desc)
 val leadc1 = lead(df1("value"), 1).over(w1)
 val df2 = df1.withColumn("nvalue", leadc1)

+---+----------+-----+------+                                                   
| id|      date|value|nvalue|
+---+----------+-----+------+
|  a|2015-04-12|  400|   200|
|  a|2015-04-12|  200|   100|
|  a|2015-04-12|  100|   300|
|  a|2015-04-11|  300|   700|
|  a|2015-04-11|  700|  null|
|  b|2015-04-12|  100|   100|
|  b|2015-04-02|  100|  null|
|  c|2015-04-12|  400|  null|
+---+----------+-----+------+

But as we can see, when id "a" has several rows with the same date I get the wrong result. The result should be:

+---+----------+-----+------+                                                   
| id|      date|value|nvalue|
+---+----------+-----+------+
|  a|2015-04-12|  400|   300|
|  a|2015-04-12|  200|   300|
|  a|2015-04-12|  100|   300|
|  a|2015-04-11|  300|  null|
|  a|2015-04-11|  700|  null|
|  b|2015-04-12|  100|   100|
|  b|2015-04-02|  100|  null|
|  c|2015-04-12|  400|  null|
+---+----------+-----+------+

I already have a solution using a join, but I am looking for a solution using a window function.

Thanks

The issue is that you have multiple rows with the same date. lead takes the value from the next row in the result set, not from the next date. So when you sort the rows by date in descending order, the next row can have the same date as the current one.
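The row-based behaviour can be illustrated with plain Scala collections (a stand-in for the DataFrame, not Spark code; the ordering among equal dates is arbitrary in Spark, and the order below is assumed to match the output in the question):

```scala
// Rows for id "a", sorted by date descending, as the window sees them.
val rows = List(("2015-04-12", 400), ("2015-04-12", 200), ("2015-04-12", 100),
                ("2015-04-11", 300), ("2015-04-11", 700))

// lead(value, 1): each row is paired with the value of the NEXT ROW,
// regardless of whether that row has the same date.
val withLead = rows.zip(rows.drop(1).map(r => Some(r._2)) :+ None)
// → (400, Some(200)), (200, Some(100)), (100, Some(300)),
//   (300, Some(700)), (700, None) — exactly the unwanted result above.
```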

How do you identify the correct value to use for a particular date? For example, why do you take 300 from (id=a, date=2015-04-11) and not 700?

To do this with window functions you may need multiple passes - the second pass would take the last nvalue and apply it to all rows in the same id/date group - but I'm not sure how your rows are initially ordered.

 val df1 = Seq(
   ("a","2015-04-11",300), ("a","2015-04-12",400), ("a","2015-04-12",200),
   ("a","2015-04-12",100), ("a","2015-04-11",700), ("b","2015-04-02",100),
   ("b","2015-04-12",100), ("c","2015-04-12",400)
 ).toDF("id","date","value")

 val w1 = Window.partitionBy("id").orderBy(col("date").desc)
 val leadc1 = lead(df1("value"), 1).over(w1)
 val df2 = df1.withColumn("nvalue", leadc1)

 // Second pass: spread the last nvalue across the whole id/date group.
 // The frame must cover the entire group, hence the explicit rowsBetween.
 val w2 = Window.partitionBy("id", "date")
   .orderBy("??? some way to distinguish row ordering")
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
 val df3 = df2.withColumn("nvalue2", last("nvalue").over(w2))
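The date-based semantics the question asks for can be sketched with plain Scala collections (not Spark): pick one representative value per (id, date), then give every row the representative value of the next-smaller date in its id group. Taking the first value in input order as the representative is an assumption that happens to match the expected output - this is exactly the ambiguity raised above.

```scala
val data = List(("a", "2015-04-11", 300), ("a", "2015-04-12", 400),
                ("a", "2015-04-12", 200), ("a", "2015-04-12", 100),
                ("a", "2015-04-11", 700), ("b", "2015-04-02", 100),
                ("b", "2015-04-12", 100), ("c", "2015-04-12", 400))

// For each id, map every date to the representative value of the
// previous (next-smaller) date; None when there is no earlier date.
val prevValue: Map[(String, String), Option[Int]] =
  data.groupBy(_._1).flatMap { case (id, rows) =>
    // One representative value per date: first in input order (assumption).
    val byDate = rows.map(r => (r._2, r._3))
      .foldLeft(Vector.empty[(String, Int)]) { (acc, r) =>
        if (acc.exists(_._1 == r._1)) acc else acc :+ r
      }
    val dates = byDate.map(_._1).sorted   // ascending date order
    val rep   = byDate.toMap              // date -> representative value
    dates.zipWithIndex.map { case (d, i) =>
      (id, d) -> (if (i == 0) None else Some(rep(dates(i - 1))))
    }
  }

// Attach the previous-date value to every original row.
val result = data.map { case (id, date, value) =>
  (id, date, value, prevValue((id, date)))
}
// All three (a, 2015-04-12) rows get Some(300), both (a, 2015-04-11)
// rows get None - matching the expected table in the question.
```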
