
Using Spark window function lead to create column in dataframe

I would like to create a new column holding the value from the previous date (the latest date earlier than the current row's date) within each group of ids, for the dataframe below:

+---+----------+-----+
| id|      date|value|
+---+----------+-----+
|  a|2015-04-11|  300|
|  a|2015-04-12|  400|
|  a|2015-04-12|  200|
|  a|2015-04-12|  100|
|  a|2015-04-11|  700|
|  b|2015-04-02|  100|
|  b|2015-04-12|  100|
|  c|2015-04-12|  400|
+---+----------+-----+

I have tried the lead window function.

 import org.apache.spark.sql.expressions.Window
 import org.apache.spark.sql.functions.{col, lead}

 val df1 = Seq(
   ("a","2015-04-11",300), ("a","2015-04-12",400), ("a","2015-04-12",200),
   ("a","2015-04-12",100), ("a","2015-04-11",700), ("b","2015-04-02",100),
   ("b","2015-04-12",100), ("c","2015-04-12",400)
 ).toDF("id","date","value")

 val w1 = Window.partitionBy("id").orderBy(col("date").desc)
 val leadc1 = lead(df1("value"), 1).over(w1)
 val df2 = df1.withColumn("nvalue", leadc1)

+---+----------+-----+------+                                                   
| id|      date|value|nvalue|
+---+----------+-----+------+
|  a|2015-04-12|  400|   200|
|  a|2015-04-12|  200|   100|
|  a|2015-04-12|  100|   300|
|  a|2015-04-11|  300|   700|
|  a|2015-04-11|  700|  null|
|  b|2015-04-12|  100|   100|
|  b|2015-04-02|  100|  null|
|  c|2015-04-12|  400|  null|
+---+----------+-----+------+

But as we can see, when id "a" has several rows with the same date I get the wrong result. The result should be:

+---+----------+-----+------+                                                   
| id|      date|value|nvalue|
+---+----------+-----+------+
|  a|2015-04-12|  400|   300|
|  a|2015-04-12|  200|   300|
|  a|2015-04-12|  100|   300|
|  a|2015-04-11|  300|  null|
|  a|2015-04-11|  700|  null|
|  b|2015-04-12|  100|   100|
|  b|2015-04-02|  100|  null|
|  c|2015-04-12|  400|  null|
+---+----------+-----+------+

I already have a solution using a join, but I am looking for a solution using a window function.

Thanks

The issue is that you have multiple rows with the same date. lead takes the value from the next row in the result set, not from the next date. So when you sort the rows by date in descending order, the next row can have the same date as the current one.
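The row-based behaviour can be illustrated with plain Scala collections (a stand-in for the DataFrame, not Spark code; the ordering among equal dates is arbitrary in Spark, and the order below is assumed to match the output in the question):

```scala
// Rows for id "a", sorted by date descending, as the window sees them.
val rows = List(("2015-04-12", 400), ("2015-04-12", 200), ("2015-04-12", 100),
                ("2015-04-11", 300), ("2015-04-11", 700))

// lead(value, 1): each row is paired with the value of the NEXT ROW,
// regardless of whether that row has the same date.
val withLead = rows.zip(rows.drop(1).map(r => Some(r._2)) :+ None)
// → (400, Some(200)), (200, Some(100)), (100, Some(300)),
//   (300, Some(700)), (700, None) — exactly the unwanted result above.
```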

How do you identify the correct value to use for a particular date? For example, why do you take 300 from (id=a, date=2015-04-11) and not 700?

To do this with window functions you may need multiple passes - the second pass would take the last nvalue and apply it to all rows in the same id/date group - but I'm not sure how your rows are initially ordered.

 val df1 = Seq(
   ("a","2015-04-11",300), ("a","2015-04-12",400), ("a","2015-04-12",200),
   ("a","2015-04-12",100), ("a","2015-04-11",700), ("b","2015-04-02",100),
   ("b","2015-04-12",100), ("c","2015-04-12",400)
 ).toDF("id","date","value")

 val w1 = Window.partitionBy("id").orderBy(col("date").desc)
 val leadc1 = lead(df1("value"), 1).over(w1)
 val df2 = df1.withColumn("nvalue", leadc1)

 // Second pass: spread the last nvalue across the whole id/date group.
 // The frame must cover the entire group, hence the explicit rowsBetween.
 val w2 = Window.partitionBy("id", "date")
   .orderBy("??? some way to distinguish row ordering")
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
 val df3 = df2.withColumn("nvalue2", last("nvalue").over(w2))
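The date-based semantics the question asks for can be sketched with plain Scala collections (not Spark): pick one representative value per (id, date), then give every row the representative value of the next-smaller date in its id group. Taking the first value in input order as the representative is an assumption that happens to match the expected output - this is exactly the ambiguity raised above.

```scala
val data = List(("a", "2015-04-11", 300), ("a", "2015-04-12", 400),
                ("a", "2015-04-12", 200), ("a", "2015-04-12", 100),
                ("a", "2015-04-11", 700), ("b", "2015-04-02", 100),
                ("b", "2015-04-12", 100), ("c", "2015-04-12", 400))

// For each id, map every date to the representative value of the
// previous (next-smaller) date; None when there is no earlier date.
val prevValue: Map[(String, String), Option[Int]] =
  data.groupBy(_._1).flatMap { case (id, rows) =>
    // One representative value per date: first in input order (assumption).
    val byDate = rows.map(r => (r._2, r._3))
      .foldLeft(Vector.empty[(String, Int)]) { (acc, r) =>
        if (acc.exists(_._1 == r._1)) acc else acc :+ r
      }
    val dates = byDate.map(_._1).sorted   // ascending date order
    val rep   = byDate.toMap              // date -> representative value
    dates.zipWithIndex.map { case (d, i) =>
      (id, d) -> (if (i == 0) None else Some(rep(dates(i - 1))))
    }
  }

// Attach the previous-date value to every original row.
val result = data.map { case (id, date, value) =>
  (id, date, value, prevValue((id, date)))
}
// All three (a, 2015-04-12) rows get Some(300), both (a, 2015-04-11)
// rows get None - matching the expected table in the question.
```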
