
Spark WindowSpec lag function calculating cumulative scores

I have a dataframe with scores for each day and I want to calculate a cumulative running score for each user. I need to sum the previous day's cumulative score with today's score in a new column. I tried the lag function for this calculation, but for some reason it gives an error.

Here is the code I tried:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val genre = sc.parallelize(List(
  ("Alice", "2016-05-01", "action", 0),
  ("Alice", "2016-05-02", "0", 1),
  ("Alice", "2016-05-03", "comedy", 0),
  ("Alice", "2016-05-04", "action", 1),
  ("Alice", "2016-05-05", "action", 0),
  ("Alice", "2016-05-06", "horror", 1),
  ("Bob", "2016-05-01", "art", 0),
  ("Bob", "2016-05-02", "0", 1),
  ("Bob", "2016-05-03", "0", 0),
  ("Bob", "2016-05-04", "art", 0),
  ("Bob", "2016-05-05", "comedy", 1),
  ("Bob", "2016-05-06", "action", 0)
)).toDF("name", "date", "genre", "score")

val wSpec2 = Window.partitionBy("name", "genre").orderBy("date").rowsBetween(Long.MinValue, 0)
genre.withColumn("CumScore", genre("score") * 2 + lag("CumScore", 1).over(wSpec2) * 2).show()

dataframe:

+-----+----------+------+-----+
| name|      date| genre|score|
+-----+----------+------+-----+
|Alice|2016-05-01|action|    0|
|Alice|2016-05-02|     0|    1|
|Alice|2016-05-03|comedy|    0|
|Alice|2016-05-04|action|    1|
|Alice|2016-05-05|action|    0|
|Alice|2016-05-06|horror|    1|
|  Bob|2016-05-01|   art|    0|
|  Bob|2016-05-02|     0|    1|
|  Bob|2016-05-03|     0|    0|
|  Bob|2016-05-04|   art|    0|
|  Bob|2016-05-05|comedy|    1|
|  Bob|2016-05-06|action|    0|
+-----+----------+------+-----+

The error I am getting:

org.apache.spark.sql.AnalysisException: Window Frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()) must match the required frame specifiedwindowframe(RowFrame, -1, -1);
    at org.apa

I tried the following approach:

val wSpec2 = Window.partitionBy("name", "genre").orderBy("date").rowsBetween(Long.MinValue, 0)
val test = genre.withColumn("CumScore", genre("score") * 2)
test.show()
val wSpec3 = Window.partitionBy("name").orderBy("date")
test.withColumn("CumScore_1", test("CumScore") + lag(test("CumScore"), 1).over(wSpec3)).show()

We need to define another window specification, because we should not specify a row frame when summing the previous day's cumulative score with today's score in a new column.
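One caveat with this lag-based approach: lag returns null for the first row of each partition, so the first CumScore_1 for each user comes out null. A minimal sketch of one way to handle that (treating the missing previous value as 0 is an assumption, not part of the original answer; coalesce and lit come from org.apache.spark.sql.functions, already imported above):

val wSpec3 = Window.partitionBy("name").orderBy("date")
test.withColumn(
  "CumScore_1",
  test("CumScore") + coalesce(lag(test("CumScore"), 1).over(wSpec3), lit(0))
).show()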

You can refer to: http://xinhstechblog.blogspot.in/2016/04/spark-window-functions-for-dataframes.html

There is no need to use lag; simply use a window partitioned on the user and then use sum:

val window = Window.partitionBy("name").orderBy("date").rowsBetween(Long.MinValue, 0)
genre.withColumn("CumScore", sum($"score").over(window)).show()
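Note that the $"score" syntax requires import spark.implicits._ when run outside the spark-shell (the shell imports it automatically); an equivalent spelling using col from org.apache.spark.sql.functions, shown here only as an alternative, is:

genre.withColumn("CumScore", sum(col("score")).over(window)).show()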

Using the input data from the question, this will give:

+-----+----------+------+-----+--------+
| name|      date| genre|score|CumScore|
+-----+----------+------+-----+--------+
|  Bob|2016-05-01|   art|    0|       0|
|  Bob|2016-05-02|     0|    1|       1|
|  Bob|2016-05-03|     0|    0|       1|
|  Bob|2016-05-04|   art|    0|       1|
|  Bob|2016-05-05|comedy|    1|       2|
|  Bob|2016-05-06|action|    0|       2|
|Alice|2016-05-01|action|    0|       0|
|Alice|2016-05-02|     0|    1|       1|
|Alice|2016-05-03|comedy|    0|       1|
|Alice|2016-05-04|action|    1|       2|
|Alice|2016-05-05|action|    0|       2|
|Alice|2016-05-06|horror|    1|       3|
+-----+----------+------+-----+--------+
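If the factor-of-2 weighting from the original attempt is still wanted, the same pattern applies by scaling the column inside the windowed sum; a small sketch (the factor of 2 is taken from the question's code, everything else is unchanged):

genre.withColumn("CumScore", sum($"score" * 2).over(window)).show()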

The problem with using lag here is that the column is referenced in the same withColumn expression that creates it. Even though it is the previous row's value that is being referred to, this is not allowed.
