Using lag function in Spark Scala to bring values from another column

I have a dataframe such as the following, but with several different items in the column "person".

// Assuming a SparkSession named `spark` (as in spark-shell);
// its implicits provide .toDF and the $"..." column syntax.
import spark.implicits._

val df_beginning = Seq(("2022-06-06", "person1", 1),
             ("2022-06-13", "person1", 1),
             ("2022-06-20", "person1", 1),
             ("2022-06-27", "person1", 0),
             ("2022-07-04", "person1", 0),
             ("2022-07-11", "person1", 1),
             ("2022-07-18", "person1", 1),
             ("2022-07-25", "person1", 0),
             ("2022-08-01", "person1", 0),
             ("2022-08-08", "person1", 1),
             ("2022-08-15", "person1", 1),
             ("2022-08-22", "person1", 1),
             ("2022-08-29", "person1", 1))
.toDF("week", "person", "person_active_flag")
.orderBy($"week")


I want to create a new column containing the week in which the current chain of person_active_flag values of 1 started. In the end, it would look something like this:

val df_beginning = Seq(("2022-06-06", "person1", 1, "2022-06-06"),
             ("2022-06-13", "person1", 1, "2022-06-06"),
             ("2022-06-20", "person1", 1, "2022-06-06"),
             ("2022-06-27", "person1", 0, "0"),
             ("2022-07-04", "person1", 0, "0"),
             ("2022-07-11", "person1", 1, "2022-07-11"),
             ("2022-07-18", "person1", 1, "2022-07-11"),
             ("2022-07-25", "person1", 0, "0"),
             ("2022-08-01", "person1", 0, "0"),
             ("2022-08-08", "person1", 1, "2022-08-08"),
             ("2022-08-15", "person1", 1, "2022-08-08"),
             ("2022-08-22", "person1", 1, "2022-08-08"),
             ("2022-08-29", "person1", 1, "2022-08-08"))
.toDF("week", "person", "person_active_flag", "chain_beginning")
.orderBy($"week")


But I have not been able to do it. I have tried some variations of the code below, but it doesn't give me the right answer. Can someone show me how to do this, please?

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy($"person").orderBy($"week".asc)

df_beginning
  .withColumn("beginning_chain",
    when($"person_active_flag" === 1 && (lag($"person_active_flag", 1).over(w) === 0 || lag($"person_active_flag", 1).over(w).isNull), 1).otherwise(0)
  )
  .withColumn("first_week", when($"beginning_chain" === 1, $"week"))
  .withColumn("beginning_chain_week",
    when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w).isNull, $"first_week")
      .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 0, $"first_week")
      .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, lag($"first_week", 1).over(w))
//    .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, "test")
      .otherwise(0)
  )
  .show()


  • Use the lag function to add a helper column switch_flag that shows when the flag changed from the previous week
  • Then mark week_beginning only for the rows where it switched from 0 to 1
  • Finally, use last(col, ignoreNulls = true) to extend week_beginning to all rows where the person is active

Final query:

val window = Window.partitionBy($"person").orderBy($"week")
df_beginning
  // +1 when the flag switches 0 -> 1, -1 when it switches 1 -> 0, 0 otherwise
  .withColumn("switch_flag", $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(window), lit(0)))
  // record the week only on the row where a new active chain starts
  .withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
  // carry that week forward across the rest of the chain
  .withColumn("week_beginning", when($"person_active_flag" === 1, last($"week_beginning_ind", true).over(window)))
  .show

+----------+-------+------------------+-----------+------------------+--------------+
|      week| person|person_active_flag|switch_flag|week_beginning_ind|week_beginning|
+----------+-------+------------------+-----------+------------------+--------------+
|2022-06-06|person1|                 1|          1|        2022-06-06|    2022-06-06|
|2022-06-13|person1|                 1|          0|              null|    2022-06-06|
|2022-06-20|person1|                 1|          0|              null|    2022-06-06|
|2022-06-27|person1|                 0|         -1|              null|          null|
|2022-07-04|person1|                 0|          0|              null|          null|
|2022-07-11|person1|                 1|          1|        2022-07-11|    2022-07-11|
|2022-07-18|person1|                 1|          0|              null|    2022-07-11|
|2022-07-25|person1|                 0|         -1|              null|          null|
|2022-08-01|person1|                 0|          0|              null|          null|
|2022-08-08|person1|                 1|          1|        2022-08-08|    2022-08-08|
|2022-08-15|person1|                 1|          0|              null|    2022-08-08|
|2022-08-22|person1|                 1|          0|              null|    2022-08-08|
|2022-08-29|person1|                 1|          0|              null|    2022-08-08|
+----------+-------+------------------+-----------+------------------+--------------+
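Two notes on the query above. First, it leaves week_beginning as null on inactive weeks, whereas the expected output in the question shows the string "0"; wrapping the result in coalesce handles that. Second, last(col, ignoreNulls = true) depends on the window's running frame, which Spark applies by default when orderBy is set without an explicit frame. A minimal sketch that makes both points explicit; it reuses df_beginning and the imports from above, and the column name chain_beginning is taken from the question's expected output:

// Ordering-only window: lag supplies its own one-row offset frame.
val w2 = Window.partitionBy($"person").orderBy($"week")

// Explicit running frame for last(): start of the partition up to the
// current row. (Spark's default with an orderBy is the equivalent RANGE
// frame; since the weeks are distinct within each partition, the two
// behave the same here.)
val running = w2.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_beginning
  .withColumn("switch_flag",
    $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(w2), lit(0)))
  .withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
  .withColumn("chain_beginning",
    coalesce(
      when($"person_active_flag" === 1,
        last($"week_beginning_ind", ignoreNulls = true).over(running)),
      lit("0")))  // "0" placeholder as in the question's expected output
  .drop("switch_flag", "week_beginning_ind")
  .show()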
