Using lag function in Spark Scala to bring values from another column
I have a dataframe like the following, but with several different values in the column "person".
val df_beginning = Seq(("2022-06-06", "person1", 1),
("2022-06-13", "person1", 1),
("2022-06-20", "person1", 1),
("2022-06-27", "person1", 0),
("2022-07-04", "person1", 0),
("2022-07-11", "person1", 1),
("2022-07-18", "person1", 1),
("2022-07-25", "person1", 0),
("2022-08-01", "person1", 0),
("2022-08-08", "person1", 1),
("2022-08-15", "person1", 1),
("2022-08-22", "person1", 1),
("2022-08-29", "person1", 1))
.toDF("week", "person", "person_active_flag")
.orderBy($"week")
I want to create a new column that holds the week in which the current chain of person_active_flag values equal to 1 started. In the end, it would look something like this:
val df_expected = Seq(("2022-06-06", "person1", 1, "2022-06-06"),
("2022-06-13", "person1", 1, "2022-06-06"),
("2022-06-20", "person1", 1, "2022-06-06"),
("2022-06-27", "person1", 0, "0"),
("2022-07-04", "person1", 0, "0"),
("2022-07-11", "person1", 1, "2022-07-11"),
("2022-07-18", "person1", 1, "2022-07-11"),
("2022-07-25", "person1", 0, "0"),
("2022-08-01", "person1", 0, "0"),
("2022-08-08", "person1", 1, "2022-08-08"),
("2022-08-15", "person1", 1, "2022-08-08"),
("2022-08-22", "person1", 1, "2022-08-08"),
("2022-08-29", "person1", 1, "2022-08-08"))
.toDF("week", "person", "person_active_flag", "chain_beginning")
.orderBy($"week")
But I have not been able to do it. I have tried some variations of the code below, but it doesn't give me the right answer. Can someone show me how to do this, please?
val w = Window.partitionBy($"person").orderBy($"week".asc)
df_beginning
.withColumn("beginning_chain",
when($"person_active_flag" === 1 && (lag($"person_active_flag", 1).over(w) === 0 || lag($"person_active_flag", 1).over(w).isNull), 1).otherwise(0)
)
.withColumn("first_week", when($"beginning_chain" === 1, $"week"))
.withColumn("beginning_chain_week",
when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w).isNull, $"first_week")
.when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 0, $"first_week")
.when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, lag($"first_week", 1).over(w))
// .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, "test")
.otherwise(0)
)
.d
Use the lag function to add a helper column switch_flag that shows when the flag changed from the previous week.
Populate the week beginning (week_beginning_ind in the query below) only for rows where the flag switched from 0 to 1.
Use last(col, ignoreNulls = true) to extend week_beginning to all rows where the person is active.
Final query:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy($"person").orderBy($"week")
df_beginning
.withColumn("switch_flag", $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(window), lit(0)))
.withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
.withColumn("week_beginning", when($"person_active_flag" === 1, last($"week_beginning_ind", true).over(window)))
.show
+----------+-------+------------------+-----------+------------------+--------------+
| week| person|person_active_flag|switch_flag|week_beginning_ind|week_beginning|
+----------+-------+------------------+-----------+------------------+--------------+
|2022-06-06|person1| 1| 1| 2022-06-06| 2022-06-06|
|2022-06-13|person1| 1| 0| null| 2022-06-06|
|2022-06-20|person1| 1| 0| null| 2022-06-06|
|2022-06-27|person1| 0| -1| null| null|
|2022-07-04|person1| 0| 0| null| null|
|2022-07-11|person1| 1| 1| 2022-07-11| 2022-07-11|
|2022-07-18|person1| 1| 0| null| 2022-07-11|
|2022-07-25|person1| 0| -1| null| null|
|2022-08-01|person1| 0| 0| null| null|
|2022-08-08|person1| 1| 1| 2022-08-08| 2022-08-08|
|2022-08-15|person1| 1| 0| null| 2022-08-08|
|2022-08-22|person1| 1| 0| null| 2022-08-08|
|2022-08-29|person1| 1| 0| null| 2022-08-08|
+----------+-------+------------------+-----------+------------------+--------------+
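If the exact shape from the question is needed (a single chain_beginning column holding "0" on inactive rows instead of null, and no helper columns), the query above can be adapted along the following lines. This is only a minimal sketch: the local SparkSession, the shortened sample data, and the helper column name chain_start are assumptions made for illustration, not part of the original answer. It also relies on week being an ISO-formatted string, so that ordering by it is chronological.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Local session for illustration only; spark-shell and notebooks already provide `spark`.
val spark = SparkSession.builder().master("local[1]").appName("chain-beginning").getOrCreate()
import spark.implicits._

// Shortened, hypothetical copy of the question's data.
val df = Seq(
  ("2022-06-06", "person1", 1),
  ("2022-06-13", "person1", 1),
  ("2022-06-20", "person1", 0),
  ("2022-06-27", "person1", 1),
  ("2022-07-04", "person1", 1)
).toDF("week", "person", "person_active_flag")

val w = Window.partitionBy($"person").orderBy($"week")

val result = df
  // Week of a 0 -> 1 switch (a missing previous row counts as 0), null otherwise.
  .withColumn("chain_start",
    when($"person_active_flag" === 1 &&
      coalesce(lag($"person_active_flag", 1).over(w), lit(0)) === 0, $"week"))
  // Carry the most recent non-null chain start forward; inactive rows get the string "0".
  .withColumn("chain_beginning",
    when($"person_active_flag" === 1, last($"chain_start", true).over(w))
      .otherwise(lit("0")))
  .drop("chain_start")

result.orderBy($"person", $"week").show()
```

Because the window is ordered, its default frame ends at the current row, so last with ignoreNulls = true only ever picks up a chain start from the current or an earlier week within the same person.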