Using lag function in Spark Scala to bring values from another column

I have a dataframe such as the following, but with several different items in the column "person".

// Assuming a SparkSession named `spark` (as in spark-shell);
// its implicits provide .toDF and the $"..." column syntax.
import spark.implicits._

val df_beginning = Seq(("2022-06-06", "person1", 1),
             ("2022-06-13", "person1", 1),
             ("2022-06-20", "person1", 1),
             ("2022-06-27", "person1", 0),
             ("2022-07-04", "person1", 0),
             ("2022-07-11", "person1", 1),
             ("2022-07-18", "person1", 1),
             ("2022-07-25", "person1", 0),
             ("2022-08-01", "person1", 0),
             ("2022-08-08", "person1", 1),
             ("2022-08-15", "person1", 1),
             ("2022-08-22", "person1", 1),
             ("2022-08-29", "person1", 1))
.toDF("week", "person", "person_active_flag")
.orderBy($"week")


I want to create a new column containing the week in which the current chain of person_active_flag values of 1 started. In the end, it would look something like this:

val df_beginning = Seq(("2022-06-06", "person1", 1, "2022-06-06"),
             ("2022-06-13", "person1", 1, "2022-06-06"),
             ("2022-06-20", "person1", 1, "2022-06-06"),
             ("2022-06-27", "person1", 0, "0"),
             ("2022-07-04", "person1", 0, "0"),
             ("2022-07-11", "person1", 1, "2022-07-11"),
             ("2022-07-18", "person1", 1, "2022-07-11"),
             ("2022-07-25", "person1", 0, "0"),
             ("2022-08-01", "person1", 0, "0"),
             ("2022-08-08", "person1", 1, "2022-08-08"),
             ("2022-08-15", "person1", 1, "2022-08-08"),
             ("2022-08-22", "person1", 1, "2022-08-08"),
             ("2022-08-29", "person1", 1, "2022-08-08"))
.toDF("week", "person", "person_active_flag", "chain_beginning")
.orderBy($"week")


But I have not been able to do it. I have tried some variations of the code below, but it doesn't give me the right answer. Can someone show me how to do this, please?

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy($"person").orderBy($"week".asc)

df_beginning
  .withColumn("beginning_chain",
    when($"person_active_flag" === 1 && (lag($"person_active_flag", 1).over(w) === 0 || lag($"person_active_flag", 1).over(w).isNull), 1).otherwise(0)
  )
  .withColumn("first_week", when($"beginning_chain" === 1, $"week"))
  .withColumn("beginning_chain_week",
    when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w).isNull, $"first_week")
      .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 0, $"first_week")
      .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, lag($"first_week", 1).over(w))
//    .when($"person_active_flag" === 1 && lag($"person_active_flag", 1).over(w) === 1, "test")
      .otherwise(0)
  )
  .show()


  • Use the lag function to add a helper column switch_flag that shows when the flag changed from the previous week
  • Then mark week_beginning only for the rows where it switched from 0 to 1
  • Finally, use last(col, ignoreNulls = true) to extend week_beginning to all rows where the person is active

Final query:

val window = Window.partitionBy($"person").orderBy($"week")
df_beginning
  // +1 when the flag switches 0 -> 1, -1 when it switches 1 -> 0, 0 otherwise
  .withColumn("switch_flag", $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(window), lit(0)))
  // record the week only on the row where a new active chain starts
  .withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
  // carry that week forward across the rest of the chain
  .withColumn("week_beginning", when($"person_active_flag" === 1, last($"week_beginning_ind", true).over(window)))
  .show

+----------+-------+------------------+-----------+------------------+--------------+
|      week| person|person_active_flag|switch_flag|week_beginning_ind|week_beginning|
+----------+-------+------------------+-----------+------------------+--------------+
|2022-06-06|person1|                 1|          1|        2022-06-06|    2022-06-06|
|2022-06-13|person1|                 1|          0|              null|    2022-06-06|
|2022-06-20|person1|                 1|          0|              null|    2022-06-06|
|2022-06-27|person1|                 0|         -1|              null|          null|
|2022-07-04|person1|                 0|          0|              null|          null|
|2022-07-11|person1|                 1|          1|        2022-07-11|    2022-07-11|
|2022-07-18|person1|                 1|          0|              null|    2022-07-11|
|2022-07-25|person1|                 0|         -1|              null|          null|
|2022-08-01|person1|                 0|          0|              null|          null|
|2022-08-08|person1|                 1|          1|        2022-08-08|    2022-08-08|
|2022-08-15|person1|                 1|          0|              null|    2022-08-08|
|2022-08-22|person1|                 1|          0|              null|    2022-08-08|
|2022-08-29|person1|                 1|          0|              null|    2022-08-08|
+----------+-------+------------------+-----------+------------------+--------------+
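Two notes on the query above. First, it leaves week_beginning as null on inactive weeks, whereas the expected output in the question shows the string "0"; wrapping the result in coalesce handles that. Second, last(col, ignoreNulls = true) depends on the window's running frame, which Spark applies by default when orderBy is set without an explicit frame. A minimal sketch that makes both points explicit; it reuses df_beginning and the imports from above, and the column name chain_beginning is taken from the question's expected output:

// Ordering-only window: lag supplies its own one-row offset frame.
val w2 = Window.partitionBy($"person").orderBy($"week")

// Explicit running frame for last(): start of the partition up to the
// current row. (Spark's default with an orderBy is the equivalent RANGE
// frame; since the weeks are distinct within each partition, the two
// behave the same here.)
val running = w2.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df_beginning
  .withColumn("switch_flag",
    $"person_active_flag" - coalesce(lag($"person_active_flag", 1).over(w2), lit(0)))
  .withColumn("week_beginning_ind", when($"switch_flag" === 1, $"week"))
  .withColumn("chain_beginning",
    coalesce(
      when($"person_active_flag" === 1,
        last($"week_beginning_ind", ignoreNulls = true).over(running)),
      lit("0")))  // "0" placeholder as in the question's expected output
  .drop("switch_flag", "week_beginning_ind")
  .show()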
