
In Spark scala, how to check between adjacent rows in a dataframe

How can I check the dates of adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.

I have the following data after sorting on key and dates:

source_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  |  ABC   | 2018-01-01 | 2018-01-08 |
| 10  |  BAC   | 2018-01-03 | 2018-01-15 |
| 10  |  CAS   | 2018-01-03 | 2018-01-21 |
| 20  |  AAA   | 2017-11-12 | 2018-01-03 |
| 20  |  DAS   | 2018-01-01 | 2018-01-12 |
| 20  |  EDS   | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+

When the dates of these rows overlap (ie the current row's begin_dt falls between the begin and end dates of the previous row), I need all such rows to carry the lowest begin date and the highest end date. Here is the output I need:

final_Df.show()
+-----+--------+------------+------------+
| key | code   | begin_dt   | end_dt     |
+-----+--------+------------+------------+
| 10  |  ABC   | 2018-01-01 | 2018-01-21 |
| 10  |  BAC   | 2018-01-01 | 2018-01-21 |
| 10  |  CAS   | 2018-01-01 | 2018-01-21 |
| 20  |  AAA   | 2017-11-12 | 2018-01-12 |
| 20  |  DAS   | 2017-11-12 | 2018-01-12 |
| 20  |  EDS   | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+

Would appreciate any ideas to achieve this. Thanks in advance!

Here's one approach:

  1. Create a new column group_id with a null value if begin_dt is within the date range of the previous row; otherwise assign a unique integer
  2. Backfill the nulls in group_id with the last non-null value
  3. Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition

Example below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  (10, "ABC", "2018-01-01", "2018-01-08"),
  (10, "BAC", "2018-01-03", "2018-01-15"),
  (10, "CAS", "2018-01-03", "2018-01-21"),
  (20, "AAA", "2017-11-12", "2018-01-03"),
  (20, "DAS", "2018-01-01", "2018-01-12"),
  (20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")

val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")

df.
  // Step 1: null group_id when begin_dt falls within the previous row's
  // [begin_dt, end_dt] range; otherwise a unique integer
  withColumn("group_id", when(
      $"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
    ).otherwise(monotonically_increasing_id)
  ).
  // Step 2: backfill nulls with the last non-null group_id seen so far
  withColumn("group_id", last($"group_id", ignoreNulls=true).
      over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  // Step 3: min begin / max end within each (key, group_id) partition
  withColumn("begin_dt2", min($"begin_dt").over(win2)).
  withColumn("end_dt2", max($"end_dt").over(win2)).
  orderBy("key", "begin_dt", "end_dt").
  show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code|  begin_dt|    end_dt|     group_id| begin_dt2|   end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+
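As a sanity check, the same merging idea can be sketched in plain Scala without Spark. `Row` and `mergeRanges` are hypothetical names for illustration only, and the grouping rule here uses a running maximum end date of the open group rather than only the immediately previous row's range:

```scala
// Plain-Scala sketch (no Spark) of the per-key date-range merging.
// Row and mergeRanges are illustrative names, not part of the Spark solution.
case class Row(key: Int, code: String, begin: String, end: String)

val rows = List(
  Row(10, "ABC", "2018-01-01", "2018-01-08"),
  Row(10, "BAC", "2018-01-03", "2018-01-15"),
  Row(10, "CAS", "2018-01-03", "2018-01-21"),
  Row(20, "AAA", "2017-11-12", "2018-01-03"),
  Row(20, "DAS", "2018-01-01", "2018-01-12"),
  Row(20, "EDS", "2018-02-01", "2018-02-16")
)

// Per key, sort by (begin, end) and start a new group whenever the current
// begin falls after the running maximum end date of the open group.
// ISO-8601 date strings compare correctly as plain strings.
def mergeRanges(rows: List[Row]): Map[String, (String, String)] =
  rows.groupBy(_.key).values.flatMap { rs =>
    val sorted = rs.sortBy(r => (r.begin, r.end))
    val groups = sorted.foldLeft(List.empty[List[Row]]) {
      // current row overlaps the open group: extend it
      case (g :: rest, r) if r.begin <= g.map(_.end).max => (r :: g) :: rest
      // otherwise open a new group
      case (acc, r) => List(r) :: acc
    }
    // map every code in a group to that group's (min begin, max end)
    groups.flatMap { g =>
      val (lo, hi) = (g.map(_.begin).min, g.map(_.end).max)
      g.map(r => r.code -> (lo, hi))
    }
  }.toMap
```

On the sample data this reproduces final_Df. The running-max rule also covers chains where a row overlaps the group's accumulated range but not the single previous row, a case the `lag`-based condition above would split into separate groups.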
