
Spark Window Function to build timeline

I am trying to build a timeline, and I want to be able to detect discontinuities in it. I have this test df:

ID  date
1   2012-12-01
1   2012-12-02
1   2012-12-03
1   2012-12-05
1   2012-12-06
1   2012-12-07
1   2012-12-10
1   2012-12-11

And I would like to get a timeline with start and end dates like this:

ID  date        end
1   2012-12-01  2012-12-03
1   2012-12-05  2012-12-07
1   2012-12-10  2012-12-11

I've been trying with:

from pyspark.sql import functions as F
from pyspark.sql import Window

columns = ['id', 'snapshot_date']
data = [
    ('1', '2012-12-01'),
    ('1', '2012-12-02'),
    ('1', '2012-12-03'),
    ('1', '2012-12-05'),
    ('1', '2012-12-06'),
    ('1', '2012-12-07'),
    ('1', '2012-12-10'),
    ('1', '2012-12-11')]

dftest = spark.createDataFrame(data).toDF(*columns)

w1 = Window.partitionBy('id').orderBy(F.col('snapshot_date'))

# flag rows whose date is not exactly one day after the previous row's date
df2 = (dftest
       .withColumn("group_date",
                   F.when(~(F.date_add(F.col('snapshot_date'), -1)
                            == F.lag(F.col('snapshot_date'), 1, 0).over(w1)),
                          F.lit(1)).otherwise(F.lit(0)))
       .filter(F.col('group_date') > 0))

But I am not sure how to get the correct end date.

This is a case of sessionization; you can learn more about sessionization with Spark in this article.

If we adapt the window-based solution from the article cited above to your specific case, we get the following code:

from pyspark.sql import functions as F
from pyspark.sql import Window

columns = ['id','snapshot_date']
data = [
('1','2012-12-01'),
('1','2012-12-02'), 
('1','2012-12-03'),
('1','2012-12-05'),
('1','2012-12-06'),
('1','2012-12-07'),
('1','2012-12-10'),
('1','2012-12-11')]

dftest = spark.createDataFrame(data).toDF(*columns)

w1 = Window.partitionBy('id').orderBy('snapshot_date')

df2 = dftest \
  .withColumn('session_change', F.when(F.datediff(F.col('snapshot_date'), F.lag('snapshot_date').over(w1)) > 1, F.lit(1)).otherwise(F.lit(0))) \
  .withColumn('session_id', F.sum('session_change').over(w1)) \
  .groupBy('ID', 'session_id') \
  .agg(F.min('snapshot_date').alias('date'), F.max('snapshot_date').alias('end')) \
  .drop('session_id')

That will give us the following df2:

+---+----------+----------+
|ID |date      |end       |
+---+----------+----------+
|1  |2012-12-01|2012-12-03|
|1  |2012-12-05|2012-12-07|
|1  |2012-12-10|2012-12-11|
+---+----------+----------+
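
If you prefer to keep every original row and simply attach the start and end of its session instead of collapsing to one row per session, you can replace the groupBy with a second window over the session. This is only a minimal sketch built on the same dftest, w1, session_change and session_id as above; the names w2 and df3 are introduced here for illustration:

# a window spanning a whole (id, session_id) group, so min/max see the full session
w2 = Window.partitionBy('id', 'session_id')

df3 = dftest \
    .withColumn('session_change',
                F.when(F.datediff(F.col('snapshot_date'),
                                  F.lag('snapshot_date').over(w1)) > 1,
                       F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('session_id', F.sum('session_change').over(w1)) \
    .withColumn('date', F.min('snapshot_date').over(w2)) \
    .withColumn('end', F.max('snapshot_date').over(w2))

Here each row keeps its snapshot_date and additionally carries the date/end boundaries of the continuous run it belongs to.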
