
Spark Window Function to build timeline

I am trying to build a timeline, and I want to be able to detect discontinuities in it. I have this test df:

ID  date
1   2012-12-01
1   2012-12-02
1   2012-12-03
1   2012-12-05
1   2012-12-06
1   2012-12-07
1   2012-12-10
1   2012-12-11

And I would like to get a timeline with start and end dates like this:

ID  date        end
1   2012-12-01  2012-12-03
1   2012-12-05  2012-12-07
1   2012-12-10  2012-12-11

I've been trying with:

from pyspark.sql import functions as F
from pyspark.sql import Window

columns = ['id', 'snapshot_date']
data = [
    ('1', '2012-12-01'),
    ('1', '2012-12-02'),
    ('1', '2012-12-03'),
    ('1', '2012-12-05'),
    ('1', '2012-12-06'),
    ('1', '2012-12-07'),
    ('1', '2012-12-10'),
    ('1', '2012-12-11')]

dftest = spark.createDataFrame(data).toDF(*columns)

w1 = Window.partitionBy('id').orderBy(F.col('snapshot_date'))

# flag rows whose date is not exactly one day after the previous row's date
df2 = (dftest
       .withColumn("group_date",
                   F.when(~(F.date_add(F.col('snapshot_date'), -1)
                            == F.lag(F.col('snapshot_date'), 1, 0).over(w1)),
                          F.lit(1)).otherwise(F.lit(0)))
       .filter(F.col('group_date') > 0))

But I am not sure how to get the correct end date.

This is a case of sessionization; you can learn more about sessionization with Spark in this article.

If we adapt the window-based solution from the article cited above to your specific case, we get the following code:

from pyspark.sql import functions as F
from pyspark.sql import Window

columns = ['id','snapshot_date']
data = [
('1','2012-12-01'),
('1','2012-12-02'), 
('1','2012-12-03'),
('1','2012-12-05'),
('1','2012-12-06'),
('1','2012-12-07'),
('1','2012-12-10'),
('1','2012-12-11')]

dftest = spark.createDataFrame(data).toDF(*columns)

w1 = Window.partitionBy('id').orderBy('snapshot_date')

df2 = dftest \
  .withColumn('session_change', F.when(F.datediff(F.col('snapshot_date'), F.lag('snapshot_date').over(w1)) > 1, F.lit(1)).otherwise(F.lit(0))) \
  .withColumn('session_id', F.sum('session_change').over(w1)) \
  .groupBy('ID', 'session_id') \
  .agg(F.min('snapshot_date').alias('date'), F.max('snapshot_date').alias('end')) \
  .drop('session_id')

That will give us the following df2:

+---+----------+----------+
|ID |date      |end       |
+---+----------+----------+
|1  |2012-12-01|2012-12-03|
|1  |2012-12-05|2012-12-07|
|1  |2012-12-10|2012-12-11|
+---+----------+----------+
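
If you prefer to keep every original row and simply attach the start and end of its session instead of collapsing to one row per session, you can replace the groupBy with a second window over the session. This is only a minimal sketch built on the same dftest, w1, session_change and session_id as above; the names w2 and df3 are introduced here for illustration:

# a window spanning a whole (id, session_id) group, so min/max see the full session
w2 = Window.partitionBy('id', 'session_id')

df3 = dftest \
    .withColumn('session_change',
                F.when(F.datediff(F.col('snapshot_date'),
                                  F.lag('snapshot_date').over(w1)) > 1,
                       F.lit(1)).otherwise(F.lit(0))) \
    .withColumn('session_id', F.sum('session_change').over(w1)) \
    .withColumn('date', F.min('snapshot_date').over(w2)) \
    .withColumn('end', F.max('snapshot_date').over(w2))

Here each row keeps its snapshot_date and additionally carries the date/end boundaries of the continuous run it belongs to.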
