
Time Series data

We have two data sets, and I want to generate a result table from them. How can we produce the resultant data set using PySpark or Spark + Scala?

The data comes from a log file, and I want a result with a period_state column plus two columns showing the start date and end date of each period.

Failed

+-------------------+
| fail_date         |
+-------------------+
| 2018-12-28        |
| 2018-12-29        |
| 2019-01-04        |
| 2019-01-05        |
+-------------------+

Succeeded

+-------------------+
| success_date      | 
+-------------------+
| 2018-12-30        |
| 2018-12-31        |
| 2019-01-01        |
| 2019-01-02        |
| 2019-01-03        |
| 2019-01-06        |
+-------------------+

Result table:

+--------------+--------------+--------------+
| period_state | start_date   | end_date     |
+--------------+--------------+--------------+
| succeeded    | 2019-01-01   | 2019-01-03   |
| failed       | 2019-01-04   | 2019-01-05   |
| succeeded    | 2019-01-06   | 2019-01-06   |
+--------------+--------------+--------------+

No need for UDFs. Just use window functions as follows:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Tag each source with a state flag (0 = failed, 1 = succeeded), stack them,
# and rename the columns positionally
df = fail.withColumn('state', F.lit(0)).union(
    success.withColumn('state', F.lit(1))
).toDF('date', 'state')

df.show()
+----------+-----+
|      date|state|
+----------+-----+
|2018-12-28|    0|
|2018-12-29|    0|
|2019-01-04|    0|
|2019-01-05|    0|
|2018-12-30|    1|
|2018-12-31|    1|
|2019-01-01|    1|
|2019-01-02|    1|
|2019-01-03|    1|
|2019-01-06|    1|
+----------+-----+
df2 = df.withColumn(
    'begin',
    # True on the first row of each run: the previous row's state differs
    # (lag is null on the very first row, so coalesce defaults it to True)
    F.coalesce(
        F.lag('state').over(Window.orderBy('date')) != F.col('state'), 
        F.lit(True)
    )
).withColumn(
    'end',
    # True on the last row of each run: the next row's state differs
    # (lead is null on the very last row, so coalesce defaults it to True)
    F.coalesce(
        F.lead('state').over(Window.orderBy('date')) != F.col('state'), 
        F.lit(True)
    )
).withColumn(
    'last_change_date',
    # carry the most recent run-start date forward to every row of the run
    F.last(
        F.when(F.col('begin'), F.col('date')), ignorenulls=True
    ).over(Window.orderBy('date'))
).filter(
    'end = true'   # keep only the closing row of each run
).select(
    F.when(
        F.col('state') == 1,
        F.lit('succeeded')
    ).otherwise(
        F.lit('failed')
    ).alias('period_state'),
    F.col('last_change_date').alias('start_date'), 
    F.col('date').alias('end_date')
)
df2.show()
+------------+----------+----------+
|period_state|start_date|  end_date|
+------------+----------+----------+
|      failed|2018-12-28|2018-12-29|
|   succeeded|2018-12-30|2019-01-03|
|      failed|2019-01-04|2019-01-05|
|   succeeded|2019-01-06|2019-01-06|
+------------+----------+----------+

If you are interested in the intermediate results:

+----------+-----+-----+-----+----------------+
|      date|state|begin|  end|last_change_date|
+----------+-----+-----+-----+----------------+
|2018-12-28|    0| true|false|      2018-12-28|
|2018-12-29|    0|false| true|      2018-12-28|
|2018-12-30|    1| true|false|      2018-12-30|
|2018-12-31|    1|false|false|      2018-12-30|
|2019-01-01|    1|false|false|      2018-12-30|
|2019-01-02|    1|false|false|      2018-12-30|
|2019-01-03|    1|false| true|      2018-12-30|
|2019-01-04|    0| true|false|      2019-01-04|
|2019-01-05|    0|false| true|      2019-01-04|
|2019-01-06|    1| true| true|      2019-01-06|
+----------+-----+-----+-----+----------------+
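The begin/end/last_change_date logic above can also be checked in plain Python. This is an illustrative sketch of the same run-detection idea, not Spark code; the function name periods is made up for this example.

```python
def periods(rows):
    """rows: list of (date, state) tuples sorted by date.
    Returns a list of (state, start_date, end_date) runs."""
    out = []
    for i, (date, state) in enumerate(rows):
        # mirrors the lag() comparison: run starts here
        begin = i == 0 or rows[i - 1][1] != state
        # mirrors the lead() comparison: run ends here
        end = i == len(rows) - 1 or rows[i + 1][1] != state
        if begin:
            last_change = date  # always set first, since row 0 is a begin
        if end:
            out.append((state, last_change, date))
    return out
```

Running it on the combined sample rows yields the same four periods as df2: (0, 2018-12-28, 2018-12-29), (1, 2018-12-30, 2019-01-03), (0, 2019-01-04, 2019-01-05), (1, 2019-01-06, 2019-01-06).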
