We have two datasets and I want to generate a result dataset from them. How can we generate it using PySpark or Spark with Scala?
The data comes from a log file. I want a result with a period_state column plus two columns showing the start date and end date of each period.
Failed:
+--------------+
| fail_date    |
+--------------+
| 2018-12-28   |
| 2018-12-29   |
| 2019-01-04   |
| 2019-01-05   |
+--------------+
Succeeded:
+--------------+
| success_date |
+--------------+
| 2018-12-30   |
| 2018-12-31   |
| 2019-01-01   |
| 2019-01-02   |
| 2019-01-03   |
| 2019-01-06   |
+--------------+
Result table:
+--------------+--------------+--------------+
| period_state | start_date   | end_date     |
+--------------+--------------+--------------+
| succeeded    | 2019-01-01   | 2019-01-03   |
| failed       | 2019-01-04   | 2019-01-05   |
| succeeded    | 2019-01-06   | 2019-01-06   |
+--------------+--------------+--------------+
No need for UDFs. Just use window functions as follows:
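import pyspark.sql.functions as F
from pyspark.sql.window import Window

# `fail` and `success` are the two single-column input DataFrames from the question.
# Tag each row with a numeric state (0 = failed, 1 = succeeded), then stack them.
df = fail.withColumn('state', F.lit(0)).union(
    success.withColumn('state', F.lit(1))
).toDF('date', 'state')
df.show()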
+----------+-----+
|      date|state|
+----------+-----+
|2018-12-28|    0|
|2018-12-29|    0|
|2019-01-04|    0|
|2019-01-05|    0|
|2018-12-30|    1|
|2018-12-31|    1|
|2019-01-01|    1|
|2019-01-02|    1|
|2019-01-03|    1|
|2019-01-06|    1|
+----------+-----+
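# One global ordering by date. With no partitionBy, Spark moves all rows to a single partition.
w = Window.orderBy('date')

df2 = df.withColumn(
    # True on the first row of a run: the previous state differs, or there is no previous row.
    'begin',
    F.coalesce(F.lag('state').over(w) != F.col('state'), F.lit(True))
).withColumn(
    # True on the last row of a run: the next state differs, or there is no next row.
    'end',
    F.coalesce(F.lead('state').over(w) != F.col('state'), F.lit(True))
).withColumn(
    # Carry the most recent 'begin' date forward; this is the start of the current run.
    'last_change_date',
    F.last(
        F.when(F.col('begin'), F.col('date')), ignorenulls=True
    ).over(w)
).filter(
    # Keep only the closing row of each run. A column expression is used instead of
    # the string 'end = true' so the SQL keyword `end` never has to be parsed.
    F.col('end')
).select(
    F.when(
        F.col('state') == 1,
        F.lit('succeeded')
    ).otherwise(
        F.lit('failed')
    ).alias('period_state'),
    F.col('last_change_date').alias('start_date'),
    F.col('date').alias('end_date')
)
df2.show()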
+------------+----------+----------+
|period_state|start_date|  end_date|
+------------+----------+----------+
|      failed|2018-12-28|2018-12-29|
|   succeeded|2018-12-30|2019-01-03|
|      failed|2019-01-04|2019-01-05|
|   succeeded|2019-01-06|2019-01-06|
+------------+----------+----------+
If you are interested in the intermediate results:
+----------+-----+-----+-----+----------------+
|      date|state|begin|  end|last_change_date|
+----------+-----+-----+-----+----------------+
|2018-12-28|    0| true|false|      2018-12-28|
|2018-12-29|    0|false| true|      2018-12-28|
|2018-12-30|    1| true|false|      2018-12-30|
|2018-12-31|    1|false|false|      2018-12-30|
|2019-01-01|    1|false|false|      2018-12-30|
|2019-01-02|    1|false|false|      2018-12-30|
|2019-01-03|    1|false| true|      2018-12-30|
|2019-01-04|    0| true|false|      2019-01-04|
|2019-01-05|    0|false| true|      2019-01-04|
|2019-01-06|    1| true| true|      2019-01-06|
+----------+-----+-----+-----+----------------+
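If it helps to check the logic without a Spark session, the begin / end / last_change_date columns above can be traced in plain Python. This is only a rough sketch of the same idea, not a replacement for the window-function code: the names `periods`, `fail_dates` and `success_dates` are made up for illustration, and it assumes the dates are ISO-formatted strings so that string sorting matches date order.

```python
# Sample data copied from the two input tables in the question.
fail_dates = ["2018-12-28", "2018-12-29", "2019-01-04", "2019-01-05"]
success_dates = ["2018-12-30", "2018-12-31", "2019-01-01",
                 "2019-01-02", "2019-01-03", "2019-01-06"]


def periods(fail_dates, success_dates):
    """Return (period_state, start_date, end_date) for each consecutive run of states."""
    # Tag each date with its state (0 = failed, 1 = succeeded) and sort by date,
    # mirroring the union followed by Window.orderBy('date').
    rows = sorted([(d, 0) for d in fail_dates] + [(d, 1) for d in success_dates])

    result = []
    start = None  # plays the role of last_change_date
    for i, (date, state) in enumerate(rows):
        # lag / lead: None stands in for the null Spark returns at the edges.
        prev_state = rows[i - 1][1] if i > 0 else None
        next_state = rows[i + 1][1] if i + 1 < len(rows) else None

        begin = prev_state != state  # also True on the very first row
        end = next_state != state    # also True on the very last row

        if begin:
            start = date
        if end:
            label = "succeeded" if state == 1 else "failed"
            result.append((label, start, date))
    return result


for row in periods(fail_dates, success_dates):
    print(row)
```

On the sample data this yields the same four periods as df2.show(), one tuple per row.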