Pyspark Dataframe substract rows from the same column?

Question

I have the following example as a pyspark dataframe:

Timeframe	Person	Activity
2022-06-21 8:00:00	Lisa	Working
2022-06-21 8:30:00	Joseph	Homework
2022-06-21 8:00:00	Michael	Gardening
2022-06-21 9:00:00	Joseph	Rowing
2022-06-21 9:00:00	Lisa	Working
2022-06-21 9:15:00	Joseph	Football
2022-06-21 10:00:00	Joseph	Dancing
2022-06-21 10:00:00	Lisa	Watering
2022-06-21 10:30:00	Joseph	Gaming

I would like to calculate how long each activity for each person lasted. For example create a new column like this:

Timeframe	Person	Activity	Duration
2022-06-21 8:00:00	Lisa	Working	01:00:00
2022-06-21 8:30:00	Joseph	Homework	00:30:00
2022-06-21 8:00:00	Michael	Gardening	01:15:00
2022-06-21 9:00:00	Joseph	Rowing	01:00:00
2022-06-21 9:00:00	Lisa	Working	01:00:00
2022-06-21 9:15:00	Michael	Football	01:45:00
2022-06-21 10:00:00	Joseph	Dancing	N/A
2022-06-21 10:00:00	Lisa	Watering	N/A
2022-06-21 10:30:00	Michael	Gaming	N/A

I need to substract the Timeframe row for each person separatly and create a new column. There is no pause in between. How can it be done in Pyspark or alternatively in Pandas?

Thanks!

Answer 1

We can calculate the time difference in seconds and convert it to the required format.

Using a subset of your data for example.

data_ls = [
    ('2022-06-21 8:00:00', 'Lisa', 'Working'),
    ('2022-06-21 8:30:00', 'Joe', 'HW'),
    ('2022-06-21 8:00:00', 'Mike', 'Gardening'),
    ('2022-06-21 9:00:00', 'Joe', 'Rowing'),
    ('2022-06-21 9:00:00', 'Lisa', 'Working')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
    withColumn('ts', func.col('ts').cast('timestamp'))

# +-------------------+----+---------+
# |                 ts|name| activity|
# +-------------------+----+---------+
# |2022-06-21 08:00:00|Lisa|  Working|
# |2022-06-21 08:30:00| Joe|       HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe|   Rowing|
# |2022-06-21 09:00:00|Lisa|  Working|
# +-------------------+----+---------+

We can take a lead() (next timestamp) for each name and subtract it from current timestamp to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.

data_sdf. \
    withColumn('duration_sec', 
               func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'), 
                             func.lit(0)
                             )
               ). \
    withColumn('duration_min', func.col('duration_sec') / 60). \
    withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
    show()

# +-------------------+----+---------+------------+------------+-------------+
# |                 ts|name| activity|duration_sec|duration_min|duration_time|
# +-------------------+----+---------+------------+------------+-------------+
# |2022-06-21 08:30:00| Joe|       HW|        1800|        30.0|     00:30:00|
# |2022-06-21 09:00:00| Joe|   Rowing|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Lisa|  Working|        3600|        60.0|     01:00:00|
# |2022-06-21 09:00:00|Lisa|  Working|           0|         0.0|     00:00:00|
# +-------------------+----+---------+------------+------------+-------------+

Pyspark Dataframe substract rows from the same column?

Question

1 answers

solution1
1 ACCPTED 2022-07-07 11:04:15

Pyspark Dataframe substract rows from the same column?

Question

1 answers

solution1 1 ACCPTED 2022-07-07 11:04:15

solution1
1 ACCPTED 2022-07-07 11:04:15