简体   繁体   中英

Pyspark Dataframe substract rows from the same column?

I have the following example as a pyspark dataframe:

Timeframe Person Activity
2022-06-21 8:00:00 Lisa Working
2022-06-21 8:30:00 Joseph Homework
2022-06-21 8:00:00 Michael Gardening
2022-06-21 9:00:00 Joseph Rowing
2022-06-21 9:00:00 Lisa Working
2022-06-21 9:15:00 Joseph Football
2022-06-21 10:00:00 Joseph Dancing
2022-06-21 10:00:00 Lisa Watering
2022-06-21 10:30:00 Joseph Gaming

I would like to calculate how long each activity for each person lasted. For example create a new column like this:

Timeframe Person Activity Duration
2022-06-21 8:00:00 Lisa Working 01:00:00
2022-06-21 8:30:00 Joseph Homework 00:30:00
2022-06-21 8:00:00 Michael Gardening 01:15:00
2022-06-21 9:00:00 Joseph Rowing 01:00:00
2022-06-21 9:00:00 Lisa Working 01:00:00
2022-06-21 9:15:00 Michael Football 01:45:00
2022-06-21 10:00:00 Joseph Dancing N/A
2022-06-21 10:00:00 Lisa Watering N/A
2022-06-21 10:30:00 Michael Gaming N/A

I need to substract the Timeframe row for each person separatly and create a new column. There is no pause in between. How can it be done in Pyspark or alternatively in Pandas?

Thanks!

We can calculate the time difference in seconds and convert it to the required format.

Using a subset of your data for example.

data_ls = [
    ('2022-06-21 8:00:00', 'Lisa', 'Working'),
    ('2022-06-21 8:30:00', 'Joe', 'HW'),
    ('2022-06-21 8:00:00', 'Mike', 'Gardening'),
    ('2022-06-21 9:00:00', 'Joe', 'Rowing'),
    ('2022-06-21 9:00:00', 'Lisa', 'Working')
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
    withColumn('ts', func.col('ts').cast('timestamp'))

# +-------------------+----+---------+
# |                 ts|name| activity|
# +-------------------+----+---------+
# |2022-06-21 08:00:00|Lisa|  Working|
# |2022-06-21 08:30:00| Joe|       HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe|   Rowing|
# |2022-06-21 09:00:00|Lisa|  Working|
# +-------------------+----+---------+

We can take a lead() (next timestamp) for each name and subtract it from current timestamp to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.

data_sdf. \
    withColumn('duration_sec', 
               func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'), 
                             func.lit(0)
                             )
               ). \
    withColumn('duration_min', func.col('duration_sec') / 60). \
    withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
    show()

# +-------------------+----+---------+------------+------------+-------------+
# |                 ts|name| activity|duration_sec|duration_min|duration_time|
# +-------------------+----+---------+------------+------------+-------------+
# |2022-06-21 08:30:00| Joe|       HW|        1800|        30.0|     00:30:00|
# |2022-06-21 09:00:00| Joe|   Rowing|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Lisa|  Working|        3600|        60.0|     01:00:00|
# |2022-06-21 09:00:00|Lisa|  Working|           0|         0.0|     00:00:00|
# +-------------------+----+---------+------------+------------+-------------+

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM