I have the following example as a pyspark dataframe:
Timeframe | Person | Activity |
---|---|---|
2022-06-21 8:00:00 | Lisa | Working |
2022-06-21 8:30:00 | Joseph | Homework |
2022-06-21 8:00:00 | Michael | Gardening |
2022-06-21 9:00:00 | Joseph | Rowing |
2022-06-21 9:00:00 | Lisa | Working |
2022-06-21 9:15:00 | Joseph | Football |
2022-06-21 10:00:00 | Joseph | Dancing |
2022-06-21 10:00:00 | Lisa | Watering |
2022-06-21 10:30:00 | Joseph | Gaming |
I would like to calculate how long each activity for each person lasted. For example create a new column like this:
Timeframe | Person | Activity | Duration |
---|---|---|---|
2022-06-21 8:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 8:30:00 | Joseph | Homework | 00:30:00 |
2022-06-21 8:00:00 | Michael | Gardening | 01:15:00 |
2022-06-21 9:00:00 | Joseph | Rowing | 01:00:00 |
2022-06-21 9:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 9:15:00 | Michael | Football | 01:45:00 |
2022-06-21 10:00:00 | Joseph | Dancing | N/A |
2022-06-21 10:00:00 | Lisa | Watering | N/A |
2022-06-21 10:30:00 | Michael | Gaming | N/A |
I need to substract the Timeframe row for each person separatly and create a new column. There is no pause in between. How can it be done in Pyspark or alternatively in Pandas?
Thanks!
We can calculate the time difference in seconds and convert it to the required format.
Using a subset of your data for example.
data_ls = [
('2022-06-21 8:00:00', 'Lisa', 'Working'),
('2022-06-21 8:30:00', 'Joe', 'HW'),
('2022-06-21 8:00:00', 'Mike', 'Gardening'),
('2022-06-21 9:00:00', 'Joe', 'Rowing'),
('2022-06-21 9:00:00', 'Lisa', 'Working')
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
withColumn('ts', func.col('ts').cast('timestamp'))
# +-------------------+----+---------+
# | ts|name| activity|
# +-------------------+----+---------+
# |2022-06-21 08:00:00|Lisa| Working|
# |2022-06-21 08:30:00| Joe| HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe| Rowing|
# |2022-06-21 09:00:00|Lisa| Working|
# +-------------------+----+---------+
We can take a lead()
(next timestamp) for each name
and subtract it from current timestamp to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.
data_sdf. \
withColumn('duration_sec',
func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'),
func.lit(0)
)
). \
withColumn('duration_min', func.col('duration_sec') / 60). \
withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
show()
# +-------------------+----+---------+------------+------------+-------------+
# | ts|name| activity|duration_sec|duration_min|duration_time|
# +-------------------+----+---------+------------+------------+-------------+
# |2022-06-21 08:30:00| Joe| HW| 1800| 30.0| 00:30:00|
# |2022-06-21 09:00:00| Joe| Rowing| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Lisa| Working| 3600| 60.0| 01:00:00|
# |2022-06-21 09:00:00|Lisa| Working| 0| 0.0| 00:00:00|
# +-------------------+----+---------+------------+------------+-------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.