Pyspark Dataframe: subtract rows from the same column?
I have the following example as a pyspark dataframe:
Timeframe | Person | Activity |
---|---|---|
2022-06-21 8:00:00 | Lisa | Working |
2022-06-21 8:30:00 | Joseph | Homework |
2022-06-21 8:00:00 | Michael | Gardening |
2022-06-21 9:00:00 | Joseph | Rowing |
2022-06-21 9:00:00 | Lisa | Working |
2022-06-21 9:15:00 | Joseph | Football |
2022-06-21 10:00:00 | Joseph | Dancing |
2022-06-21 10:00:00 | Lisa | Watering |
2022-06-21 10:30:00 | Joseph | Gaming |
I would like to calculate how long each activity lasted for each person. For example, create a new column like this:
Timeframe | Person | Activity | Duration |
---|---|---|---|
2022-06-21 8:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 8:30:00 | Joseph | Homework | 00:30:00 |
2022-06-21 8:00:00 | Michael | Gardening | 01:15:00 |
2022-06-21 9:00:00 | Joseph | Rowing | 01:00:00 |
2022-06-21 9:00:00 | Lisa | Working | 01:00:00 |
2022-06-21 9:15:00 | Michael | Football | 01:45:00 |
2022-06-21 10:00:00 | Joseph | Dancing | N/A |
2022-06-21 10:00:00 | Lisa | Watering | N/A |
2022-06-21 10:30:00 | Michael | Gaming | N/A |
I need to subtract the Timeframe rows for each person separately and create a new column. There is no pause in between. How can this be done in Pyspark, or alternatively in Pandas?
Thanks!
We can calculate the time difference in seconds and convert it to the required format.
Using a subset of your data as an example:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

data_ls = [
    ('2022-06-21 8:00:00', 'Lisa', 'Working'),
    ('2022-06-21 8:30:00', 'Joe', 'HW'),
    ('2022-06-21 8:00:00', 'Mike', 'Gardening'),
    ('2022-06-21 9:00:00', 'Joe', 'Rowing'),
    ('2022-06-21 9:00:00', 'Lisa', 'Working')
]

# build the dataframe and cast the string column to a proper timestamp
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['ts', 'name', 'activity']). \
    withColumn('ts', func.col('ts').cast('timestamp'))
# +-------------------+----+---------+
# | ts|name| activity|
# +-------------------+----+---------+
# |2022-06-21 08:00:00|Lisa| Working|
# |2022-06-21 08:30:00| Joe| HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe| Rowing|
# |2022-06-21 09:00:00|Lisa| Working|
# +-------------------+----+---------+
We can take a lead() (next timestamp) over each name partition and subtract the current timestamp from it to get the duration in seconds. Using the seconds, we can calculate minutes, hours, or even format it as a time string.
# lead('ts') looks at the next timestamp within each person's partition;
# the last activity per person has no next event, so coalesce fills it with 0
data_sdf. \
    withColumn('duration_sec',
               func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'),
                             func.lit(0))). \
    withColumn('duration_min', func.col('duration_sec') / 60). \
    withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
    show()
# +-------------------+----+---------+------------+------------+-------------+
# | ts|name| activity|duration_sec|duration_min|duration_time|
# +-------------------+----+---------+------------+------------+-------------+
# |2022-06-21 08:30:00| Joe| HW| 1800| 30.0| 00:30:00|
# |2022-06-21 09:00:00| Joe| Rowing| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening| 0| 0.0| 00:00:00|
# |2022-06-21 08:00:00|Lisa| Working| 3600| 60.0| 01:00:00|
# |2022-06-21 09:00:00|Lisa| Working| 0| 0.0| 00:00:00|
# +-------------------+----+---------+------------+------------+-------------+
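Note that coalesce(..., func.lit(0)) marks the last activity of each person with 0 seconds; if you want the N/A shown in the desired output instead, drop the coalesce and lead() will leave those rows null.

Since the question also asks about Pandas: a minimal sketch of the same idea (not part of the original answer) uses groupby().shift(-1) to fetch the next timestamp per person; the column names simply mirror the PySpark example.

import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime(['2022-06-21 8:00:00', '2022-06-21 8:30:00',
                          '2022-06-21 8:00:00', '2022-06-21 9:00:00',
                          '2022-06-21 9:00:00']),
    'name': ['Lisa', 'Joe', 'Mike', 'Joe', 'Lisa'],
    'activity': ['Working', 'HW', 'Gardening', 'Rowing', 'Working'],
})

# sort so shift(-1) within each person looks at the next event in time
df = df.sort_values(['name', 'ts'])

# next timestamp per person minus current timestamp; the last activity of
# each person has no successor, so its duration stays NaT (the "N/A" rows)
df['duration'] = df.groupby('name')['ts'].shift(-1) - df['ts']
print(df)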