
PySpark DataFrame: subtract rows from the same column?

I have the following example as a PySpark DataFrame:

Timeframe            Person   Activity
2022-06-21 8:00:00   Lisa     Working
2022-06-21 8:30:00   Joseph   Homework
2022-06-21 8:00:00   Michael  Gardening
2022-06-21 9:00:00   Joseph   Rowing
2022-06-21 9:00:00   Lisa     Working
2022-06-21 9:15:00   Michael  Football
2022-06-21 10:00:00  Joseph   Dancing
2022-06-21 10:00:00  Lisa     Watering
2022-06-21 10:30:00  Michael  Gaming

I would like to calculate how long each activity lasted for each person. For example, create a new column like this:

Timeframe            Person   Activity   Duration
2022-06-21 8:00:00   Lisa     Working    01:00:00
2022-06-21 8:30:00   Joseph   Homework   00:30:00
2022-06-21 8:00:00   Michael  Gardening  01:15:00
2022-06-21 9:00:00   Joseph   Rowing     01:00:00
2022-06-21 9:00:00   Lisa     Working    01:00:00
2022-06-21 9:15:00   Michael  Football   01:45:00
2022-06-21 10:00:00  Joseph   Dancing    N/A
2022-06-21 10:00:00  Lisa     Watering   N/A
2022-06-21 10:30:00  Michael  Gaming     N/A

I need to subtract consecutive Timeframe values for each person separately and store the result in a new column. There is no pause between activities. How can this be done in PySpark, or alternatively in Pandas?

Thanks!

We can calculate the time difference in seconds and convert it to the required format.
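As a plain-Python illustration of that idea for a single pair of timestamps, here is a minimal sketch. The helper name `format_hms` is hypothetical and not part of any library; the two timestamps are Joseph's 8:30 and 9:00 rows from the question.

```python
from datetime import datetime


def format_hms(total_seconds: int) -> str:
    """Format a non-negative number of seconds as HH:MM:SS (hypothetical helper)."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


start = datetime(2022, 6, 21, 8, 30)  # current activity begins
end = datetime(2022, 6, 21, 9, 0)     # same person's next activity begins

# Difference in whole seconds, then formatted as a time string.
duration_sec = int((end - start).total_seconds())
print(duration_sec, format_hms(duration_sec))
```

The same two steps (difference in seconds, then formatting) are what the DataFrame version has to do per person.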

For example, using a subset of your data:

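from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# Reuse the active session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()

data_ls = [
    ('2022-06-21 8:00:00', 'Lisa', 'Working'),
    ('2022-06-21 8:30:00', 'Joe', 'HW'),
    ('2022-06-21 8:00:00', 'Mike', 'Gardening'),
    ('2022-06-21 9:00:00', 'Joe', 'Rowing'),
    ('2022-06-21 9:00:00', 'Lisa', 'Working')
]

# Build the DataFrame and cast the string column to a timestamp.
data_sdf = spark.createDataFrame(data_ls, ['ts', 'name', 'activity']). \
    withColumn('ts', func.col('ts').cast('timestamp'))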

# +-------------------+----+---------+
# |                 ts|name| activity|
# +-------------------+----+---------+
# |2022-06-21 08:00:00|Lisa|  Working|
# |2022-06-21 08:30:00| Joe|       HW|
# |2022-06-21 08:00:00|Mike|Gardening|
# |2022-06-21 09:00:00| Joe|   Rowing|
# |2022-06-21 09:00:00|Lisa|  Working|
# +-------------------+----+---------+

We can take the lead() (the next timestamp) for each name and subtract the current timestamp from it to get the duration in seconds. From the seconds, we can derive minutes or hours, or format the value as a time string.

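# func is pyspark.sql.functions and wd is pyspark.sql.window.Window.
# duration_sec: next timestamp of the same name minus the current one, in seconds;
#   coalesce() turns the null on each name's last row into 0.
# duration_time: from_unixtime() renders in the session time zone, so it only
#   shows the plain duration when spark.sql.session.timeZone is UTC.
data_sdf. \
    withColumn('duration_sec',
               func.coalesce(func.lead('ts').over(wd.partitionBy('name').orderBy('ts')).cast('long') - func.col('ts').cast('long'),
                             func.lit(0)
                             )
               ). \
    withColumn('duration_min', func.col('duration_sec') / 60). \
    withColumn('duration_time', func.from_unixtime('duration_sec', format='HH:mm:ss')). \
    show()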

# +-------------------+----+---------+------------+------------+-------------+
# |                 ts|name| activity|duration_sec|duration_min|duration_time|
# +-------------------+----+---------+------------+------------+-------------+
# |2022-06-21 08:30:00| Joe|       HW|        1800|        30.0|     00:30:00|
# |2022-06-21 09:00:00| Joe|   Rowing|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Mike|Gardening|           0|         0.0|     00:00:00|
# |2022-06-21 08:00:00|Lisa|  Working|        3600|        60.0|     01:00:00|
# |2022-06-21 09:00:00|Lisa|  Working|           0|         0.0|     00:00:00|
# +-------------------+----+---------+------------+------------+-------------+
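Since the question also asks about Pandas, the same lead-and-subtract idea can be sketched with `groupby(...).shift(-1)`. This is only a sketch under assumptions: it reuses the five-row subset and the column names `ts`, `name`, `activity` from above, and the names `next_ts`, `duration` and `duration_sec` are illustrative. Unlike the `coalesce` version, it leaves the last activity of each person empty (`NaT`/`NaN`) rather than 0, which is closer to the N/A in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ts": pd.to_datetime(
            [
                "2022-06-21 8:00:00",
                "2022-06-21 8:30:00",
                "2022-06-21 8:00:00",
                "2022-06-21 9:00:00",
                "2022-06-21 9:00:00",
            ]
        ),
        "name": ["Lisa", "Joe", "Mike", "Joe", "Lisa"],
        "activity": ["Working", "HW", "Gardening", "Rowing", "Working"],
    }
)

# Order by time so that shift(-1) inside each group returns the next timestamp.
df = df.sort_values("ts")

# Next timestamp of the same person; NaT on each person's last row.
next_ts = df.groupby("name")["ts"].shift(-1)

# Timedelta between the next and the current timestamp.
df["duration"] = next_ts - df["ts"]
df["duration_sec"] = df["duration"].dt.total_seconds()

# Restore the original row order for display.
df = df.sort_index()
print(df)
```

Keeping `duration` as a timedelta column means it can still be summed or compared; convert it to a string only for display.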
