How to find the difference between the first and last values partitioned by a column in PySpark
I have a dataframe like this:
+------+-----+----+-----------------------+
|label |value|unit|dateTime               |
+------+-----+----+-----------------------+
|Uiqcnt|475  |    |2020-04-11T21:35:13.410|
|Uiqcnt|475  |    |2020-04-11T21:35:13.910|
|Uiqcnt|475  |    |2020-04-11T21:35:14.400|
|Uiqcnt|476  |    |2020-04-11T21:35:14.910|
|Uiqcnt|476  |    |2020-04-11T21:35:15.400|
|Uiqcnt|476  |    |2020-04-11T21:35:15.910|
|Uiqcnt|477  |    |2020-04-11T21:35:16.410|
|Uiqcnt|477  |    |2020-04-11T21:35:16.910|
|Uiqcnt|477  |    |2020-04-11T21:35:17.420|
|Uiqcnt|478  |    |2020-04-11T21:35:17.920|
|Uiqcnt|478  |    |2020-04-11T21:35:18.430|
+------+-----+----+-----------------------+
I want to get the time difference between the first and last dateTime within each partition by value. Considering a large amount of data, how can I do this in the most efficient way?
You can group the dataset by value and then calculate the min and max dates. After that you can calculate the difference between the min and the max. I assume that the results can be rounded to a second so that unix_timestamp can be used.
from pyspark.sql import functions as F

# Assumes dateTime is a timestamp column; unix_timestamp returns whole seconds
df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("minUnix", F.unix_timestamp(F.col("min"))) \
    .withColumn("maxUnix", F.unix_timestamp(F.col("max"))) \
    .withColumn("diff", F.col("maxUnix") - F.col("minUnix")) \
    .select("value", "diff") \
    .show(truncate=False)
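To see what whole-second resolution does to the sample data, here is a minimal plain-Python sketch (no Spark needed). It assumes the fractional part of each timestamp is dropped before the subtraction, which is how epoch seconds behave; the helper name whole_second_diff is made up for illustration and is not part of PySpark.

```python
from datetime import datetime, timezone

def whole_second_diff(first: str, last: str) -> int:
    """Difference in whole epoch seconds, fractions dropped before subtracting."""
    def epoch_seconds(text: str) -> int:
        # Treat the naive sample timestamps as UTC so the result is deterministic
        ts = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
        return int(ts.timestamp())
    return epoch_seconds(last) - epoch_seconds(first)

# First and last rows for value 478 in the sample data
print(whole_second_diff("2020-04-11T21:35:17.920", "2020-04-11T21:35:18.430"))
```

For value 478 the real gap is 0.51 seconds, but the whole-second difference comes out as 1, so this variant only fits cases where sub-second precision does not matter.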
If you also need fractions of a second, a udf can help:
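from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

# The udf receives Python datetime objects, so dateTime must be a timestamp column
time_delta = F.udf(lambda start, end: (end - start).total_seconds(), FloatType())
df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("diff", time_delta(F.col("min"), F.col("max"))) \
    .show(truncate=False)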
prints
+-----+----------------------+----------------------+----+
|value|min |max |diff|
+-----+----------------------+----------------------+----+
|476 |2020-04-11 21:35:14.91|2020-04-11 21:35:15.91|1.0 |
|477 |2020-04-11 21:35:16.41|2020-04-11 21:35:17.42|1.01|
|478 |2020-04-11 21:35:17.92|2020-04-11 21:35:18.43|0.51|
|475 |2020-04-11 21:35:13.41|2020-04-11 21:35:14.4 |0.99|
+-----+----------------------+----------------------+----+
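As a sanity check on the diff column, the same group-then-min/max logic can be sketched in plain Python on the sample rows from the question. This is only an illustration of the arithmetic, not a replacement for the Spark job; the function name first_last_diff is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# (value, dateTime) pairs copied from the sample dataframe
rows = [
    (475, "2020-04-11T21:35:13.410"), (475, "2020-04-11T21:35:13.910"),
    (475, "2020-04-11T21:35:14.400"), (476, "2020-04-11T21:35:14.910"),
    (476, "2020-04-11T21:35:15.400"), (476, "2020-04-11T21:35:15.910"),
    (477, "2020-04-11T21:35:16.410"), (477, "2020-04-11T21:35:16.910"),
    (477, "2020-04-11T21:35:17.420"), (478, "2020-04-11T21:35:17.920"),
    (478, "2020-04-11T21:35:18.430"),
]

def first_last_diff(pairs):
    """Return {value: seconds between the earliest and latest dateTime}."""
    groups = defaultdict(list)
    for value, text in pairs:
        groups[value].append(datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f"))
    return {v: (max(ts) - min(ts)).total_seconds() for v, ts in groups.items()}

diffs = first_last_diff(rows)
for value in sorted(diffs):
    print(value, diffs[value])
```

The per-value results agree with the diff column printed by the udf version.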