
How to find the difference between the first and last values partitioned by a column in PySpark

I have a dataframe like this:

+------+-----+----+-----------------------+
|label |value|unit|dateTime               |
+------+-----+----+-----------------------+
|Uiqcnt|475  |    |2020-04-11T21:35:13.410|
|Uiqcnt|475  |    |2020-04-11T21:35:13.910|
|Uiqcnt|475  |    |2020-04-11T21:35:14.400|
|Uiqcnt|476  |    |2020-04-11T21:35:14.910|
|Uiqcnt|476  |    |2020-04-11T21:35:15.400|
|Uiqcnt|476  |    |2020-04-11T21:35:15.910|
|Uiqcnt|477  |    |2020-04-11T21:35:16.410|
|Uiqcnt|477  |    |2020-04-11T21:35:16.910|
|Uiqcnt|477  |    |2020-04-11T21:35:17.420|
|Uiqcnt|478  |    |2020-04-11T21:35:17.920|
|Uiqcnt|478  |    |2020-04-11T21:35:18.430|
+------+-----+----+-----------------------+

I want to get the time difference within each value partition. Considering the large amount of data, how can I do this in the most efficient way?
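For reference, a minimal sketch that rebuilds a comparable sample dataframe (the column names follow the table above; creating the session and parsing dateTime into a timestamp with this exact format string are assumptions, not part of the original post):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical reconstruction of the sample data shown above.
    rows = [
        ("Uiqcnt", 475, "", "2020-04-11T21:35:13.410"),
        ("Uiqcnt", 475, "", "2020-04-11T21:35:13.910"),
        ("Uiqcnt", 475, "", "2020-04-11T21:35:14.400"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:14.910"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:15.400"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:15.910"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:16.410"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:16.910"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:17.420"),
        ("Uiqcnt", 478, "", "2020-04-11T21:35:17.920"),
        ("Uiqcnt", 478, "", "2020-04-11T21:35:18.430"),
    ]
    df = spark.createDataFrame(rows, ["label", "value", "unit", "dateTime"]) \
        .withColumn("dateTime", F.to_timestamp("dateTime", "yyyy-MM-dd'T'HH:mm:ss.SSS"))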

You can group the dataset by value and then calculate the min and max dates. After that you can calculate the difference between the min and the max. I assume that the results can be rounded to whole seconds, so that unix_timestamp can be used:

from pyspark.sql import functions as F

# Per value: earliest and latest timestamps, then the whole-second difference.
df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("minUnix", F.unix_timestamp(F.col("min"))) \
    .withColumn("maxUnix", F.unix_timestamp(F.col("max"))) \
    .withColumn("diff", F.col("maxUnix") - F.col("minUnix")) \
    .select("value", "diff") \
    .show(truncate=False)

If you also need the fractions of a second, a UDF can help:

from pyspark.sql.types import FloatType

# The timestamp columns arrive in the UDF as Python datetime objects,
# so subtracting them gives a timedelta and total_seconds() keeps the fraction.
time_delta = F.udf(lambda min, max: (max - min).total_seconds(), FloatType())

df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("diff", time_delta(F.col("min"), F.col("max"))) \
    .show(truncate=False)

which prints:

+-----+----------------------+----------------------+----+
|value|min                   |max                   |diff|
+-----+----------------------+----------------------+----+
|476  |2020-04-11 21:35:14.91|2020-04-11 21:35:15.91|1.0 |
|477  |2020-04-11 21:35:16.41|2020-04-11 21:35:17.42|1.01|
|478  |2020-04-11 21:35:17.92|2020-04-11 21:35:18.43|0.51|
|475  |2020-04-11 21:35:13.41|2020-04-11 21:35:14.4 |0.99|
+-----+----------------------+----------------------+----+
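If you want the fractional seconds but would rather avoid the overhead of a Python UDF, one possible alternative (a sketch, not part of the original answer) is to cast the timestamps to double, which yields epoch seconds including the fractional part:

    # Alternative sketch: casting a timestamp to double gives epoch seconds
    # with fractions, so the subtraction stays in native Spark SQL.
    df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
        .withColumn("diff", F.col("max").cast("double") - F.col("min").cast("double")) \
        .select("value", "min", "max", "diff") \
        .show(truncate=False)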
