
How to find the difference between the first and last values partitioned by a column in PySpark

I have a dataframe like this:

+------+-----+----+-----------------------+
|label |value|unit|dateTime               |
+------+-----+----+-----------------------+
|Uiqcnt|475  |    |2020-04-11T21:35:13.410|
|Uiqcnt|475  |    |2020-04-11T21:35:13.910|
|Uiqcnt|475  |    |2020-04-11T21:35:14.400|
|Uiqcnt|476  |    |2020-04-11T21:35:14.910|
|Uiqcnt|476  |    |2020-04-11T21:35:15.400|
|Uiqcnt|476  |    |2020-04-11T21:35:15.910|
|Uiqcnt|477  |    |2020-04-11T21:35:16.410|
|Uiqcnt|477  |    |2020-04-11T21:35:16.910|
|Uiqcnt|477  |    |2020-04-11T21:35:17.420|
|Uiqcnt|478  |    |2020-04-11T21:35:17.920|
|Uiqcnt|478  |    |2020-04-11T21:35:18.430|
+------+-----+----+-----------------------+

I want to get the time difference within each value partition. Considering the large amount of data, how can I do this in the most efficient way?
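For reference, a minimal sketch that rebuilds a comparable sample dataframe (the column names follow the table above; creating the session and parsing dateTime into a timestamp with this exact format string are assumptions, not part of the original post):

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical reconstruction of the sample data shown above.
    rows = [
        ("Uiqcnt", 475, "", "2020-04-11T21:35:13.410"),
        ("Uiqcnt", 475, "", "2020-04-11T21:35:13.910"),
        ("Uiqcnt", 475, "", "2020-04-11T21:35:14.400"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:14.910"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:15.400"),
        ("Uiqcnt", 476, "", "2020-04-11T21:35:15.910"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:16.410"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:16.910"),
        ("Uiqcnt", 477, "", "2020-04-11T21:35:17.420"),
        ("Uiqcnt", 478, "", "2020-04-11T21:35:17.920"),
        ("Uiqcnt", 478, "", "2020-04-11T21:35:18.430"),
    ]
    df = spark.createDataFrame(rows, ["label", "value", "unit", "dateTime"]) \
        .withColumn("dateTime", F.to_timestamp("dateTime", "yyyy-MM-dd'T'HH:mm:ss.SSS"))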

You can group the dataset by value and then calculate the min and max dates. After that you can calculate the difference between the min and the max. I assume that the results can be rounded to whole seconds, so that unix_timestamp can be used:

from pyspark.sql import functions as F

# Per value: earliest and latest timestamps, then the whole-second difference.
df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("minUnix", F.unix_timestamp(F.col("min"))) \
    .withColumn("maxUnix", F.unix_timestamp(F.col("max"))) \
    .withColumn("diff", F.col("maxUnix") - F.col("minUnix")) \
    .select("value", "diff") \
    .show(truncate=False)

If you also need the fractions of a second, a UDF can help:

from pyspark.sql.types import FloatType

# The timestamp columns arrive in the UDF as Python datetime objects,
# so subtracting them gives a timedelta and total_seconds() keeps the fraction.
time_delta = F.udf(lambda min, max: (max - min).total_seconds(), FloatType())

df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
    .withColumn("diff", time_delta(F.col("min"), F.col("max"))) \
    .show(truncate=False)

which prints:

+-----+----------------------+----------------------+----+
|value|min                   |max                   |diff|
+-----+----------------------+----------------------+----+
|476  |2020-04-11 21:35:14.91|2020-04-11 21:35:15.91|1.0 |
|477  |2020-04-11 21:35:16.41|2020-04-11 21:35:17.42|1.01|
|478  |2020-04-11 21:35:17.92|2020-04-11 21:35:18.43|0.51|
|475  |2020-04-11 21:35:13.41|2020-04-11 21:35:14.4 |0.99|
+-----+----------------------+----------------------+----+
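If you want the fractional seconds but would rather avoid the overhead of a Python UDF, one possible alternative (a sketch, not part of the original answer) is to cast the timestamps to double, which yields epoch seconds including the fractional part:

    # Alternative sketch: casting a timestamp to double gives epoch seconds
    # with fractions, so the subtraction stays in native Spark SQL.
    df.groupBy("value").agg(F.min("dateTime").alias("min"), F.max("dateTime").alias("max")) \
        .withColumn("diff", F.col("max").cast("double") - F.col("min").cast("double")) \
        .select("value", "min", "max", "diff") \
        .show(truncate=False)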
