
How to calculate duration between records in Spark/Scala?

Please see the image of my dataset.

I want to calculate Days_btwn_Shpmnt, which is the number of days between consecutive Ship Date values. It needs to be calculated between the first and second records, then the second and third, and so on.

Can you help me with how this can be done using Spark/Scala?

Thanks, Joe

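You can accomplish this using the lag window function in Spark, which returns the value from the previous row within a window. The sample script below shows how it can be done. Please note that the date strings have to be in yyyy-MM-dd format (for example 2016-01-19) for the datediff function to interpret them as dates. The first record of each product has no previous record, so its result is null.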

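import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// toDF and $"..." need spark.implicits._ (spark-shell imports it automatically).
val df = Seq((1000, "2016-01-19"), (1000, "2016-02-12"), (1000, "2016-02-18"), (1000, "2016-02-04")).toDF("product_id", "date")
// Within each product, order rows by date so that lag returns the previous shipment date.
val byProduct = Window.partitionBy($"product_id").orderBy($"date")
val withPrev = df.withColumn("last_date", lag("date", 1).over(byProduct))
val result = withPrev.withColumn("daysToShipMent", datediff($"date", $"last_date"))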

scala> result.select("product_id", "date", "daysToShipMent" ).show()
+----------+----------+--------------+
|product_id|      date|daysToShipMent|
+----------+----------+--------------+
|      1000|2016-01-19|          null|
|      1000|2016-02-04|            16|
|      1000|2016-02-12|             8|
|      1000|2016-02-18|             6|
+----------+----------+--------------+
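
If the ship dates are not already in yyyy-MM-dd format, one option is to parse them first with to_date and a pattern, then apply the same lag and datediff steps. The following is only a sketch under assumptions: the MM/dd/yyyy input pattern, the column names ship_date_raw, ship_date and days_btwn_shpmnt, and the helper withDaysBetween are made up for illustration and do not come from the question's dataset, and the two-argument to_date requires Spark 2.2 or later.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, datediff, lag, to_date}

object ShipmentGaps {
  // Hypothetical helper: expects columns product_id and ship_date_raw (a string),
  // and a date pattern describing ship_date_raw, e.g. "MM/dd/yyyy".
  def withDaysBetween(raw: DataFrame, pattern: String): DataFrame = {
    // Order by the parsed date, not the raw string, so rows sort chronologically.
    val byProduct = Window.partitionBy(col("product_id")).orderBy(col("ship_date"))

    raw
      // Parse the string into a proper DateType column.
      .withColumn("ship_date", to_date(col("ship_date_raw"), pattern))
      // Days since the previous shipment of the same product (null for the first one).
      .withColumn(
        "days_btwn_shpmnt",
        datediff(col("ship_date"), lag(col("ship_date"), 1).over(byProduct)))
  }
}
```

In spark-shell this could be called as ShipmentGaps.withDaysBetween(df, "MM/dd/yyyy").show() on a DataFrame that has those two columns. Parsing before ordering matters because strings such as 12/01/2015 and 01/19/2016 do not sort chronologically as text.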
