How to calculate duration between records in Spark/Scala?
I want to calculate Days_btwn_Shpmnt, which is the number of days between consecutive Ship Date values. It needs to be calculated between the first and second records, between the second and third, and so on.
Can you help me with how this can be done using Spark/Scala?
Thanks, Joe
You can accomplish this using the lag function in Spark. The sample script below shows how it can be done. Please note that if the date is stored as a string, it has to be in yyyy-MM-dd format (for example 2016-02-04) so that the datediff function can treat it as a date.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// toDF and $"..." need spark.implicits._ (already imported in spark-shell)
val df = Seq((1000, "2016-01-19"), (1000, "2016-02-12"), (1000, "2016-02-18"), (1000, "2016-02-04")).toDF("product_id", "date")
// Previous ship date per product, then the day gap to it (null for the first record)
val byProduct = Window.partitionBy($"product_id").orderBy($"date")
val result = df.withColumn("last_date", lag("date", 1).over(byProduct))
  .withColumn("daysToShipMent", datediff($"date", $"last_date"))
scala> result.select("product_id", "date", "daysToShipMent").show()
+----------+----------+--------------+
|product_id|      date|daysToShipMent|
+----------+----------+--------------+
|      1000|2016-01-19|          null|
|      1000|2016-02-04|            16|
|      1000|2016-02-12|             8|
|      1000|2016-02-18|             6|
+----------+----------+--------------+
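If you want to sanity-check those numbers without a Spark session, the same arithmetic can be sketched in plain Scala with java.time. This is only an illustration of what lag plus datediff computes for a single product, not a replacement for the window function; the names shipDates, sorted and gaps are made up for this example, and it is meant to be pasted into the Scala REPL or spark-shell.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Same ship dates as the sample DataFrame, in the same unsorted order.
val shipDates = Seq("2016-01-19", "2016-02-12", "2016-02-18", "2016-02-04")
  .map(s => LocalDate.parse(s)) // ISO yyyy-MM-dd strings parse directly

// Sort ascending, mirroring the orderBy on the date column in the window.
val sorted = shipDates.sortBy(_.toEpochDay)

// Pair each date with its predecessor and count whole days between them.
// The first record has no predecessor (lag gives null there), modelled as None.
val gaps: Seq[Option[Long]] =
  None +: sorted.zip(sorted.tail).map { case (prev, cur) =>
    Some(ChronoUnit.DAYS.between(prev, cur))
  }

// Print one line per record: date and gap in days.
sorted.zip(gaps).foreach { case (date, gap) =>
  println(s"$date ${gap.map(_.toString).getOrElse("null")}")
}
```

The gaps come out as null, 16, 8 and 6, which agrees with the daysToShipMent column above.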