
How to calculate duration between records in Spark/Scala?

Please see the image of my dataset.

I want to calculate Days_btwn_Shpmnt, which is simply the number of days between consecutive Ship Date values. It needs to be calculated between the first and second record, the second and third, and so on.

Can you show me how this can be done using Spark/Scala?

Thanks, Joe

You can accomplish this using the lag function in Spark. The sample script below shows how it can be done. Please note that the dates must be in yyyy-MM-dd format for the datediff function.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq((1000, "2016-01-19"), (1000, "2016-02-12"), (1000, "2016-02-18"), (1000, "2016-02-04")).toDF("product_id", "date")

val result = df
  .withColumn("last_date", lag("date", 1).over(Window.partitionBy($"product_id").orderBy($"date")))
  .withColumn("daysToShipMent", datediff($"date", $"last_date"))

scala> result.select("product_id", "date", "daysToShipMent" ).show()
+----------+----------+--------------+
|product_id|      date|daysToShipMent|
+----------+----------+--------------+
|      1000|2016-01-19|          null|
|      1000|2016-02-04|            16|
|      1000|2016-02-12|             8|
|      1000|2016-02-18|             6|
+----------+----------+--------------+
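To see what the window computation is doing, here is a minimal plain-Scala sketch (no Spark session needed) of the same logic: sort the dates, pair each date with its predecessor, and take the day gap. This is only an illustration of what lag("date", 1) plus datediff produce over an ordered window, not a replacement for the Spark version.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Same sample dates as the DataFrame above, in ISO yyyy-MM-dd format
val dates = Seq("2016-01-19", "2016-02-12", "2016-02-18", "2016-02-04")
  .map(LocalDate.parse)
  .sorted

// sliding(2) yields consecutive pairs, mirroring lag("date", 1)
// over a window ordered by date; the day difference mirrors datediff
val gaps = dates.sliding(2).map { case Seq(prev, cur) =>
  ChronoUnit.DAYS.between(prev, cur)
}.toList

println(gaps)  // List(16, 8, 6)
```

The first record has no predecessor, which is why its daysToShipMent is null in the Spark output; here it simply produces no pair.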

