I wanted to calculate Days_btwn_Shpmnt which is nothing but the number of days between the Ship Date. Need to calculate this across the first and second record and so on.
Can you help me how this can be done using Spark/Scala?
Thanks, Joe
You can accomplish this using lag
function in spark. A sample script shows how it can done. Please note that the date has to be formatted in yyyy-mm-dd
format for datediff
function.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq((1000, "2016-01-19"), (1000, "2016-02-12"), (1000, "2016-02-18"), (1000, "2016-02-04")).toDF("product_id", "date")
val result = df.withColumn("last_date" ,lag("date", 1).over(Window.partitionBy($"product_id").orderBy($"date"))).withColumn("daysToShipMent", datediff($"date", $"last_date"))
scala> result.select("product_id", "date", "daysToShipMent" ).show()
+----------+----------+--------------+
|product_id| date|daysToShipMent|
+----------+----------+--------------+
| 1000|2016-01-19| null|
| 1000|2016-02-04| 16|
| 1000|2016-02-12| 8|
| 1000|2016-02-18| 6|
+----------+----------+--------------+
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.