Let's imagine we have a number of records with the attributes id, start_day, end_date, sum. These records cover different periods, defined by start and end dates, and the lengths of these periods differ.
I need to get a set of records as the result like:
id, part_id, date, sum/(end_date - start_date)
...
for each day of each period, so that the sum of each record is distributed across all dates that belong to that record's period.
As an example, if I had this initial set:
1, 2022-12-01, 2022-12-03, 12
2, 2022-12-05, 2022-12-10, 100
I would expect to get this:
1, 1, 2022-12-01, 6
1, 2, 2022-12-02, 6
2, 1, 2022-12-05, 20
2, 2, 2022-12-06, 20
2, 3, 2022-12-07, 20
2, 4, 2022-12-08, 20
2, 5, 2022-12-09, 20
I am researching possible approaches to implement a solution for analyzing such data. I understand there is a way to do it with SQL in an RDBMS, but if there is a better way using Apache Spark or something else, I would start digging into it more deeply.
I tried to optimize the SQL queries in the RDBMS and realized that making such queries run fast is a tough challenge for both the developer and Postgres. I also tried a MapReduce approach in Java; it works well and seems scalable, but I would prefer not to run such logic at the application level.
I am not looking for an exact answer if this is a complex question; I would really appreciate any opinion on the best tool for processing such queries. Thanks!
You can use this expression to generate all days between 2 dates:
"sequence(start_day, end_date, interval 1 day)"
This works for Spark 2.4+. Then use datediff to calculate the number of days between start_day and end_date, and divide the sum by that number:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "2022-12-01", "2022-12-03", 12),
  (2, "2022-12-05", "2022-12-10", 100)
).toDF("id", "start_day", "end_date", "sum")

val w = Window.partitionBy("id").orderBy("date")

df.withColumn("start_day", col("start_day").cast("date"))
  // shift end_date back one day so the end date itself gets no share
  .withColumn("end_date", date_add(col("end_date").cast("date"), -1))
  .withColumn("datesDiff", datediff(col("end_date"), col("start_day")) + 1)
  // one output row per day in the (now inclusive) range [start_day, end_date]
  .withColumn("date", explode(expr("sequence(start_day, end_date, interval 1 day)")))
  .withColumn("idx", row_number().over(w))
  .withColumn("sum", col("sum").divide(col("datesDiff")))
  .select("id", "idx", "date", "sum")
  .show(false)
+---+---+----------+----+
|id |idx|date |sum |
+---+---+----------+----+
|1 |1 |2022-12-01|6.0 |
|1 |2 |2022-12-02|6.0 |
|2 |1 |2022-12-05|20.0|
|2 |2 |2022-12-06|20.0|
|2 |3 |2022-12-07|20.0|
|2 |4 |2022-12-08|20.0|
|2 |5 |2022-12-09|20.0|
+---+---+----------+----+