
What is an effective way to return a sum distributed by days using Apache Spark or another similar solution?

Let's imagine we have a number of records with the attributes id, start_day, end_date, sum. These records have different periods defined by their start and end dates, and the lengths of these periods vary.

I need to get a set of records as the result like:

id, part_id, date, sum/(end_date - start_date)
...

for each day of each period. So the sum for each record is distributed across all dates that belong to that record's period.

As an example, if I had the initial set:

1, 2022-12-01, 2022-12-03, 12
2, 2022-12-05, 2022-12-10, 100

I would expect to get this:

1, 1, 2022-12-01, 6
1, 2, 2022-12-02, 6
2, 1, 2022-12-05, 20
2, 2, 2022-12-06, 20
2, 3, 2022-12-07, 20
2, 4, 2022-12-08, 20
2, 5, 2022-12-09, 20

I am researching possible approaches for implementing a solution to analyze this data. I understand there is a way to do it with SQL in an RDBMS, but if there is a better way using Apache Spark or something else, I would start digging into that more deeply.

I tried to optimize SQL queries in an RDBMS and realized that making such queries run fast is a tough challenge for both the developer and Postgres. I tried a MapReduce approach in Java; it works well and seems scalable, but I would prefer not to run such logic at the application level.

I am not looking for an exact answer if this is a complex question; I would really appreciate any opinion on the best tool for processing such queries. Thanks!

You can use this expression to generate all the days between two dates:

"sequence(start_day, end_date, interval 1 day)"

This works in Spark 2.4+. Then use datediff to compute the number of days between start_day and end_date, and divide the sum by that number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "2022-12-01", "2022-12-03", 12),
  (2, "2022-12-05", "2022-12-10", 100)
).toDF("id", "start_day", "end_date", "sum")

// part_id = position of the day within its period
val w = Window.partitionBy("id").orderBy("date")

df.withColumn("start_day", col("start_day").cast("date"))
  // the expected output excludes end_date itself, so shift it back one day
  .withColumn("end_date", date_add(col("end_date").cast("date"), -1))
  // number of days the sum is spread over
  .withColumn("datesDiff", datediff(col("end_date"), col("start_day")) + 1)
  // one output row per day of the period
  .withColumn("date", explode(expr("sequence(start_day, end_date, interval 1 day)")))
  .withColumn("idx", row_number().over(w))
  .withColumn("sum", col("sum").divide(col("datesDiff")))
  .select("id", "idx", "date", "sum")
  .show(false)


+---+---+----------+----+
|id |idx|date      |sum |
+---+---+----------+----+
|1  |1  |2022-12-01|6.0 |
|1  |2  |2022-12-02|6.0 |
|2  |1  |2022-12-05|20.0|
|2  |2  |2022-12-06|20.0|
|2  |3  |2022-12-07|20.0|
|2  |4  |2022-12-08|20.0|
|2  |5  |2022-12-09|20.0|
+---+---+----------+----+
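
The same logic can also be expressed in pure Spark SQL. Here is a minimal sketch, assuming the DataFrame is registered as a temp view named records (the view name and the days alias are my own choices, not from the code above):

// `records` and `days` are assumed names for this sketch
df.createOrReplaceTempView("records")
spark.sql("""
  SELECT id,
         row_number() OVER (PARTITION BY id ORDER BY date) AS idx,
         date,
         `sum` / datediff(to_date(end_date), to_date(start_day)) AS `sum`
  FROM records
  LATERAL VIEW explode(
    sequence(to_date(start_day), date_sub(to_date(end_date), 1), interval 1 day)
  ) days AS date
""").show(false)

Note that datediff(end_date, start_day) here already equals the number of covered days, since end_date is treated as exclusive, so no extra +1 bookkeeping is needed.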
