I'm new to Scala. Say I have a dataset:
>>> ds.show()
+------+-----------------+-------------+
| year | nb_product_sold | system_year |
+------+-----------------+-------------+
| 2010 |               1 |        2012 |
| 2012 |               2 |        2012 |
| 2012 |               4 |        2012 |
| 2015 |               3 |        2012 |
| 2019 |               4 |        2012 |
| 2021 |               5 |        2012 |
+------+-----------------+-------------+
and I have a List<Integer> years = {1, 3, 8}, where each value x means the x-th year after system_year. The goal is to calculate, for each x, the total number of products sold from system_year up to x years after it.
In other words, I have to calculate the total products sold from 2012 up to 2013, 2015 and 2020.
The output dataset should be like this:
+------+--------------------+
| year | total_product_sold |
+------+--------------------+
|    1 |                  6 | -> 2012 to 2013: 6 products sold
|    3 |                  9 | -> 2012 to 2015: 9 products sold
|    8 |                 13 | -> 2012 to 2020: 13 products sold
+------+--------------------+
How can I do this in Scala? Should I use groupBy() in this case?
You could have used a single groupBy with a case/when expression if the year ranges didn't overlap. Because they do overlap here (every range starts at system_year), you need one groupBy per offset, followed by a union of the 3 grouped DataFrames:
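import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._ // enables the $"col" syntax; already in scope in spark-shell
// df is the dataset from the question
val years = List(1, 3, 8)
val result = years.map { y =>
  // keep rows with system_year <= year <= system_year + y (between is inclusive)
  df.filter($"year".between($"system_year", $"system_year" + y))
    .groupBy(lit(y).as("year"))
    .agg(sum($"nb_product_sold").as("total_product_sold"))
}.reduce(_ union _)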
result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//| 1| 6|
//| 3| 9|
//| 8| 13|
//+----+------------------+
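The totals above can be re-derived by hand without Spark. The following is only a minimal plain-Scala sketch for checking the arithmetic, assuming the six sample rows from the question; the object name CumulativeTotals and the helper totals are made up for this illustration and are not part of the answer's code.

```scala
// Plain Scala (no Spark): re-compute the expected totals from the sample rows.
object CumulativeTotals {
  // (year, nb_product_sold, system_year), copied from the question's sample data
  val rows: Seq[(Int, Int, Int)] = Seq(
    (2010, 1, 2012), (2012, 2, 2012), (2012, 4, 2012),
    (2015, 3, 2012), (2019, 4, 2012), (2021, 5, 2012)
  )

  // For each offset y, sum the rows with system_year <= year <= system_year + y.
  // Both bounds are inclusive, mirroring the between(...) filter.
  def totals(offsets: Seq[Int]): Seq[(Int, Int)] =
    offsets.map { y =>
      val total = rows.collect {
        case (year, sold, sys) if year >= sys && year <= sys + y => sold
      }.sum
      (y, total)
    }

  def main(args: Array[String]): Unit =
    totals(Seq(1, 3, 8)).foreach { case (y, t) => println(s"$y -> $t") }
}
```

The 2010 row falls before system_year and the 2021 row falls after the largest range, so neither contributes to any total.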
There may be several ways to do this, some of them more efficient than what I show here, but this works for your use case.
// Imports (spark-shell already provides spark, sc and spark.implicits._)
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Sample data
val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012)).toDF("year","nb_product_sold","system_year")
// Offset of each row's year from system_year
val df1 = df.withColumn("Difference", $"year" - $"system_year")
// Total sold per year (window partitioned by year), keeping one row per year
val w = Window.partitionBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w)).withColumn("yearlist", lit(0)).dropDuplicates("year", "system_year", "Difference")
// Offsets list; must be in ascending order
val years = List(1, 3, 8)
// Tag each yearly total with the first offset that covers it, using the
// non-overlapping buckets (-1, 1], (1, 3], (3, 8], and union the results
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for ((lower, year) <- (-1 +: years.init).zip(years)) {
  val innerdf = df2.filter($"Difference" > lower && $"Difference" <= year).withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}
// Sum each bucket, then take the running total across buckets per system_year.
// Note: an offset whose bucket contains no rows produces no output row.
val w1 = Window.partitionBy("system_year").orderBy("yearlist")
val finaldf = df3.groupBy("system_year", "yearlist").agg(sum("runningsum").as("bucket_total")).withColumn("total_product_sold", sum("bucket_total").over(w1)).select("yearlist", "total_product_sold")
For the sample data, finaldf.show gives the totals expected in the question: 6, 9 and 13 for the offsets 1, 3 and 8.
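The idea behind this answer, summing non-overlapping ranges of years and then accumulating those sums, can also be sketched without Spark. This is only an illustration under stated assumptions: the buckets are taken to be (previous offset, current offset], with the first bucket starting at offset 0, the offsets are assumed sorted in ascending order, and the names BucketTotals, systemYear and totals are hypothetical.

```scala
// Plain Scala (no Spark): bucket sums followed by a running total.
object BucketTotals {
  // system_year is 2012 for every row of the sample data
  val systemYear = 2012
  // (year, nb_product_sold), copied from the question's sample data
  val rows: Seq[(Int, Int)] =
    Seq((2010, 1), (2012, 2), (2012, 4), (2015, 3), (2019, 4), (2021, 5))

  // offsets must be sorted in ascending order
  def totals(offsets: Seq[Int]): Seq[(Int, Int)] = {
    // Exclusive lower bound of each bucket; -1 so that offset 0 lands in the first one
    val lowers = -1 +: offsets.dropRight(1)
    // Sum of products sold inside each bucket (lo, hi]
    val bucketSums = lowers.zip(offsets).map { case (lo, hi) =>
      rows.collect {
        case (year, sold) if year - systemYear > lo && year - systemYear <= hi => sold
      }.sum
    }
    // Running total across buckets; scanLeft emits the seed 0 first, so drop it
    offsets.zip(bucketSums.scanLeft(0)(_ + _).tail)
  }

  def main(args: Array[String]): Unit =
    totals(Seq(1, 3, 8)).foreach { case (y, t) => println(s"$y -> $t") }
}
```

Each row is visited once per bucket rather than once per overlapping range. In this sketch an offset with an empty bucket still yields a row that repeats the previous running total.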