Spark: Summing values by a given list of years

Question

I am new to Scala. Say I have a dataset:

>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+

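I have a List<Integer> years = {1, 3, 8}, where each value x means "x years after system_year". The goal is to compute, for each x, the total number of products sold from system_year through system_year + x.

In other words, I have to compute the total products sold up to 2013, 2015, and 2020.

The output dataset should look like this: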

+-------+-----------------------+
|  year |    total_product_sold |
+-------+-----------------------+
| 1     |     6                 | -> 2012 - 2013 6 products sold
| 3     |     9                 | -> 2012 - 2015 9 products sold
| 8     |     13                | -> 2012 - 2020 13 products sold
+-------+-----------------------+

How can I do this in Scala? Should I use groupBy() in this case?

Answer 1

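If the year ranges did not overlap, you could use a single groupBy with case/when. Here they do overlap (every range starts at system_year), so you need one groupBy per entry in years and then a union of the 3 grouped DataFrames: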

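import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark, as in spark-shell

val years = List(1, 3, 8)

// df is the dataset from the question (shown there as ds)
val result = years.map{ y =>
    // keep rows with system_year <= year <= system_year + y, then sum them under the label y
    df.filter($"year".between($"system_year", $"system_year" + y))
      .groupBy(lit(y).as("year"))
      .agg(sum($"nb_product_sold").as("total_product_sold"))
  }.reduce(_ union _) // stack the one-row results into a single DataFrame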

result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
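
If the list of years grows, one filter and groupBy per entry can become a lot of separate jobs. As a possible alternative, here is a minimal sketch that turns the list into a small DataFrame, cross joins it with the data, and aggregates once. The object name OffsetTotals, the helper totalsByOffset, and the intermediate column name offset are made up for this illustration; the local SparkSession settings are assumptions too.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col, sum}

object OffsetTotals {
  // Hypothetical helper: one output row per offset, computed in a single aggregation.
  def totalsByOffset(spark: SparkSession, df: DataFrame, offsets: Seq[Int]): DataFrame = {
    import spark.implicits._
    // Small one-column DataFrame holding the offsets (1, 3, 8 in the question).
    val offsetsDf = offsets.toDF("offset")

    df.crossJoin(broadcast(offsetsDf))
      // Keep rows with system_year <= year <= system_year + offset.
      .filter(col("year").between(col("system_year"), col("system_year") + col("offset")))
      .groupBy("offset")
      .agg(sum(col("nb_product_sold")).as("total_product_sold"))
      .orderBy("offset")
      .withColumnRenamed("offset", "year")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch outside spark-shell.
    val spark = SparkSession.builder().master("local[*]").appName("offset-totals").getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq((2010, 1, 2012), (2012, 2, 2012), (2012, 4, 2012),
                 (2015, 3, 2012), (2019, 4, 2012), (2021, 5, 2012))
      .toDF("year", "nb_product_sold", "system_year")

    totalsByOffset(spark, df, Seq(1, 3, 8)).show()
    spark.stop()
  }
}
```

As with the union version, an entry whose range contains no rows produces no output row. The cross join multiplies the row count by the length of the list, so this only makes sense while the list stays small.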

Answer 2

There are probably several ways to do this, some of them more efficient than the one shown here, but this works for your use case.

//Sample Data
val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012)).toDF("year","nb_product_sold","system_year")
//taking the difference of the years from system year
val df1 = df.withColumn("Difference",$"year" - $"system_year")
//getting the total sold per year: rows of the same year share one window partition, so the windowed sum is that year's total
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("year").orderBy("year") 
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w)).withColumn("yearlist",lit(0)).dropDuplicates("year","system_year","Difference")
//creating Years list 
val years = List(1, 3, 8)
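//creating one dataframe per entry in years and taking the union of all of them.
//each entry covers the differences above the previous entry, up to and including itself,
//so no year is skipped or counted twice (years must be sorted ascending)
import org.apache.spark.sql.Row
val lowerBounds = -1 +: years.init
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for ((lower, year) <- lowerBounds.zip(years)){
  val innerdf = df2.filter($"Difference" > lower && $"Difference" <= year).withColumn("yearlist",lit(year))
  df3 = df3.union(innerdf)
}
//again doing partition by system year and accumulating the per-year totals in yearlist order;
//rows with the same yearlist are peers in the window, so distinct leaves one row per entry
val w1 = Window.partitionBy("system_year").orderBy("yearlist")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1)).select("yearlist","total_product_sold").distinct()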

For the sample data, finaldf has one row per entry in years: yearlist alongside its total_product_sold.
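
The idea behind this answer (split the rows into non-overlapping buckets, then accumulate the bucket totals) can also be written with when/otherwise and a single groupBy. The following is only a sketch under assumptions: the object name BucketedTotals, the helper bucketedTotals, and the intermediate columns bucket and bucket_total are made up for illustration, and the data is assumed to have a single system_year as in the sample.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum, when}

object BucketedTotals {
  // Hypothetical helper: bucket each row once, then take a cumulative sum over the buckets.
  def bucketedTotals(df: DataFrame, offsets: Seq[Int]): DataFrame = {
    val diff = col("year") - col("system_year")

    // Smallest offset whose range [0, offset] contains the row's difference.
    // Folding from the largest offset down lets smaller offsets take precedence;
    // rows outside every range end up with a null bucket.
    val bucket: Column = offsets.sorted.reverse.foldLeft(lit(null).cast("int")) { (acc, o) =>
      when(diff >= 0 && diff <= o, o).otherwise(acc)
    }

    // One total per non-overlapping bucket.
    val perBucket = df.withColumn("bucket", bucket)
      .filter(col("bucket").isNotNull)
      .groupBy("system_year", "bucket")
      .agg(sum("nb_product_sold").as("bucket_total"))

    // An ordered window's default frame runs from the start of the partition to the
    // current row, which turns the bucket totals into cumulative totals.
    val w = Window.partitionBy("system_year").orderBy("bucket")
    perBucket
      .withColumn("total_product_sold", sum("bucket_total").over(w))
      .select(col("bucket").as("year"), col("total_product_sold"))
      .orderBy("year")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for trying the sketch outside spark-shell.
    val spark = SparkSession.builder().master("local[*]").appName("bucketed-totals").getOrCreate()
    import spark.implicits._

    // Sample data from the question.
    val df = Seq((2010, 1, 2012), (2012, 2, 2012), (2012, 4, 2012),
                 (2015, 3, 2012), (2019, 4, 2012), (2021, 5, 2012))
      .toDF("year", "nb_product_sold", "system_year")

    bucketedTotals(df, Seq(1, 3, 8)).show()
    spark.stop()
  }
}
```

One caveat: an entry whose own bucket is empty (no sales between the previous entry and it) does not appear in the result, even though its cumulative total would equal the previous entry's. If every entry must appear, the per-entry filter from Answer 1 is the simpler choice.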

Answer 1
1 Accepted 2021-03-11 09:58:54

Answer 2
0 2021-03-11 06:40:08
