
Spark: aggregate values by a list of given years

I'm new to Scala. Say I have a dataset:

>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+

I have a List<Integer> years = {1, 3, 8}, where each value x means "x years after system_year". The goal is to compute the total number of products sold from system_year up to each of those years.

In other words, I have to compute the total products sold up to 2013, 2015 and 2020.
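A quick plain-Scala illustration of that mapping, assuming system_year = 2012 from the sample (variable names here are only for illustration):

val systemYear = 2012
val offsets = List(1, 3, 8)
val targetYears = offsets.map(systemYear + _) // List(2013, 2015, 2020)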

The output dataset should look like this:

+-------+-----------------------+
|  year |    total_product_sold |
+-------+-----------------------+
| 1     |     6                 | -> 2012 - 2013 6 products sold
| 3     |     9                 | -> 2012 - 2015 9 products sold
| 8     |     13                | -> 2012 - 2020 13 products sold
+-------+-----------------------+

I'm wondering how to do this in Scala. Should I use groupBy() in this case?

If the year ranges did not overlap, you could use a single groupBy with case/when. But here, you need one groupBy per year and then a union of the 3 grouped dataframes:

import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._ // for the $"column" syntax

val years = List(1, 3, 8)

// One aggregation per offset, then union the per-offset results
val result = years.map { y =>
    df.filter($"year".between($"system_year", $"system_year" + y))
      .groupBy(lit(y).as("year"))
      .agg(sum($"nb_product_sold").as("total_product_sold"))
  }.reduce(_ union _)

result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
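A minimal sketch of that case/when variant, assuming the same df and a spark-shell session; note that it assigns each row to a single bucket, so the totals are per-bucket rather than the cumulative ranges the question asks for:

import org.apache.spark.sql.functions.{sum, when}

// Non-overlapping buckets: each row lands in its first matching bucket
df.filter($"year" >= $"system_year")            // drop years before system_year
  .withColumn("bucket",
    when($"year" <= $"system_year" + 1, 1)
      .when($"year" <= $"system_year" + 3, 3)
      .when($"year" <= $"system_year" + 8, 8))  // years beyond +8 become null
  .filter($"bucket".isNotNull)
  .groupBy("bucket")
  .agg(sum($"nb_product_sold").as("nb_in_bucket"))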

There may be multiple ways to do this, some more efficient than what I'm showing you, but this one works for your use case.

// Sample data (spark-shell session, so spark, sc and spark.implicits._ are in scope)
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012))
  .toDF("year","nb_product_sold","system_year")

// Difference of each year from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

// Sum rows that share the same year (e.g. the two 2012 rows), then keep one row per year
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year","system_year","Difference")

// Offsets to report on
val years = List(1, 3, 8)

// For each offset, tag the matching rows with that offset and union everything together
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year)
    .withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

// A second window, partitioned by system_year, accumulates the totals across offsets
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1))
  .select("yearlist","total_product_sold")
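Note the trick in the last step: because w1 is an ordered window, sum("runningsum").over(w1) computes a cumulative sum, so the per-offset bucket totals from the sample data (6, 3 and 4) accumulate into the running totals 6, 9 and 13 that the question expects.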

You can see the output below:

[screenshot of the resulting dataframe: yearlist with its total_product_sold]
