
Spark: aggregate values by a list of given years

I'm new to Scala. Say I have a dataset:

>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+

I have a List<Integer> years = {1, 3, 8}, where each value x means "x years after system_year". The goal is to compute the total number of products sold from system_year up to each of those years.

In other words, I have to compute the total products sold up to 2013, 2015 and 2020.
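A quick plain-Scala illustration of that mapping, assuming system_year = 2012 from the sample (variable names here are only for illustration):

val systemYear = 2012
val offsets = List(1, 3, 8)
val targetYears = offsets.map(systemYear + _) // List(2013, 2015, 2020)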

The output dataset should look like this:

+-------+-----------------------+
|  year |    total_product_sold |
+-------+-----------------------+
| 1     |     6                 | -> 2012 - 2013 6 products sold
| 3     |     9                 | -> 2012 - 2015 9 products sold
| 8     |     13                | -> 2012 - 2020 13 products sold
+-------+-----------------------+

I'm wondering how to do this in Scala. Should I use groupBy() in this case?

If the year ranges did not overlap, you could use a single groupBy with case/when. But here, you need one groupBy per year and then a union of the 3 grouped dataframes:

import org.apache.spark.sql.functions.{lit, sum}
import spark.implicits._ // for the $"column" syntax

val years = List(1, 3, 8)

// One aggregation per offset, then union the per-offset results
val result = years.map { y =>
    df.filter($"year".between($"system_year", $"system_year" + y))
      .groupBy(lit(y).as("year"))
      .agg(sum($"nb_product_sold").as("total_product_sold"))
  }.reduce(_ union _)

result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
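A minimal sketch of that case/when variant, assuming the same df and a spark-shell session; note that it assigns each row to a single bucket, so the totals are per-bucket rather than the cumulative ranges the question asks for:

import org.apache.spark.sql.functions.{sum, when}

// Non-overlapping buckets: each row lands in its first matching bucket
df.filter($"year" >= $"system_year")            // drop years before system_year
  .withColumn("bucket",
    when($"year" <= $"system_year" + 1, 1)
      .when($"year" <= $"system_year" + 3, 3)
      .when($"year" <= $"system_year" + 8, 8))  // years beyond +8 become null
  .filter($"bucket".isNotNull)
  .groupBy("bucket")
  .agg(sum($"nb_product_sold").as("nb_in_bucket"))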

There may be multiple ways to do this, some more efficient than what I'm showing you, but this one works for your use case.

// Sample data (spark-shell session, so spark, sc and spark.implicits._ are in scope)
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012))
  .toDF("year","nb_product_sold","system_year")

// Difference of each year from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

// Sum rows that share the same year (e.g. the two 2012 rows), then keep one row per year
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year","system_year","Difference")

// Offsets to report on
val years = List(1, 3, 8)

// For each offset, tag the matching rows with that offset and union everything together
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year)
    .withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

// A second window, partitioned by system_year, accumulates the totals across offsets
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1))
  .select("yearlist","total_product_sold")
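Note the trick in the last step: because w1 is an ordered window, sum("runningsum").over(w1) computes a cumulative sum, so the per-offset bucket totals from the sample data (6, 3 and 4) accumulate into the running totals 6, 9 and 13 that the question expects.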

You can see the output below:

[screenshot of the resulting dataframe: yearlist with its total_product_sold]
