
Spark: aggregate values by a list of given years

I'm new to Scala. Say I have a dataset:

>>> ds.show()
+----+---------------+-----------+
|year|nb_product_sold|system_year|
+----+---------------+-----------+
|2010|              1|       2012|
|2012|              2|       2012|
|2012|              4|       2012|
|2015|              3|       2012|
|2019|              4|       2012|
|2021|              5|       2012|
+----+---------------+-----------+

I have a List<Integer> years = {1, 3, 8}, where each value x means "x years after system_year". The goal is to compute the total number of products sold up to each of those years after system_year.

In other words, I have to compute the total products sold through 2013, 2015, and 2020.
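
A quick sanity check of that mapping (a minimal Scala sketch; system_year = 2012 is taken from the dataset above):

val systemYear = 2012
val years = List(1, 3, 8)
years.map(systemYear + _)  // List(2013, 2015, 2020)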

The output dataset should look like this:

+------+--------------------+
| year | total_product_sold |
+------+--------------------+
| 1    | 6                  |  -> 2012-2013: 6 products sold
| 3    | 9                  |  -> 2012-2015: 9 products sold
| 8    | 13                 |  -> 2012-2020: 13 products sold
+------+--------------------+

I'd like to know how to do this in Scala. Should I use groupBy() in this case?

If the year ranges did not overlap, you could use a single groupBy with case/when (a sketch of that variant follows after the output below). But here, you need one groupBy per year and then a union of the 3 grouped dataframes:

import spark.implicits._
import org.apache.spark.sql.functions._

val years = List(1, 3, 8)

// One aggregation per offset (filter the matching year range, sum the sales),
// then union the per-offset results into a single dataframe
val result = years.map { y =>
    df.filter($"year".between($"system_year", $"system_year" + y))
      .groupBy(lit(y).as("year"))
      .agg(sum($"nb_product_sold").as("total_product_sold"))
  }.reduce(_ union _)

result.show
//+----+------------------+
//|year|total_product_sold|
//+----+------------------+
//|   1|                 6|
//|   3|                 9|
//|   8|                13|
//+----+------------------+
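
For reference, here is a minimal sketch of the case/when variant mentioned above. The buckets (0-1, 2-3, 4-8 years after system_year) are a hypothetical non-overlapping split; this produces per-bucket sums rather than the cumulative totals asked for here, which is exactly why the union approach is needed once the ranges overlap:

// Sketch only: valid when each row falls into at most ONE bucket
val diff = $"year" - $"system_year"
val bucket = when(diff.between(0, 1), 1)
  .when(diff.between(2, 3), 3)
  .when(diff.between(4, 8), 8)   // rows outside all buckets become null

df.withColumn("offset", bucket)
  .where($"offset".isNotNull)
  .groupBy($"offset")
  .agg(sum($"nb_product_sold").as("total_product_sold"))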

There may be several ways of doing this, some more efficient than what I'm showing you, but this one works for your use case:

// Sample data
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq((2010,1,2012),(2012,2,2012),(2012,4,2012),(2015,3,2012),(2019,4,2012),(2021,5,2012))
  .toDF("year","nb_product_sold","system_year")

// Difference of each year from the system year
val df1 = df.withColumn("Difference", $"year" - $"system_year")

// Per-year subtotal: the window is partitioned by year, so the sum collapses
// duplicate years (e.g. the two 2012 rows); then de-duplicate to one row per year
val w = Window.partitionBy("year").orderBy("year")
val df2 = df1.withColumn("runningsum", sum("nb_product_sold").over(w))
  .withColumn("yearlist", lit(0))
  .dropDuplicates("year", "system_year", "Difference")

// Offsets to report on
val years = List(1, 3, 8)

// For each offset, keep the rows whose Difference falls into its bucket,
// tag them with the offset, and union everything into one dataframe
var df3 = spark.createDataFrame(sc.emptyRDD[Row], df2.schema)
for (year <- years) {
  val innerdf = df2.filter($"Difference" >= year - 1 && $"Difference" <= year)
    .withColumn("yearlist", lit(year))
  df3 = df3.union(innerdf)
}

// A second window, partitioned by system_year and ordered by year, turns the
// per-bucket subtotals into cumulative totals as required
val w1 = Window.partitionBy("system_year").orderBy("year")
val finaldf = df3.withColumn("total_product_sold", sum("runningsum").over(w1))
  .select("yearlist", "total_product_sold")
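
The design is two-pass: the first window collapses duplicate years into per-year subtotals, and the second window (ordered by year within each system_year) accumulates those subtotals into cumulative totals. A quick check, using the column names selected above:

finaldf.orderBy("yearlist").show()
// Per the expected output in the question: yearlist 1 -> 6, 3 -> 9, 8 -> 13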

The output (shown as a screenshot in the original answer) matches the expected result above: offsets 1, 3 and 8 give totals 6, 9 and 13.
