How to do a yearly comparison in Spark Scala

I have a DataFrame with columns such as Month and Qty, as shown in the table below.

| Month   | Fruit  | Qty   |
| ------- | ------ | ----- |
| 2021-01 | orange | 5223  |
| 2021-02 | orange | 23    |
| ...     | ...    | ...   |
| 2022-01 | orange | 2342  |
| 2022-02 | orange | 37667 |

I need to sum Qty grouped by Fruit, with one total for the current year and one for the previous year. My output DataFrame should look like the table below.

| Fruit  | sum_of_qty_This_year | sum_of_qty_previous_year |
| ------ | -------------------- | ------------------------ |
| orange | 29384                | 345345                   |

But there is a catch. Consider the table below, which lists the months that have data in each year.

| Year          | Months with data                                           |
| ------------- | ---------------------------------------------------------- |
| current year  | jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec |
| previous year | jan, feb, apr, may, jun, jul, aug, oct, nov, dec           |

As you can see, the data for mar and sep is missing in the previous year, so the current-year sum of Qty should exclude those months as well.
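The exclusion rule can be sketched on plain Scala collections, without Spark, to make the intended arithmetic explicit. This is only an illustration under assumptions: the `Sale` case class, the `YearlyComparisonSketch` object, and the `comparableTotals` function are hypothetical names that do not appear in the question, and months are assumed to be strings of the form `yyyy-MM`.

```scala
object YearlyComparisonSketch {
  // Hypothetical record type: month as "yyyy-MM", fruit name, quantity.
  final case class Sale(month: String, fruit: String, qty: Int)

  /** Per fruit: (current-year total, prior-year total), counting only
    * the months that have data in BOTH years. */
  def comparableTotals(
      sales: Seq[Sale],
      currentYear: String,
      priorYear: String
  ): Map[String, (Long, Long)] =
    sales.groupBy(_.fruit).toSeq.flatMap { case (fruit, rows) =>
      // Quantity per month number ("01".."12") for one year of this fruit.
      def byMonth(year: String): Map[String, Long] =
        rows
          .filter(_.month.startsWith(year + "-"))
          .groupBy(_.month.drop(5))
          .map { case (m, rs) => (m, rs.map(_.qty.toLong).sum) }

      val current = byMonth(currentYear)
      val prior = byMonth(priorYear)
      // Only months present in both years are comparable.
      val common = current.keySet.intersect(prior.keySet).toSeq

      if (common.isEmpty) Seq.empty[(String, (Long, Long))]
      else Seq((fruit, (common.map(current).sum, common.map(prior).sum)))
    }.toMap
}
```

For example, with orange quantities of 5223 and 23 for 2021-01 and 2021-02, and 2342, 37667, and 50000 for 2022-01 through 2022-03, the March value has no counterpart in 2021 and is dropped, so `comparableTotals(sales, "2022", "2021")` gives 40009 for the current year and 5246 for the previous year.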

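import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Local session for this example; in spark-shell `spark` already exists and this line can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("yearly-comparison").getOrCreate()
import spark.implicits._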

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

val currentYear = 2022
val priorYear = 2021
val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)
val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")

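// Inner join on fruit and month number (characters 6-7 of "yyyy-MM").
// A month missing from either year finds no match, so it drops out of both sums.
// This relies on at most one row per fruit per month; duplicate rows would be multiplied by the join.
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )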
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
//    +------+--------------------+------------------------+
//    |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
//    +------+--------------------+------------------------+
//    |orange|40009               |5246                    |
//    +------+--------------------+------------------------+
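If the input can contain more than one row per fruit and month, a join-free variant may be safer. The following is a minimal sketch under assumptions, not part of the original answer: it first totals each fruit and month number per year with conditional aggregation, keeps only the months that have a value in both years, and then sums per fruit. The names `YearlyComparisonAgg` and `compareYears`, and the helper columns `Year` and `Mon`, are hypothetical; the years are passed as strings so they compare directly with the `yyyy` prefix of `Month`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring, sum, when}

object YearlyComparisonAgg {
  // Expects columns Month ("yyyy-MM"), Fruit and Qty, as in df1 above.
  def compareYears(df: DataFrame, currentYear: String, priorYear: String): DataFrame =
    df
      .withColumn("Year", substring(col("Month"), 1, 4))
      .withColumn("Mon", substring(col("Month"), 6, 2))
      // One row per fruit and month number; a side with no rows sums to null.
      .groupBy("Fruit", "Mon")
      .agg(
        sum(when(col("Year") === currentYear, col("Qty"))).as("CurrentYearQty"),
        sum(when(col("Year") === priorYear, col("Qty"))).as("PriorYearQty")
      )
      // Keep only the months that have data in both years.
      .filter(col("CurrentYearQty").isNotNull && col("PriorYearQty").isNotNull)
      .groupBy("Fruit")
      .agg(
        sum("CurrentYearQty").as("sum_of_qty_This_year"),
        sum("PriorYearQty").as("sum_of_qty_previous_year")
      )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("yearly-comparison-agg")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(
      ("2021-01", "orange", 5223),
      ("2021-02", "orange", 23),
      ("2022-01", "orange", 2342),
      ("2022-02", "orange", 37667),
      ("2022-03", "orange", 50000)
    ).toDF("Month", "Fruit", "Qty")

    compareYears(df1, "2022", "2021").show(false)
    spark.stop()
  }
}
```

On the sample rows this gives the same totals as the join-based version (40009 and 5246), because 2022-03 has no 2021 counterpart and is filtered out before the final sum.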
