How to do a year-over-year comparison in Spark Scala

I have a DataFrame with the columns Month, Fruit and Qty, as shown in the table below.

| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |

I need to sum Qty grouped by Fruit, with one total for the current year and one for the previous year. My output DataFrame should look like the table below.

| Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| -------- | --------------------- | -------------------------- |
| orange | 29384 | 345345 |

But there is a catch. Consider the months for which each year has data:

current year jan feb mar apr may jun jul aug sep oct nov dec
previous year jan feb apr may jun jul aug oct nov dec

As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum of the current year's Qty, it should exclude those missing months, so that both totals cover the same set of months.
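As a minimal sketch of that rule outside Spark, the same arithmetic can be written with plain Scala collections: keep only the (fruit, month-of-year) pairs that appear in both years, then sum each year over those pairs. The sample rows and the helper names (`yearOf`, `monthOf`, `monthsIn`, `total`) are made up for illustration and are not part of the original question.

```scala
// Hypothetical sample rows: (month as "yyyy-MM", fruit, qty).
// 2021-03 is deliberately missing, so 2022-03 must not be counted.
val rows = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
)

def yearOf(month: String): String = month.take(4)  // "2022-03" -> "2022"
def monthOf(month: String): String = month.drop(5) // "2022-03" -> "03"

// (fruit, month-of-year) pairs that have data in the given year
def monthsIn(year: String): Set[(String, String)] =
  rows.collect { case (m, f, _) if yearOf(m) == year => (f, monthOf(m)) }.toSet

// Only pairs present in BOTH years take part in the comparison
val common = monthsIn("2022") intersect monthsIn("2021")

// Sum of qty for one fruit and year, restricted to the common months
def total(year: String, fruit: String): Int =
  rows.collect {
    case (m, f, q) if yearOf(m) == year && f == fruit && common((f, monthOf(m))) => q
  }.sum

val thisYear = total("2022", "orange")     // 2342 + 37667 = 40009
val previousYear = total("2021", "orange") // 5223 + 23 = 5246
```

The 50000 for 2022-03 is left out because 2021 has no March row to compare it with.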

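import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// Assumed local session for trying this out; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("yearly-comparison").getOrCreate()
import spark.implicits._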

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

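// Month is a "yyyy-MM" string, so the years are compared as strings
val currentYear = "2022"
val priorYear = "2021"
// Split the data by the year prefix of Month
val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)
// Rename the prior-year columns so they do not clash with the current-year ones in the join
val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")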
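// Inner join on fruit and month-of-year: a month missing from either year
// finds no match, so it drops out of both sums
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )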
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
//    +------+--------------------+------------------------+
//    |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
//    +------+--------------------+------------------------+
//    |orange|40009               |5246                    |
//    +------+--------------------+------------------------+
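For comparison, here is a sketch of a different technique that avoids the self-join: conditional aggregation. It first sums Qty per fruit and month-of-year into one column per year, keeps only the rows where both years have a value, and then sums per fruit. This is not the approach shown above, just an alternative under the same assumptions ("yyyy-MM" month strings, years hard-coded as 2022 and 2021); the local `master("local[*]")` session and the names `perMonth` and `result` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, sum, when}

// Assumed local session; in spark-shell, `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("yoy-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000)
).toDF("Month", "Fruit", "Qty")

// One row per (Fruit, month-of-year) with a separate total for each year.
// `when` without `otherwise` yields null, and `sum` over only nulls is null,
// so a month with no data in a year ends up with a null total for that year.
val perMonth = df
  .withColumn("Year", substring(col("Month"), 1, 4))
  .withColumn("MonthOfYear", substring(col("Month"), 6, 2))
  .groupBy("Fruit", "MonthOfYear")
  .agg(
    sum(when(col("Year") === "2022", col("Qty"))).as("cur"),
    sum(when(col("Year") === "2021", col("Qty"))).as("prev")
  )

// Keep only months present in both years, then total per fruit
val result = perMonth
  .filter(col("cur").isNotNull && col("prev").isNotNull)
  .groupBy("Fruit")
  .agg(
    sum("cur").as("sum_of_qty_This_year"),
    sum("prev").as("sum_of_qty_previous_year")
  )

result.show(false)
```

Because the data is scanned once and grouped, rather than split in two and joined, this version also stays correct if a (Fruit, Month) pair occurs on more than one row, whereas a join would pair every duplicate with every match.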
