
How to calculate per-year sums in Spark Scala

I have a DataFrame with the columns Month, Fruit, and Qty, as shown in the table below:


| Month    | Fruit  |  Qty   |
| -------- | ------ | ------ |
| 2021-01  | orange | 5223   |
| 2021-02  | orange | 23     |
| ......   | .....  | .....  |
| 2022-01  | orange | 2342   |
| 2022-02  | orange | 37667  |
 

I need to sum Qty grouped by Fruit for each year. The output DataFrame should look like the table below:

| Year | Fruit    | sum_of_qty_This_year  |  sum_of_qty_previous_year  |
| ---- | -------- | --------------------- | -------------------------- |
| 2022 | orange   |         29384         |             34534          |
| 2021 | orange   |         34534         |             93584          |


But there is a catch. Consider the table below:

| current year  | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
| ------------- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| previous year | jan | feb |     | apr | may | jun | jul | aug |     | oct | nov | dec |

As you can see, the data for Mar and Sep is missing in the previous year. So when we calculate the sum for the current year, Qty for the months that are missing in the previous year should be excluded. This should be done for each year.

Here is my draft. It needs improvement, but it should convey the general idea:

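import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark` is in scope

// Load the input; expected columns are month ("yyyy-MM"), fruit and Qty
val dataDF = spark
  .read
  .option("multiline", true)
  .json("fruit.json")
  .sort("month")

dataDF.show(100)

// Split "yyyy-MM" into separate year and month columns
val dataDFYearMonth = dataDF
  .withColumn("year", substring($"month", 1, 4))
  .withColumn("month", substring($"month", 6, 2))

// Partition by fruit so that lag/lead never mix rows of different fruits
val windowSpecByFruitOrderByYear = Window.partitionBy("fruit").orderBy("year")

// For each (year, fruit): the months present, plus the months present in the preceding year
val previousYearMonthsDF = dataDFYearMonth
  .groupBy($"year", $"fruit")
  .agg(collect_set($"month").as("months"))
  .withColumn("prev_months", lag("months", 1).over(windowSpecByFruitOrderByYear))

previousYearMonthsDF.show(10, false)

// Keep only rows whose month also exists in the preceding year.
// The first year has prev_months = null, so all of its rows are dropped here.
val dataDFPrevYearMonths =
  dataDFYearMonth
    .join(previousYearMonthsDF, Seq("year", "fruit"))
    .where(expr("exists(prev_months, x -> x == month)"))

dataDFPrevYearMonths.show(10, false)

// Full total of each year, re-labelled with the following year,
// because it is the "previous year" total of that following year
val sumDF =
  dataDFYearMonth
    .groupBy("year", "fruit")
    .agg(sum("Qty").as("sum_of_qty_previous_year"))
    .withColumn("year_join", lead("year", 1).over(windowSpecByFruitOrderByYear))
    .select($"year_join".as("year"), $"fruit", $"sum_of_qty_previous_year")

sumDF.show()

// Current-year total, restricted to the months that exist in the preceding year
val sumDFPrevYearMonths =
  dataDFPrevYearMonths
    .groupBy("year", "fruit")
    .agg(sum("Qty").as("sum_of_qty_This_year"))

sumDFPrevYearMonths.show()

// Inner join: the last year's total has no following year (year is null) and is dropped
val joinDF = sumDFPrevYearMonths.join(sumDF, Seq("year", "fruit"))

joinDF
  .select("year", "fruit", "sum_of_qty_This_year", "sum_of_qty_previous_year")
  .orderBy(desc("year"))
  .show()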
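
To sanity-check the rule without a Spark cluster, the same idea can be sketched on plain Scala collections. This is only a minimal illustration, not the Spark solution itself: the names `FruitQty`, `YearSum`, and `yearlySums` are hypothetical, and the sketch assumes that "previous year" means the calendar year before (year - 1), whereas the draft above uses the preceding year that is present in the data.

```scala
// Illustrative types only; they are not part of the Spark draft above.
final case class FruitQty(month: String, fruit: String, qty: Long) // month is "yyyy-MM"
final case class YearSum(year: Int, fruit: String, thisYear: Long, previousYear: Long)

object YearlySums {
  def yearlySums(rows: Seq[FruitQty]): Seq[YearSum] = {
    // (fruit, year) -> Map(month "MM" -> total qty of that month)
    val byFruitYear: Map[(String, Int), Map[String, Long]] =
      rows
        .groupBy(r => (r.fruit, r.month.take(4).toInt))
        .map { case (key, rs) =>
          key -> rs.groupBy(_.month.drop(5)).map { case (m, ms) => m -> ms.map(_.qty).sum }
        }

    byFruitYear.toSeq.flatMap { case ((fruit, year), months) =>
      // A row is produced only when the same fruit has data for year - 1
      byFruitYear.get((fruit, year - 1)).map { prevMonths =>
        // Current-year total counts only months that also exist in the previous year
        val thisYear = months.collect { case (m, q) if prevMonths.contains(m) => q }.sum
        YearSum(year, fruit, thisYear, prevMonths.values.sum)
      }
    }.sortBy(s => (s.fruit, -s.year))
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      FruitQty("2021-01", "orange", 10),
      FruitQty("2021-02", "orange", 20),
      FruitQty("2022-01", "orange", 1),
      FruitQty("2022-02", "orange", 2),
      FruitQty("2022-03", "orange", 4) // no 2021-03, so excluded from the 2022 total
    )
    // 2022: this year = 1 + 2 = 3, previous year = 10 + 20 = 30; 2021 has no previous year
    println(yearlySums(rows))
  }
}
```

Comparing this output with the Spark result on a small sample of the real data is a cheap way to confirm that the month filter behaves as intended.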
