How to do a yearly comparison in Spark Scala
I have a DataFrame with columns such as Month and Qty, as shown in the table below.
| Month | Fruit | Qty |
| -------- | ------ | ------ |
| 2021-01 | orange | 5223 |
| 2021-02 | orange | 23 |
| ...... | ..... | ..... |
| 2022-01 | orange | 2342 |
| 2022-02 | orange | 37667 |
I need to sum Qty grouped by Fruit, with one total for the current year and one for the previous year. My output DataFrame should look like the table below.
| Fruit | sum_of_qty_This_year | sum_of_qty_previous_year |
| -------- | --------------------- | -------------------------- |
| orange | 29384 | 345345 |
But there is a catch here. Consider the table below.
| current year | jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| previous year | jan | feb | | apr | may | jun | jul | aug | | oct | nov | dec |
As you can see, the data for mar and sep is missing in the previous year. So when we calculate the sum of the current year's Qty, it should exclude those missing months.
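The rule can be pinned down without Spark: a month counts toward either total only if the same fruit has data for that calendar month in both years. Below is a minimal sketch in plain Scala collections. The sample rows (including the extra 2022-03 row with no 2021 counterpart) and the names `rows`, `byFruitMonth`, `matched` and `totals` are illustrative assumptions, not part of the question.

```scala
// Hypothetical sample rows: (month as "yyyy-MM", fruit, qty).
val rows = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000) // no 2021-03 row, so this month must be ignored
)

val currentYear = "2022"
val priorYear = "2021"

// Group rows by (fruit, calendar month), e.g. ("orange", "01").
val byFruitMonth = rows.groupBy { case (month, fruit, _) => (fruit, month.substring(5, 7)) }

// For each (fruit, month), total the quantity per year and keep the pair
// only when both the current and the prior year have data.
val matched: Seq[(String, Long, Long)] = byFruitMonth.toSeq.flatMap { case ((fruit, _), rs) =>
  val perYear = rs.groupBy(_._1.substring(0, 4)).map { case (year, xs) =>
    (year, xs.map(_._3.toLong).sum)
  }
  for {
    cur  <- perYear.get(currentYear)
    prev <- perYear.get(priorYear)
  } yield (fruit, cur, prev)
}

// Sum the surviving months per fruit: fruit -> (current year total, prior year total).
val totals: Map[String, (Long, Long)] = matched.groupBy(_._1).map { case (fruit, xs) =>
  (fruit, (xs.map(_._2).sum, xs.map(_._3).sum))
}

println(totals)
```

With these sample rows, orange ends up with 2342 + 37667 = 40009 for the current year and 5223 + 23 = 5246 for the previous year. The 50000 from 2022-03 is dropped because 2021-03 is absent.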
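// Approach: split the data by year, then inner-join the two halves on fruit and
// calendar month, so that only months present in both years survive.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum}

// A session must exist before `spark.implicits._` can be imported.
// In spark-shell `spark` is already defined and this line can be skipped.
val spark = SparkSession.builder().master("local[*]").appName("yearly-comparison").getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000) // no matching 2021-03 row, so it is excluded
).toDF("Month", "Fruit", "Qty")

val currentYear = 2022
val priorYear = 2021

// Rows of the current year: the first four characters of Month are the year.
val currentYearDF = df1
  .filter(col("Month").substr(1, 4) === currentYear)

// Rows of the prior year, with columns renamed to avoid ambiguity after the join.
val priorYearDF = df1
  .filter(col("Month").substr(1, 4) === priorYear)
  .withColumnRenamed("Month", "MonthP")
  .withColumnRenamed("Fruit", "FruitP")
  .withColumnRenamed("Qty", "QtyP")

// Inner join on fruit and on the two-digit month (characters 6 and 7 of Month).
val resDF = priorYearDF
  .join(
    currentYearDF,
    priorYearDF.col("FruitP") === currentYearDF.col("Fruit") &&
      priorYearDF.col("MonthP").substr(6, 2) === currentYearDF.col("Month").substr(6, 2)
  )
  .select(
    currentYearDF.col("Fruit").as("Fruit"),
    currentYearDF.col("Qty").as("CurrentYearQty"),
    priorYearDF.col("QtyP").as("PriorYearQty")
  )
  .groupBy("Fruit")
  .agg(
    sum("CurrentYearQty").as("sum_of_qty_This_year"),
    sum("PriorYearQty").as("sum_of_qty_previous_year")
  )

resDF.show(false)
// +------+--------------------+------------------------+
// |Fruit |sum_of_qty_This_year|sum_of_qty_previous_year|
// +------+--------------------+------------------------+
// |orange|40009               |5246                    |
// +------+--------------------+------------------------+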
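A possible alternative to the self-join is conditional aggregation: compute a per-year total for each (Fruit, calendar month) pair, keep only the pairs where both totals exist, and then sum per fruit. The sketch below is one way to write that, not the method from the answer. The intermediate column names `Year`, `MonthOfYear`, `cur` and `prev`, and the local `SparkSession` settings, are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, sum, when}

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("yearly-comparison-agg").getOrCreate()
import spark.implicits._

val df = Seq(
  ("2021-01", "orange", 5223),
  ("2021-02", "orange", 23),
  ("2022-01", "orange", 2342),
  ("2022-02", "orange", 37667),
  ("2022-03", "orange", 50000) // no 2021-03 row, so this month should be dropped
).toDF("Month", "Fruit", "Qty")

val currentYear = "2022"
val priorYear = "2021"

// Split "yyyy-MM" into a year column and a calendar-month column.
val withParts = df
  .withColumn("Year", substring(col("Month"), 1, 4))
  .withColumn("MonthOfYear", substring(col("Month"), 6, 2))

// One row per (Fruit, MonthOfYear) with a total per year.
// `when` without `otherwise` yields null, and `sum` over only nulls is null,
// so a null total means that year has no data for the month.
val perMonth = withParts
  .groupBy("Fruit", "MonthOfYear")
  .agg(
    sum(when(col("Year") === currentYear, col("Qty"))).as("cur"),
    sum(when(col("Year") === priorYear, col("Qty"))).as("prev")
  )

// Keep only months present in both years, then total per fruit.
val result = perMonth
  .filter(col("cur").isNotNull && col("prev").isNotNull)
  .groupBy("Fruit")
  .agg(
    sum("cur").as("sum_of_qty_This_year"),
    sum("prev").as("sum_of_qty_previous_year")
  )

result.show(false)
```

For this sample the result is a single row for orange with 40009 and 5246, the same as the join-based version. One difference in behaviour: if a (Fruit, Month) pair can appear in several rows, a join pairs every prior-year row with every current-year row and inflates the sums, while aggregating per month first does not.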