Taking a sum in Spark Scala based on a condition

I have a data frame like this. How can I take the sum of the Sales column where Rank is greater than 3, per value of M?

+---+-----+----+
|  M|Sales|Rank|
+---+-----+----+
| M1|  200|   1|
| M1|  175|   2|
| M1|  150|   3|
| M1|  125|   4|
| M1|   90|   5|
| M1|   85|   6|
| M2| 1001|   1|
| M2|  500|   2|
| M2|  456|   3|
| M2|  345|   4|
| M2|  231|   5|
| M2|  123|   6|
+---+-----+----+

Expected output (for M1 this is 125 + 90 + 85 = 300; for M2, 345 + 231 + 123 = 699):

+---+-----+----+---------------+
|  M|Sales|Rank|SumGreaterThan3|
+---+-----+----+---------------+
| M1|  200|   1|            300|
| M1|  175|   2|            300|
| M1|  150|   3|            300|
| M1|  125|   4|            300|
| M1|   90|   5|            300|
| M1|   85|   6|            300|
| M2| 1001|   1|            699|
| M2|  500|   2|            699|
| M2|  456|   3|            699|
| M2|  345|   4|            699|
| M2|  231|   5|            699|
| M2|  123|   6|            699|
+---+-----+----+---------------+

I have tried a sum over a window partitioned by M, like this:

df.withColumn("SumGreaterThan3", sum("Sales").over(Window.partitionBy(col("M")))) // But this gives the total Sales per M, not just the rows with Rank > 3.

To replicate the same DF:

val df = Seq(
("M1",200,1),
("M1",175,2),
("M1",150,3),
("M1",125,4),
("M1",90,5),
("M1",85,6),
("M2",1001,1),
("M2",500,2),
("M2",456,3),
("M2",345,4),
("M2",231,5),
("M2",123,6)
).toDF("M","Sales","Rank")
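
For reference, running the unconditioned window sum on this frame shows the over-count: it returns the full per-group total rather than the Rank > 3 subtotal (the column name Total below is just for illustration):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// The window covers every row in the partition, so the sum is the
// whole group's Sales total, not the Rank > 3 subtotal.
df.withColumn("Total", sum("Sales").over(Window.partitionBy("M")))
  .select("M", "Total").distinct().show()

// +---+-----+
// |  M|Total|
// +---+-----+
// | M1|  825|
// | M2| 2656|
// +---+-----+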

The partition alone is enough to define the window function. You also need a conditional summation, combining sum with when.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("M")
df.withColumn("SumGreaterThan3", sum(when('Rank > 3, 'Sales).otherwise(0)).over(w)).show

This will give you the expected result.
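
For comparison, here is a minimal alternative sketch that avoids a window function: filter to Rank > 3, aggregate per M, and join the per-group sums back onto the original frame (the name sums is just for illustration, and SumGreaterThan3 mirrors the expected output):

import org.apache.spark.sql.functions._

// Per-group sums over only the rows with Rank > 3.
val sums = df.filter(col("Rank") > 3)
  .groupBy("M")
  .agg(sum("Sales").as("SumGreaterThan3"))

// A left join keeps every original row; a group with no Rank > 3 rows
// would get null here, which na.fill(0) turns back into 0.
df.join(sums, Seq("M"), "left").na.fill(0, Seq("SumGreaterThan3")).show()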
