I have a dataframe with columns product id, name, and weight. I want to calculate the percentage of products that weigh between 10 and 20, and also between 50 and 60. The naive way I can think of is to count all the rows, count the rows with weight 10-20 (and likewise 50-60), and divide. Is there a better way to do this, perhaps using some built-in functions? Many thanks for your help.
id  name  weight
1   a     11
2   b     15
3   c     26
4   d     51
5   e     70
It sounds like you want conditional aggregation:
select avg(case when weight between 10 and 20 then 1.0 else 0 end) as ratio_10_20,
       avg(case when weight between 50 and 60 then 1.0 else 0 end) as ratio_50_60
from t;
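To see the conditional aggregation in action, here is a minimal, self-contained sketch using Python's built-in sqlite3 module, loaded with the sample data from the question (the table name t is an assumption carried over from the query above):

```python
import sqlite3

# In-memory database seeded with the question's sample rows
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, name text, weight real)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [(1, "a", 11), (2, "b", 15), (3, "c", 26), (4, "d", 51), (5, "e", 70)],
)

# avg() over a 1.0 / 0 case expression yields the fraction of matching rows
row = conn.execute(
    """
    select avg(case when weight between 10 and 20 then 1.0 else 0 end),
           avg(case when weight between 50 and 60 then 1.0 else 0 end)
    from t
    """
).fetchone()

print(row)  # (0.4, 0.2)
conn.close()
```

The trick is that averaging a 1.0/0 indicator is exactly count-of-matches divided by total count, done in a single pass.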
You can use F.avg
to get the fraction of rows where the column weight
falls within a given interval. Column.between returns a boolean column, and .cast('int')
converts it to 1 if the comparison is true, else 0. Its average is then the proportion (e.g. 0.4 = 40%) you wanted to calculate.
import pyspark.sql.functions as F
df2 = df.select(
F.avg(F.col('weight').between(10,20).cast('int')).alias('10_20'),
F.avg(F.col('weight').between(50,60).cast('int')).alias('50_60')
)
df2.show()
+-----+-----+
|10_20|50_60|
+-----+-----+
| 0.4| 0.2|
+-----+-----+
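If you are working in plain pandas rather than Spark, the same idea applies even more directly: Series.between returns a boolean Series, and the mean of a boolean Series is the fraction of True values. A sketch using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5],
     "name": ["a", "b", "c", "d", "e"],
     "weight": [11, 15, 26, 51, 70]}
)

# mean of a boolean Series == fraction of rows where the condition holds
ratio_10_20 = df["weight"].between(10, 20).mean()  # 0.4
ratio_50_60 = df["weight"].between(50, 60).mean()  # 0.2
```

Multiply by 100 if you want percentages rather than fractions.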