
PySpark - how to calculate percentage

I have a dataframe with product id, name, and weight. I want to calculate the percentage of products whose weight is between 10-20, and also between 50-60. I can think of a naive way: count all the rows, count the rows whose weight is between 10-20 (and likewise 50-60), and divide. What would be a better way to do this? Can we use some built-in functions? Many thanks for your help.

id  name  weight
 1   a      11
 2   b      15
 3   c      26
 4   d      51
 5   e      70
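
For reference, here is a minimal sketch of the naive approach described above, assuming a local SparkSession named spark; the sample data mirrors the table:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data from the table above
df = spark.createDataFrame(
    [(1, 'a', 11), (2, 'b', 15), (3, 'c', 26), (4, 'd', 51), (5, 'e', 70)],
    ['id', 'name', 'weight'],
)

# naive approach: count the rows in each range and divide by the total row count
total = df.count()
ratio_10_20 = df.filter(df.weight.between(10, 20)).count() / total
ratio_50_60 = df.filter(df.weight.between(50, 60)).count() / total
print(ratio_10_20, ratio_50_60)  # 0.4 0.2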

It sounds like you want conditional aggregation:

select avg(case when weight between 10 and 20 then 1.0 else 0 end) as ratio_10_20,
       avg(case when weight between 50 and 60 then 1.0 else 0 end) as ratio_50_60
from t;
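
In PySpark you can run this SQL directly with spark.sql, assuming the DataFrame has been registered as a temporary view (the view name t here is just an example):

df.createOrReplaceTempView('t')

spark.sql("""
    select avg(case when weight between 10 and 20 then 1.0 else 0 end) as ratio_10_20,
           avg(case when weight between 50 and 60 then 1.0 else 0 end) as ratio_50_60
    from t
""").show()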
   

You can use F.avg to get the percentage of rows where the weight column falls within a given interval. .cast('int') turns the boolean comparison into 1 if it is true, else 0, so its average is exactly the percentage you want to calculate.

import pyspark.sql.functions as F

df2 = df.select(
    # between() yields a boolean; cast('int') maps it to 1/0, so avg() gives the fraction in the range
    F.avg(F.col('weight').between(10, 20).cast('int')).alias('10_20'),
    F.avg(F.col('weight').between(50, 60).cast('int')).alias('50_60')
)

df2.show()
+-----+-----+
|10_20|50_60|
+-----+-----+
|  0.4|  0.2|
+-----+-----+
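
As a sketch of an equivalent alternative, you could also mirror the SQL CASE expression in the DataFrame API with F.when instead of the cast; both forms average a 0/1 column:

df.select(
    F.avg(F.when(F.col('weight').between(10, 20), 1.0).otherwise(0)).alias('10_20'),
    F.avg(F.when(F.col('weight').between(50, 60), 1.0).otherwise(0)).alias('50_60'),
).show()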
