I have a dataframe with columns product id, name, and weight. I want to calculate the percentage of products that weigh between 10 and 20, and also between 50 and 60. The naive way I can think of is to count all the rows, count the rows with weight 10-20 (and likewise 50-60), and divide. Is there a better way to do this, perhaps using some built-in functions? Many thanks for your help.
id  name  weight
1   a     11
2   b     15
3   c     26
4   d     51
5   e     70
It sounds like you want conditional aggregation:
select avg(case when weight between 10 and 20 then 1.0 else 0 end) as ratio_10_20,
       avg(case when weight between 50 and 60 then 1.0 else 0 end) as ratio_50_60
from t;
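To see the conditional aggregation in action, here is a minimal, self-contained sketch using Python's built-in sqlite3 module, loaded with the sample data from the question (the table name t is an assumption carried over from the query above):

```python
import sqlite3

# In-memory database seeded with the question's sample rows
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, name text, weight real)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [(1, "a", 11), (2, "b", 15), (3, "c", 26), (4, "d", 51), (5, "e", 70)],
)

# avg() over a 1.0 / 0 case expression yields the fraction of matching rows
row = conn.execute(
    """
    select avg(case when weight between 10 and 20 then 1.0 else 0 end),
           avg(case when weight between 50 and 60 then 1.0 else 0 end)
    from t
    """
).fetchone()

print(row)  # (0.4, 0.2)
conn.close()
```

The trick is that averaging a 1.0/0 indicator is exactly count-of-matches divided by total count, done in a single pass.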
You can use F.avg
to get the fraction of rows where the column weight
falls within a given interval. Column.between returns a boolean column, and .cast('int')
converts it to 1 if the comparison is true, else 0. Its average is then the proportion (e.g. 0.4 = 40%) you wanted to calculate.
import pyspark.sql.functions as F
df2 = df.select(
F.avg(F.col('weight').between(10,20).cast('int')).alias('10_20'),
F.avg(F.col('weight').between(50,60).cast('int')).alias('50_60')
)
df2.show()
+-----+-----+
|10_20|50_60|
+-----+-----+
| 0.4| 0.2|
+-----+-----+
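If you are working in plain pandas rather than Spark, the same idea applies even more directly: Series.between returns a boolean Series, and the mean of a boolean Series is the fraction of True values. A sketch using the question's sample data:

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5],
     "name": ["a", "b", "c", "d", "e"],
     "weight": [11, 15, 26, 51, 70]}
)

# mean of a boolean Series == fraction of rows where the condition holds
ratio_10_20 = df["weight"].between(10, 20).mean()  # 0.4
ratio_50_60 = df["weight"].between(50, 60).mean()  # 0.2
```

Multiply by 100 if you want percentages rather than fractions.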