How to calculate percentage over a dataframe

I have a scenario that maps to a dataframe which looks something like the one below.

Area   Type    NrPeople     
1      House    200
1      Flat     100
2      House    300
2      Flat     400
3      House   1000
4      Flat     250

I want to calculate and return the number of people per area in descending order, but, most importantly, I am struggling to calculate each area's percentage of the overall total.

Result should look like this:

Area   SumPeople      %     
3       1000        44%
2        700        31%
1        300        13%
4        250        11%

See code sample below:

HouseDf = spark.createDataFrame([("1", "House", "200"), 
                              ("1", "Flat", "100"), 
                              ("2", "House", "300"), 
                              ("2", "Flat", "400"),
                              ("3", "House", "1000"), 
                              ("4", "Flat", "250")],
                              ["Area", "Type", "NrPeople"])

import pyspark.sql.functions as fn 
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')) 

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))

This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'

Any ideas on how to do this are welcome!

You need a window function:

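import pyspark.sql.functions as fn
from pyspark.sql import Window

# Unpartitioned window spanning every row, used to compute the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.sum('SumPeople').over(window))\
    .orderBy('SumPeople', ascending=False)
Top.show()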

Output:

|Area|SumPeople|           Percent|
|   3|   1000.0| 44.44444444444444|
|   2|    700.0| 31.11111111111111|
|   1|    300.0|13.333333333333334|
|   4|    250.0| 11.11111111111111|
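
As a sanity check on the figures above, the same arithmetic can be reproduced without a Spark session. This is only a plain-Python sketch of the calculation (per-area sum, grand total, share of total), not Spark code; the variable names are made up for illustration.

```python
# Rows copied from the question; NrPeople is kept as strings, as in the original DataFrame.
rows = [("1", "House", "200"), ("1", "Flat", "100"),
        ("2", "House", "300"), ("2", "Flat", "400"),
        ("3", "House", "1000"), ("4", "Flat", "250")]

# Sum people per area (the groupBy/agg step).
sum_people = {}
for area, _type, nr in rows:
    sum_people[area] = sum_people.get(area, 0.0) + float(nr)

# Grand total over all areas (the value the whole-dataset window gives every row).
total = sum(sum_people.values())

# Percentage per area, sorted by SumPeople in descending order.
result = sorted(
    ((area, people, people * 100 / total) for area, people in sum_people.items()),
    key=lambda r: r[1], reverse=True)

for area, people, percent in result:
    print(area, people, round(percent, 2))
```

The grand total is 2250, so area 3 contributes 1000 * 100 / 2250, which is roughly 44.44.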

Well, the error is pretty straightforward: Total is a DataFrame, and you can't divide a number by a DataFrame. First, you could convert it to a plain number using collect:

Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0] 

Then, with some additional formatting, the following should work:

Top = HouseDf.groupBy('Area').agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending = False)\
    .withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople')/Total * 100))

|Area|SumPeople|  %|
|   3|   1000.0|44%|
|   2|    700.0|31%|
|   1|    300.0|13%|
|   4|    250.0|11%|

Though I'm not sure % is a very good column name, as it will be harder to reuse; consider naming it something like Percent instead.
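
To preview what the `%2.0f%%` part of the pattern produces, here is a small plain-Python sketch. It assumes Python's printf-style `%` operator treats this particular pattern the same way as Spark's printf-style `format_string`; the input values are the unrounded percentages from the window answer above.

```python
# Plain-Python preview of the "%2.0f%%" pattern; Spark itself is not involved here.
percentages = [44.44444444444444, 31.11111111111111, 13.333333333333334, 11.11111111111111]

# "%2.0f" rounds to zero decimals (minimum width 2); "%%" emits a literal percent sign.
formatted = ["%2.0f%%" % p for p in percentages]
print(formatted)
```

Note that the formatted column holds strings, so keep a numeric column as well if the percentages need to be sorted or aggregated later.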

You can use this approach to avoid the collect step:


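HouseDf.createOrReplaceTempView('HouseDf')  # the SQL subquery below resolves HouseDf through this view name
df2 = HouseDf.groupby('Area').agg(fn.sum(HouseDf.NrPeople).alias("SumPeople")).withColumn("%", fn.expr('SumPeople/(select sum(NrPeople) from HouseDf)'))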

I haven't tested, but my guess is that this will perform faster than the other answers in this post

This is equivalent (the physical plan is very similar) to something like this:

sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""
spark.sql(sql)

You almost certainly don't want to use the option with a window that spans the whole dataset (like w = Window.partitionBy()). In fact, Spark will warn you about this:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

