I have a scenario that results in a dataframe which looks something like below.
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
I want to calculate and return the number of people per area in descending order, but most importantly I am struggling to calculate the overall percentage.
Result should look like this:
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See code sample below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
("1", "Flat", "100"),
("2", "House", "300"),
("2", "Flat", "400"),
("3", "House", "1000"),
("4", "Flat", "250")],
["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql.functions import sum, col
from pyspark.sql import Window
window = Window.rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('total',sum(col('SumPeople')).over(window))\
.withColumn('Percent',col('SumPeople')*100/col('total'))\
.drop(col('total')).show()
Output:
+----+---------+------------------+
|Area|SumPeople| Percent|
+----+---------+------------------+
| 3| 1000.0| 44.44444444444444|
| 2| 700.0| 31.11111111111111|
| 1| 300.0|13.333333333333334|
| 4| 250.0| 11.11111111111111|
+----+---------+------------------+
Well, the error seems to be pretty straightforward: Total
is a DataFrame, and you can't divide an integer by a DataFrame. First, you could convert it to an integer using collect:
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0]
Then, with some additional formatting, the following should work:
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending = False)\
.withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople')/Total * 100))\
.show()
+----+---------+----+
|Area|SumPeople|   %|
+----+---------+----+
|   3|   1000.0| 44%|
|   2|    700.0| 31%|
|   1|    300.0| 13%|
|   4|    250.0| 11%|
+----+---------+----+
Though I'm not sure % is a very good column name, as it will be harder to reuse; consider naming it something like Percent
instead.
You can use this approach to avoid the collect step:
import pyspark.sql.functions as f

HouseDf.registerTempTable("HouseDf")
df2 = HouseDf.groupby('Area')\
    .agg(f.sum(HouseDf.NrPeople).alias("SumPeople"))\
    .withColumn("%", f.expr('SumPeople/(select sum(NrPeople) from HouseDf)'))
df2.show()
I haven't tested it, but my guess is that it will perform faster than the other answers in this post.
This is equivalent (the physical plan is very similar) to something like this:
HouseDf.registerTempTable("HouseDf")
sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""
spark.sql(sql).explain(True)
spark.sql(sql).show()
You almost certainly don't want to use the option with a window that spans the whole dataset (like w = Window.partitionBy()
). In fact, Spark will warn you about this:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.