I have a scenario that results in a dataframe which looks something like below.
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
I want to calculate and return the number of people per area in descending order, but most importantly I am struggling to calculate the overall percentage.
Result should look like this:
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See code sample below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
("1", "Flat", "100"),
("2", "House", "300"),
("2", "Flat", "400"),
("3", "House", "1000"),
("4", "Flat", "250")],
["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql.functions import sum, col
from pyspark.sql import Window
window = Window.rowsBetween(Window.unboundedPreceding,Window.unboundedFollowing)
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('total',sum(col('SumPeople')).over(window))\
.withColumn('Percent',col('SumPeople')*100/col('total'))\
.drop(col('total')).show()
Output:
+----+---------+------------------+
|Area|SumPeople| Percent|
+----+---------+------------------+
| 3| 1000.0| 44.44444444444444|
| 2| 700.0| 31.11111111111111|
| 1| 300.0|13.333333333333334|
| 4| 250.0| 11.11111111111111|
+----+---------+------------------+
Well, the error seems to be pretty straightforward: Total
is a DataFrame, and you can't divide an integer by a DataFrame. First, you could convert it to an integer using collect:
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0]
Then, with some additional formatting, the following should work:
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending = False)\
.withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople')/Total * 100))\
.show()
+----+---------+----+
|Area|SumPeople|   %|
+----+---------+----+
|   3|   1000.0| 44%|
|   2|    700.0| 31%|
|   1|    300.0| 13%|
|   4|    250.0| 11%|
+----+---------+----+
Though I'm not sure % is a very good column name, as it will be harder to reuse; consider naming it something like Percent
instead.
You can use this approach to avoid the collect step:
import pyspark.sql.functions as f

HouseDf.registerTempTable("HouseDf")
df2 = HouseDf.groupby('Area')\
    .agg(f.sum(HouseDf.NrPeople).alias("SumPeople"))\
    .withColumn("%", f.expr('SumPeople/(select sum(NrPeople) from HouseDf)'))
df2.show()
I haven't tested it, but my guess is that it will perform faster than the other answers in this post.
This is equivalent (the physical plan is very similar) to something like this:
HouseDf.registerTempTable("HouseDf")
sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""
spark.sql(sql).explain(True)
spark.sql(sql).show()
You almost certainly don't want to use the option with a window that spans the whole dataset (like w = Window.partitionBy()
). In fact, Spark will warn you about this:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.