How to calculate percentage over a dataframe
I have a scenario with a DataFrame that looks something like this:
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
I want to calculate and return the number of people per area in descending order, but most importantly I'm struggling to calculate the overall percentage.
Result should look like this:
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
See code sample below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
                                 ("1", "Flat", "100"),
                                 ("2", "House", "300"),
                                 ("2", "Flat", "400"),
                                 ("3", "House", "1000"),
                                 ("4", "Flat", "250")],
                                ["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql.functions import sum, col
from pyspark.sql import Window

# A window spanning every row, so sum() over it yields the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('total', sum(col('SumPeople')).over(window))\
    .withColumn('Percent', col('SumPeople') * 100 / col('total'))\
    .drop(col('total')).show()
Output:
+----+---------+------------------+
|Area|SumPeople| Percent|
+----+---------+------------------+
| 3| 1000.0| 44.44444444444444|
| 2| 700.0| 31.11111111111111|
| 1| 300.0|13.333333333333334|
| 4| 250.0| 11.11111111111111|
+----+---------+------------------+
Well, the error seems to be pretty straightforward: Total is a DataFrame, and you can't divide an integer by a DataFrame. First, you could convert it to an integer using collect:
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0]
Then, with some additional formatting, the following should work:
HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople') / Total * 100))\
    .show()
+----+---------+---+
|Area|SumPeople|  %|
+----+---------+---+
|   3|   1000.0|44%|
|   2|    700.0|31%|
|   1|    300.0|13%|
|   4|    250.0|11%|
+----+---------+---+
Though I'm not sure if % is a very good column name, as it will be harder to reuse; maybe consider naming it something like Percent instead.
You can use this approach to avoid the collect step:
import pyspark.sql.functions as f

HouseDf.registerTempTable("HouseDf")
df2 = HouseDf.groupby('Area')\
    .agg(f.sum(HouseDf.NrPeople).alias("SumPeople"))\
    .withColumn("%", f.expr('SumPeople / (select sum(NrPeople) from HouseDf)'))
df2.show()
I haven't tested, but my guess is that this will perform faster than the other answers in this post.
This is equivalent (the physical plan is very similar) to something like this:
HouseDf.registerTempTable("HouseDf")
sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""
spark.sql(sql).explain(True)
spark.sql(sql).show()
You almost certainly don't want to use the option with a window that spans the whole dataset (like w = Window.partitionBy()). In fact, Spark will warn you about this:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.