How to calculate percentage over a dataframe
I have a scenario that can be simulated with the dataframe shown below.
Area Type NrPeople
1 House 200
1 Flat 100
2 House 300
2 Flat 400
3 House 1000
4 Flat 250
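I want to compute and return the number of people per area in descending order, but above all I am struggling to calculate each area's percentage of the overall total.
The result should look like this: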
Area SumPeople %
3 1000 44%
2 700 31%
1 300 13%
4 250 11%
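As a sanity check on the expected numbers, the same aggregation can be sketched in plain Python with no Spark involved. The variable names (rows, sum_people, total, result) are illustrative only; the data is the sample from the table above.

```python
from collections import defaultdict

# Sample rows from the question: (Area, Type, NrPeople)
rows = [
    (1, "House", 200),
    (1, "Flat", 100),
    (2, "House", 300),
    (2, "Flat", 400),
    (3, "House", 1000),
    (4, "Flat", 250),
]

# Sum people per area
sum_people = defaultdict(int)
for area, _type, nr_people in rows:
    sum_people[area] += nr_people

# Overall total, used as the denominator for every area
total = sum(sum_people.values())

# Sort by SumPeople descending and attach each area's share of the total
result = [
    (area, people, 100 * people / total)
    for area, people in sorted(sum_people.items(), key=lambda kv: kv[1], reverse=True)
]

for area, people, pct in result:
    print(f"{area} {people} {pct:.0f}%")
```

The point to carry over to Spark is that the denominator is a single number for the whole dataset, while the numerator is computed per group.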
See the code sample below:
HouseDf = spark.createDataFrame([("1", "House", "200"),
("1", "Flat", "100"),
("2", "House", "300"),
("2", "Flat", "400"),
("3", "House", "1000"),
("4", "Flat", "250")],
["Area", "Type", "NrPeople"])
import pyspark.sql.functions as fn
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total'))
Top = HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))
Top.show()
This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'
Any ideas on how to do this are welcome!
You need a window function:
import pyspark.sql.functions as fn
from pyspark.sql.functions import col
from pyspark.sql import Window
# Frame spanning every row, so the sum over it is the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending=False)\
.withColumn('total', fn.sum(col('SumPeople')).over(window))\
.withColumn('Percent', col('SumPeople')*100/col('total'))\
.drop('total').show()
Output:
+----+---------+------------------+
|Area|SumPeople|           Percent|
+----+---------+------------------+
|   3|   1000.0| 44.44444444444444|
|   2|    700.0| 31.11111111111111|
|   1|    300.0|13.333333333333334|
|   4|    250.0| 11.11111111111111|
+----+---------+------------------+
Well, the error seems simple enough: Total is a DataFrame, and you can't divide a number by a DataFrame. First, you can turn it into a plain number using collect:
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0]
Then, with some additional formatting, the following should work:
HouseDf\
.groupBy('Area')\
.agg(fn.sum('NrPeople').alias('SumPeople'))\
.orderBy('SumPeople', ascending = False)\
.withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople')/Total * 100))\
.show()
+----+---------+---+
|Area|SumPeople|  %|
+----+---------+---+
|   3|   1000.0|44%|
|   2|    700.0|31%|
|   1|    300.0|13%|
|   4|    250.0|11%|
+----+---------+---+
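That said, I'm not sure % is a good column name, since it makes the column harder to reuse; consider naming it Percent or something similar.
You can use this approach to avoid the collect step: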
import pyspark.sql.functions as f
HouseDf.createOrReplaceTempView("HouseDf")
df2 = HouseDf.groupby('Area')\
.agg(f.sum(HouseDf.NrPeople).alias("SumPeople"))\
.withColumn("%", f.expr('SumPeople/(select sum(NrPeople) from HouseDf)'))
df2.show()
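I haven't tested it, but I would guess this runs faster than the other answers in this post.
This is equivalent to the following (the physical plans are very similar):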
HouseDf.createOrReplaceTempView("HouseDf")
sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""
spark.sql(sql).explain(True)
spark.sql(sql).show()
You almost certainly don't want to use a window that spans the whole dataset (e.g. w = Window.partitionBy()). In fact, Spark will warn you about this:
WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.