
How to calculate percentage over a dataframe

I have a scenario, simulated in a dataframe, which looks something like the one below.

Area   Type    NrPeople     
1      House    200
1      Flat     100
2      House    300
2      Flat     400
3      House   1000
4      Flat     250

I want to calculate and return the number of people per area in descending order, but most importantly I'm struggling to calculate the overall percentage.

The result should look like this:

Area   SumPeople      %     
3       1000        44%
2        700        31%
1        300        13%
4        250        11%

See the code sample below:

HouseDf = spark.createDataFrame([("1", "House", "200"),
                                 ("1", "Flat", "100"),
                                 ("2", "House", "300"),
                                 ("2", "Flat", "400"),
                                 ("3", "House", "1000"),
                                 ("4", "Flat", "250")],
                                ["Area", "Type", "NrPeople"])

import pyspark.sql.functions as fn 
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')) 

Top = HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.lit(HouseDf.agg(fn.sum('NrPeople'))/Total.Total))

Top.show()

This fails with: unsupported operand type(s) for /: 'int' and 'DataFrame'

Any ideas on how to do this are welcome!

You need a window function:

import pyspark.sql.functions as fn
from pyspark.sql.functions import sum, col
from pyspark.sql import Window

# A window over the entire result set, so sum() over it yields the grand total
window = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('total', sum(col('SumPeople')).over(window))\
    .withColumn('Percent', col('SumPeople') * 100 / col('total'))\
    .drop(col('total')).show()

Output:

+----+---------+------------------+
|Area|SumPeople|           Percent|
+----+---------+------------------+
|   3|   1000.0| 44.44444444444444|
|   2|    700.0| 31.11111111111111|
|   1|    300.0|13.333333333333334|
|   4|    250.0| 11.11111111111111|
+----+---------+------------------+
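If you want whole-number percentages like in the question's expected result, here is a small sketch on top of this answer (an addition, not part of the original answer; it only adds fn.round, a standard pyspark.sql.functions helper, and an int cast):

# Rounded variant: 44 instead of 44.4444..., reusing fn, sum, col and window from above
HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('total', sum(col('SumPeople')).over(window))\
    .withColumn('Percent', fn.round(col('SumPeople') * 100 / col('total')).cast('int'))\
    .drop('total').show()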

Well, the error seems to be pretty straightforward: Total is a DataFrame, and you can't divide an integer by a DataFrame. First, you could convert it to an integer using collect:

Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).collect()[0][0] 
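As a side note (not part of the original answer), first() fetches the same scalar without indexing into the collected list:

# Same idea with first(): pull the single aggregated row back to the driver
Total = HouseDf.agg(fn.sum('NrPeople').alias('Total')).first()['Total']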

Then, with some additional formatting, the following should work:

HouseDf\
    .groupBy('Area')\
    .agg(fn.sum('NrPeople').alias('SumPeople'))\
    .orderBy('SumPeople', ascending=False)\
    .withColumn('%', fn.format_string("%2.0f%%", fn.col('SumPeople') / Total * 100))\
    .show()

+----+---------+---+
|Area|SumPeople|  %|
+----+---------+---+
|   3|   1000.0|44%|
|   2|    700.0|31%|
|   1|    300.0|13%|
|   4|    250.0|11%|
+----+---------+---+

Though I'm not sure % is a very good column name, since it will be harder to reuse; consider naming it something like Percent instead.
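For instance, a one-line sketch (df here is hypothetical, standing for whichever result dataframe you built above):

# withColumnRenamed is the standard DataFrame method for renaming a column
df = df.withColumnRenamed('%', 'Percent')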

You can use this approach to avoid the collect step:

import pyspark.sql.functions as f

# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the modern equivalent
HouseDf.registerTempTable("HouseDf")

df2 = HouseDf.groupby('Area')\
    .agg(f.sum(HouseDf.NrPeople).alias("SumPeople"))\
    .withColumn("%", f.expr('SumPeople / (select sum(NrPeople) from HouseDf)'))
df2.show()

I haven't tested it, but my guess is that this will perform faster than the other answers in this post.

This is equivalent (the physical plan is very similar) to something like this:

HouseDf.registerTempTable("HouseDf")
sql = """
select Area, sum(NrPeople) as sum, sum(NrPeople)/(select sum(NrPeople) from HouseDf) as new
from HouseDf
group by Area
"""

spark.sql(sql).explain(True)
spark.sql(sql).show()

You almost certainly don't want to use the option with a window that spans the whole dataset (like w = Window.partitionBy()). In fact, Spark will warn you about this:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
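As a sketch of one more alternative (an addition to this thread, assuming the HouseDf from the question): compute the grand total as a one-row dataframe and broadcast-join it, which avoids both the collect() round trip and the single-partition window.

import pyspark.sql.functions as fn

sums = HouseDf.groupBy('Area').agg(fn.sum('NrPeople').alias('SumPeople'))
total = sums.agg(fn.sum('SumPeople').alias('Total'))  # one-row dataframe holding the grand total

# broadcast() hints Spark to ship the tiny total to every executor instead of shuffling
sums.crossJoin(fn.broadcast(total))\
    .withColumn('Percent', fn.col('SumPeople') * 100 / fn.col('Total'))\
    .drop('Total')\
    .orderBy('SumPeople', ascending=False)\
    .show()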
