计算 pyspark dataframe 中列的总和和平均值，并为计算值创建一个新行

Question

我有一个 pyspark dataframe

Place   Month       Sector      Estimate    Profit  
USA     1/1/2020    Sector1     5944
Col     1/1/2020    Sector1     398
IND     1/1/2020    Sector1     25
USA     1/1/2020    Sector2                 6.9%
Col     1/1/2020    Sector2                 0.4%
China   1/1/2020    Sector2                 0.0%
Aus     1/1/2020    Sector2                 7.7%

我需要计算按Month和Sector分组的所有Estimate列（包括所有值）和所有Profit列（不包括 0.0%）的平均值。

我需要在 Place 字段中添加一个额外的值，因为Every Places都具有这些总和和平均值。 所以，我想要的 dataframe 应该是这样的：

Place           Month       Sector      Estimate    Profit  
USA             1/1/2020    Sector1     5944
Col             1/1/2020    Sector1     398
IND             1/1/2020    Sector1     25
USA             1/1/2020    Sector2                 6.9%
Col             1/1/2020    Sector2                 0.4%
China           1/1/2020    Sector2                 0.0%
Aus             1/1/2020    Sector2                 7.7%
Every Places    1/1/2020    Sector1     6367
Every Places    1/1/2020    Sector2                 5%

我尝试使用此代码，但我得到：

TypeError: Column is not iterable` 错误。

df1=df.withColumn('Place',lit('Every Places')) \
               .groupBy('Month','Sector') \
               .sum((col('Estimate'))),
               avg(F.col('Profit'))

我该如何解决这个问题？

Answer 1

您可以先按Month + Sector分组以计算Estimate的总和和Profit的平均值，然后使用与原始 dataframe 的联合来获得预期的 output：

import pyspark.sql.functions as F

df = spark.createDataFrame([
    ("USA", "1/1/2020", "Sector1", 5944, None), ("Col", "1/1/2020", "Sector1", 398, None),
    ("IND", "1/1/2020", "Sector1", 25, None), ("USA", "1/1/2020", "Sector2", None, "6.9%"),
    ("Col", "1/1/2020", "Sector2", None, "0.4%"), ("China", "1/1/2020", "Sector2", None, "0.0%"),
    ("Aus", "1/1/2020", "Sector2", None, "7.7%")], ["Place", "Month", "Sector", "Estimate", "Profit"]
)

grouped_df = df.withColumn(
    "Profit",
    F.regexp_extract("Profit", "(.+)%", 1) # extract percentage from string
).groupBy("Month", "Sector").agg(
    F.sum(F.col("Estimate")).alias("Estimate"),
    F.concat(
        F.sum("Profit") / F.sum(F.when(F.col("Profit") > 0.0, 1)), # exclude 0% from calculation
        F.lit("%")
    ).alias("Profit")
).withColumn(
    "Place",
    F.lit("Every Places")
)

df1 = df.unionByName(grouped_df)

df1.show()
#+------------+--------+-------+--------+------+
#|       Place|   Month| Sector|Estimate|Profit|
#+------------+--------+-------+--------+------+
#|         USA|1/1/2020|Sector1|    5944|  null|
#|         Col|1/1/2020|Sector1|     398|  null|
#|         IND|1/1/2020|Sector1|      25|  null|
#|         USA|1/1/2020|Sector2|    null|  6.9%|
#|         Col|1/1/2020|Sector2|    null|  0.4%|
#|       China|1/1/2020|Sector2|    null|  0.0%|
#|         Aus|1/1/2020|Sector2|    null|  7.7%|
#|Every Places|1/1/2020|Sector2|    null|  5.0%|
#|Every Places|1/1/2020|Sector1|  6367.0|  null|
#+------------+--------+-------+--------+------+

计算 pyspark dataframe 中列的总和和平均值，并为计算值创建一个新行

问题描述

1 个解决方案

解决方案1
1 已采纳 2022-01-10 11:44:55

计算 pyspark dataframe 中列的总和和平均值，并为计算值创建一个新行

问题描述

1 个解决方案

解决方案1 1 已采纳 2022-01-10 11:44:55

解决方案1
1 已采纳 2022-01-10 11:44:55