Calculate sum and average of a column in a pyspark dataframe and create a new row for the calculated values
I have a pyspark dataframe:
Place         Month     Sector   Estimate  Profit
USA           1/1/2020  Sector1  5944
Col           1/1/2020  Sector1  398
IND           1/1/2020  Sector1  25
USA           1/1/2020  Sector2            6.9%
Col           1/1/2020  Sector2            0.4%
China         1/1/2020  Sector2            0.0%
Aus           1/1/2020  Sector2            7.7%
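I need to calculate, grouped by Month and Sector, the sum of the Estimate column (including all values) and the average of the Profit column (excluding 0.0%).
I also need to add an extra value, Every Places, in the Place field to hold these sums and averages. So my desired dataframe should look like this: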
Place         Month     Sector   Estimate  Profit
USA           1/1/2020  Sector1  5944
Col           1/1/2020  Sector1  398
IND           1/1/2020  Sector1  25
USA           1/1/2020  Sector2            6.9%
Col           1/1/2020  Sector2            0.4%
China         1/1/2020  Sector2            0.0%
Aus           1/1/2020  Sector2            7.7%
Every Places  1/1/2020  Sector1  6367
Every Places  1/1/2020  Sector2            5%
I tried the following code, but I get a `TypeError: Column is not iterable` error:
df1=df.withColumn('Place',lit('Every Places')) \
.groupBy('Month','Sector') \
.sum((col('Estimate'))),
avg(F.col('Profit'))
How can I solve this?
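The TypeError most likely arises because the sum method of a grouped dataframe takes column names as strings rather than Column objects, and the avg call sits outside the method chain. You can instead group by Month + Sector first, computing the sum of Estimate and the average of Profit inside agg, and then union the result with the original dataframe to get the expected output: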
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (already defined as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("USA", "1/1/2020", "Sector1", 5944, None), ("Col", "1/1/2020", "Sector1", 398, None),
    ("IND", "1/1/2020", "Sector1", 25, None), ("USA", "1/1/2020", "Sector2", None, "6.9%"),
    ("Col", "1/1/2020", "Sector2", None, "0.4%"), ("China", "1/1/2020", "Sector2", None, "0.0%"),
    ("Aus", "1/1/2020", "Sector2", None, "7.7%")], ["Place", "Month", "Sector", "Estimate", "Profit"]
)

grouped_df = df.withColumn(
    "Profit",
    F.regexp_extract("Profit", "(.+)%", 1)  # extract percentage from string
).groupBy("Month", "Sector").agg(
    F.sum(F.col("Estimate")).alias("Estimate"),
    F.concat(
        F.sum("Profit") / F.sum(F.when(F.col("Profit") > 0.0, 1)),  # exclude 0% from calculation
        F.lit("%")
    ).alias("Profit")
).withColumn(
    "Place",
    F.lit("Every Places")  # label for the aggregated rows
)

# unionByName matches columns by name, so the differing column order is fine
df1 = df.unionByName(grouped_df)
df1.show()
#+------------+--------+-------+--------+------+
#|       Place|   Month| Sector|Estimate|Profit|
#+------------+--------+-------+--------+------+
#|         USA|1/1/2020|Sector1|    5944|  null|
#|         Col|1/1/2020|Sector1|     398|  null|
#|         IND|1/1/2020|Sector1|      25|  null|
#|         USA|1/1/2020|Sector2|    null|  6.9%|
#|         Col|1/1/2020|Sector2|    null|  0.4%|
#|       China|1/1/2020|Sector2|    null|  0.0%|
#|         Aus|1/1/2020|Sector2|    null|  7.7%|
#|Every Places|1/1/2020|Sector2|    null|  5.0%|
#|Every Places|1/1/2020|Sector1|  6367.0|  null|
#+------------+--------+-------+--------+------+
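To sanity-check the expected numbers without a Spark session, the same aggregation can be sketched in plain Python. This is only an illustrative cross-check, not part of the PySpark solution: the `summarize` helper and the `rows` list are hypothetical names introduced here, and `Decimal` is used so the percentages add up exactly.

```python
from decimal import Decimal

# Sample rows copied from the question: (Place, Month, Sector, Estimate, Profit)
rows = [
    ("USA", "1/1/2020", "Sector1", 5944, None),
    ("Col", "1/1/2020", "Sector1", 398, None),
    ("IND", "1/1/2020", "Sector1", 25, None),
    ("USA", "1/1/2020", "Sector2", None, "6.9%"),
    ("Col", "1/1/2020", "Sector2", None, "0.4%"),
    ("China", "1/1/2020", "Sector2", None, "0.0%"),
    ("Aus", "1/1/2020", "Sector2", None, "7.7%"),
]


def summarize(rows):
    """Hypothetical helper: map (month, sector) to (estimate_sum, profit_avg).

    Estimates are summed over all non-missing values; profits are averaged
    over non-zero percentages only. A side with no usable values is None.
    """
    estimates = {}
    profits = {}
    keys = set()
    for _place, month, sector, estimate, profit in rows:
        key = (month, sector)
        keys.add(key)
        if estimate is not None:
            estimates.setdefault(key, []).append(estimate)
        if profit is not None:
            value = Decimal(profit.rstrip("%"))  # "6.9%" -> Decimal("6.9")
            if value != 0:  # skip 0.0%, as the question requires
                profits.setdefault(key, []).append(value)

    summary = {}
    for key in keys:
        est = sum(estimates[key]) if key in estimates else None
        avg = sum(profits[key]) / len(profits[key]) if key in profits else None
        summary[key] = (est, avg)
    return summary


for (month, sector), (est, avg) in sorted(summarize(rows).items()):
    print("Every Places", month, sector, est, avg)
```

For the sample data, Sector1 sums to 5944 + 398 + 25 = 6367, and Sector2 averages (6.9 + 0.4 + 7.7) / 3 = 5 once the 0.0% row is left out, which matches the two Every Places rows in the desired output.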