PySpark Pandas: Groupby Identifying Column and Sum Two Different Columns to Create New 2x2 Table
I have the following sample dataset:
groupby previous current
A 1 1
A 0 1
A 0 0
A 1 0
A 1 1
A 0 1
I want to create the following table by summing the "previous" and "current" columns.
previous_total current_total
3 4
I have tried various combinations of groupby with .agg to achieve the table above, but wasn't able to get anything to run successfully.
I also know how to do this in Python Pandas, but not in PySpark.
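For reference, a minimal pandas sketch of the same aggregation (column names and values taken from the sample data above):

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "groupby": ["A"] * 6,
    "previous": [1, 0, 0, 1, 1, 0],
    "current": [1, 1, 0, 0, 1, 1],
})

# Sum each column, then relabel to the desired headers
totals = df[["previous", "current"]].sum().rename(
    {"previous": "previous_total", "current": "current_total"}
)
print(totals)  # previous_total is 3, current_total is 4
```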
Use the sum and groupBy methods:
>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(col("sum(previous)").alias("previous_total"), col("sum(current)").alias("current_total")).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
Additionally, you could register your dataframe as a temp view and use Spark SQL to query it, which will give identical results:
>>> df.createOrReplaceTempView("df")
>>> spark.sql("select sum(previous) as previous_total, sum(current) as current_total from df").show()
You can use select and sum:
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum

df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))
df_result.show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
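The question title also mentions grouping by an identifying column. If per-group totals are wanted rather than a single overall row, the pandas analogue is a groupby with named aggregation (shown here with pandas since it runs without a Spark session; the PySpark equivalent would pass "groupby" to groupBy before agg):

```python
import pandas as pd

# Same sample data as in the question (only group "A" is present)
df = pd.DataFrame({
    "groupby": ["A"] * 6,
    "previous": [1, 0, 0, 1, 1, 0],
    "current": [1, 1, 0, 0, 1, 1],
})

# One row per group, with named aggregation producing the desired headers
per_group = df.groupby("groupby").agg(
    previous_total=("previous", "sum"),
    current_total=("current", "sum"),
)
print(per_group)  # group "A": previous_total 3, current_total 4
```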