PySpark Pandas: Groupby Identifying Column and Sum Two Different Columns to Create New 2x2 Table
I have the following sample dataset:
groupby previous current
A 1 1
A 0 1
A 0 0
A 1 0
A 1 1
A 0 1
I want to create the following table by summing the "previous" and "current" columns.
previous_total current_total
3 4
I have tried various combinations of groupby with .agg to achieve the table above, but wasn't able to get anything to run successfully.
I also know how to do this in Python Pandas, but not in PySpark.
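For reference, a minimal pandas sketch of the same aggregation (column names and values taken from the sample data above):

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "groupby": ["A"] * 6,
    "previous": [1, 0, 0, 1, 1, 0],
    "current": [1, 1, 0, 0, 1, 1],
})

# Sum each column, then relabel to the desired headers
totals = df[["previous", "current"]].sum().rename(
    {"previous": "previous_total", "current": "current_total"}
)
print(totals)  # previous_total is 3, current_total is 4
```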
Use the sum and groupBy methods:
>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(col("sum(previous)").alias("previous_total"), col("sum(current)").alias("current_total")).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
Additionally, you could register your dataframe as a temp view and use Spark SQL to query it, which will give identical results:
>>> df.createOrReplaceTempView("df")
>>> spark.sql("select sum(previous) as previous_total, sum(current) as current_total from df").show()
You can use select and sum:
from pyspark.sql.functions import sum  # note: this shadows Python's built-in sum

df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))
df_result.show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
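The question title also mentions grouping by an identifying column. If per-group totals are wanted rather than a single overall row, the pandas analogue is a groupby with named aggregation (shown here with pandas since it runs without a Spark session; the PySpark equivalent would pass "groupby" to groupBy before agg):

```python
import pandas as pd

# Same sample data as in the question (only group "A" is present)
df = pd.DataFrame({
    "groupby": ["A"] * 6,
    "previous": [1, 0, 0, 1, 1, 0],
    "current": [1, 1, 0, 0, 1, 1],
})

# One row per group, with named aggregation producing the desired headers
per_group = df.groupby("groupby").agg(
    previous_total=("previous", "sum"),
    current_total=("current", "sum"),
)
print(per_group)  # group "A": previous_total 3, current_total 4
```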