
PySpark Pandas: Groupby Identifying Column and Sum Two Different Columns to Create New 2x2 Table

I have the following sample dataset:

groupby previous    current
A       1           1
A       0           1
A       0           0
A       1           0
A       1           1
A       0           1

I want to create the following table by summing the "previous" and "current" columns.

previous_total   current_total
3                4

I have tried various combinations of groupby with .agg to achieve the table above, but wasn't able to get anything to run successfully.

I also know how to do this in Python pandas, but not in PySpark.
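For reference, a minimal pandas sketch of the aggregation the question describes (this is the pandas equivalent, not the PySpark answer; the column name is corrected from the "prevoius" typo in the sample):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "groupby":  ["A", "A", "A", "A", "A", "A"],
    "previous": [1, 0, 0, 1, 1, 0],
    "current":  [1, 1, 0, 0, 1, 1],
})

# Sum the two value columns and rename to the desired headers
totals = (df[["previous", "current"]]
          .sum()
          .rename({"previous": "previous_total", "current": "current_total"}))
print(totals)
```

This yields previous_total = 3 and current_total = 4, matching the desired table.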

Use the sum and groupBy methods:

>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(
...     col("sum(previous)").alias("previous_total"),
...     col("sum(current)").alias("current_total")
... ).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+

Additionally, you could register your dataframe as a temp view and use Spark SQL to query it, which gives identical results:

>>> df.createOrReplaceTempView("df")
>>> spark.sql("select sum(previous) as previous_total, sum(current) as current_total from df").show()

You can use select and sum:

from pyspark.sql.functions import sum

df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))

df_result.show()

+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
