
PySpark Pandas: Groupby Identifying Column and Sum Two Different Columns to Create New 2x2 Table

I have the following sample dataset:

groupby previous    current
A       1           1
A       0           1
A       0           0
A       1           0
A       1           1
A       0           1

I want to create the following table by summing the "previous" and "current" columns:

previous_total   current_total
3                4

I have tried various combinations of groupBy with .agg to produce the table above, but wasn't able to get anything to run successfully.

I also know how to do this in Python pandas, but not in PySpark.
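For comparison, the pandas version the question alludes to might look like the sketch below (variable names are assumptions; the data matches the sample above):

```python
import pandas as pd

# reproduce the sample dataset from the question
df = pd.DataFrame({
    "groupby":  ["A", "A", "A", "A", "A", "A"],
    "previous": [1, 0, 0, 1, 1, 0],
    "current":  [1, 1, 0, 0, 1, 1],
})

# sum both columns over all rows, then reshape into a one-row table
totals = df[["previous", "current"]].sum().to_frame().T
totals.columns = ["previous_total", "current_total"]
print(totals)
#    previous_total  current_total
# 0               3              4
```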

Use the sum and groupBy methods:

>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(col("sum(previous)").alias("previous_total"), col("sum(current)").alias("current_total")).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+

Additionally, you could register your DataFrame as a temporary view and use Spark SQL to query it, which gives identical results (in Spark 2.0+ use createOrReplaceTempView; registerTempTable is deprecated):

>>> df.createOrReplaceTempView("df")
>>> spark.sql("select sum(previous) as previous_total, sum(current) as current_total from df").show()

You can use select and sum:

# note: this import shadows Python's built-in sum in the current scope
from pyspark.sql.functions import sum

# select with only aggregate expressions aggregates over the whole DataFrame
df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))

df_result.show()

+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
