I have the following sample dataset:
groupby previous current
A 1 1
A 0 1
A 0 0
A 1 0
A 1 1
A 0 1
I want to create the following table by summing the "previous" and "current" columns:
previous_total current_total
3 4
I have tried various combinations of groupBy with .agg to achieve the table above, but couldn't get anything to run successfully.
I know how to do this in pandas, but not in PySpark.
Use the groupBy and sum methods:
>>> from pyspark.sql.functions import col
>>> df.groupBy().sum().select(
...     col("sum(previous)").alias("previous_total"),
...     col("sum(current)").alias("current_total")
... ).show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+
Alternatively, you can register your DataFrame as a temporary table and query it with Spark SQL, which gives identical results:
>>> df.registerTempTable("df")
>>> spark.sql("select sum(previous) as previous_total, sum(current) as current_total from df").show()
You can use select with sum:
from pyspark.sql.functions import sum

df_result = df.select(sum("previous").alias("previous_total"),
                      sum("current").alias("current_total"))
df_result.show()
+--------------+-------------+
|previous_total|current_total|
+--------------+-------------+
|             3|            4|
+--------------+-------------+