I've got a list of column names I want to sum
columns = ['col1','col2','col3']
How can I add the three and put it in a new column ? (in an automatic way, so that I can change the column list and have new results)
Dataframe with result I want:
col1 col2 col3 result
1 2 3 6
[Editing to explain each step]
If you have static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type the whole columns list, you need to generate the phrase col("col1") + col("col2") + col("col3")
iteratively. For this, you can use the reduce
method with add
function to get this:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you would get col(col("col1") + col("col2")) + col("col3")
instead of col("col1") + col("col2") + col("col3")
. But the effect would be same.
The col(x)
ensures that you are getting col(col("col1") + col("col2")) + col("col3")
instead of a simple string concat (which generates ( col1col2col3
).
[TL;DR,]
Combining the above steps, you can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
The df.na.fill(0)
portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:
df.withColumn("result" ,reduce(add, [col(x) for x in df.columns]))
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns
will be list of columns from df.
I tried a lot of methods and the following are my observations:
sum
function doesn't support column addition (Pyspark version 2.3.1) sum
function is working for some folks but giving error for others. So, the addition of multiple columns can be achieved using the expr
function in PySpark, which takes an expression to be computed as an input.
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other complex expression to get other output.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.