
How can I sum multiple columns in a spark dataframe in pyspark?

I've got a list of column names that I want to sum:

columns = ['col1','col2','col3']

How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)

Dataframe with result I want:

col1   col2   col3   result
 1      2      3       6
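For reference, a minimal sketch to build this example DataFrame (assuming an active SparkSession; the variable names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row of example data matching the table above
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])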

[Editing to explain each step]

If you have a static list of columns, you can do this:

df.withColumn("result", col("col1") + col("col2") + col("col3"))

But if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use Python's reduce function with operator.add to get this:

reduce(add, [col(x) for x in df.columns])

The columns are added two at a time, so reduce actually builds (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3"), but the effect is the same.

The col(x) ensures that you are adding Column expressions rather than plain strings; a simple string concatenation would just produce col1col2col3.
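A small sketch of the difference (assuming the imports from the TL;DR below and a running Spark session):

from functools import reduce
from operator import add
from pyspark.sql.functions import col

columns = ['col1', 'col2', 'col3']

reduce(add, columns)                     # plain strings: 'col1col2col3'
reduce(add, [col(x) for x in columns])   # a Column expression: ((col1 + col2) + col3)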

[TL;DR]

Combining the above steps, you can do this:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))

The df.na.fill(0) portion is to handle nulls in your data. If you don't have any nulls, you can skip that and do this instead:

df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
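To see why the fill matters: in Spark, any null operand makes the whole sum null. A minimal sketch with hypothetical data (assuming a SparkSession named spark and the imports above):

df_nulls = spark.createDataFrame([(1, 2, 3), (1, None, 3)], ['col1', 'col2', 'col3'])

# Without the fill, 1 + NULL + 3 evaluates to NULL
df_nulls.withColumn("result", reduce(add, [col(x) for x in df_nulls.columns])).show()

# With the fill, nulls become 0 first, so the second row sums to 4
df_nulls.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df_nulls.columns])).show()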

Try this:

# Python's built-in sum folds the columns with +, starting from 0
df = df.withColumn('result', sum(df[c] for c in df.columns))

df.columns is the list of column names in df.
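If you only want to sum the columns named in your list rather than every column of df, pass the list instead; a sketch using the columns list from the question:

columns = ['col1', 'col2', 'col3']
df = df.withColumn('result', sum(df[c] for c in columns))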

Add multiple columns from a list into one column

I tried a lot of methods and the following are my observations:

  1. PySpark's sum function (pyspark.sql.functions.sum) is an aggregate and doesn't support column addition (PySpark version 2.3.1).
  2. Python's built-in sum works for some folks but gives an error for others, most likely when a star import has shadowed it (see the sketch after this list).
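A sketch of the likely pitfall behind observation 2 (this cause is an assumption; reports vary):

from pyspark.sql.functions import *   # the star import shadows Python's built-in sum

# sum is now pyspark.sql.functions.sum (an aggregate function), so this raises an error:
# df = df.withColumn('result', sum(df[c] for c in df.columns))

from builtins import sum              # restore the built-in (Python 3)
df = df.withColumn('result', sum(df[c] for c in df.columns))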

So, adding multiple columns can be achieved with PySpark's expr function, which takes a SQL expression string to compute as input.

from pyspark.sql.functions import expr

cols_list = ['a', 'b', 'c']

# Creating an addition expression using `join`
expression = '+'.join(cols_list)

df = df.withColumn('sum_cols', expr(expression))

This gives us the desired sum of the columns. The same approach works for any other arithmetic expression.
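As a hypothetical variation, the same join pattern can build other expressions, for example a weighted sum:

# Hypothetical: weight each column by 2 instead of taking a plain sum
expression = ' + '.join('2 * {}'.format(c) for c in cols_list)  # '2 * a + 2 * b + 2 * c'
df = df.withColumn('weighted_sum', expr(expression))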
