
Using pandas udf without looping in pyspark

So suppose I have a big Spark DataFrame, and I don't know how many columns it has.

(The solution has to be in PySpark using a pandas UDF, not a different approach.)

I want to perform an action on all columns, so it's fine to loop over the columns, but I don't want to loop through rows. I want the operation to act on a whole column at once.

I didn't find anywhere on the internet how this could be done.

Suppose I have this DataFrame:

A   B    C
5   3    2
1   7    0

Now I want to send it to a pandas UDF to get the sum of each row:

Sum 
 10
  8

The number of columns is not known.

I can do it inside the UDF by looping one row at a time, but I don't want to. I want it to act on all rows without looping, and I'm fine with looping through columns if needed.

One option I tried is combining all columns into an array column:

ARR
[5,3,2]
[1,7,0]

But even here it doesn't work for me without looping. I send this column to the UDF, and then inside I still need to loop through its rows and sum the values of each list.

It would be nice if I could treat each column separately and act on the whole column at once.

How do I act on a column at once, without looping through the rows?

If I loop through the rows, I guess it's no better than a regular Python UDF.
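
For reference, a vectorized Series-to-Series pandas UDF over the array column described above could look like the sketch below. The UDF name sum_arr and the pd.DataFrame(s.tolist()) expansion are illustrative choices, not taken from the post; the point is that the batch is summed row-wise without an explicit Python loop over rows.

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf

# hypothetical UDF name; it receives a pandas Series whose values are the per-row lists
@pandas_udf("double")
def sum_arr(s: pd.Series) -> pd.Series:
    # expand the lists into a DataFrame and sum row-wise: vectorized, no explicit row loop
    return pd.DataFrame(s.tolist()).sum(axis=1).astype("float64")

# build the array column from all columns, then apply the UDF once per Arrow batch
df.withColumn("Sum", sum_arr(F.array(*df.columns))).show()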

I wouldn't go to pandas UDFs; only resort to UDFs when it can't be done in plain PySpark. Anyway, code for both approaches is below.

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import array, expr, lit

# Pull the delay columns from the sample dataset (a small DataFrame is also built below)
df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')

sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))

# sf.show()

columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]

sf = spark.createDataFrame(data=data, schema=columns)

sf.show()

# Option 1: higher-order aggregate function, no UDF needed
new = (sf.withColumn('sums', array(*[x for x in ['ArrDelay', 'DepDelay']]))  # array of the desired columns per row
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # sum the array
      ).show()
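
Since the question says the number of columns is not known in advance, the same aggregate trick can be written against whatever columns happen to be present. A minimal sketch under that assumption (the value_cols filter excluding 'id' and the generic name are illustrative, not from the original answer):

value_cols = [c for c in sf.columns if c != 'id']  # hypothetical filter: every column except the key
generic = (sf.withColumn('sums', array(*value_cols))
             .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)")))
generic.show()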


# Option 2: pandas UDF via mapInPandas
sch = sf.withColumn('v', lit(90.087654623)).schema  # output schema: input columns plus a double column 'v'

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        # row-wise sum of the numeric columns, computed on the whole pandas batch at once
        yield pdf.assign(v=pdf.sum(axis=1, numeric_only=True))

sf.mapInPandas(sum_s, schema=sch).show()
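
The lit(90.087654623) above is only a trick to obtain an output schema with one extra double column; an equivalent, arguably clearer sketch builds that schema explicitly:

from pyspark.sql.types import DoubleType, StructField, StructType

# same output schema as above, built without the dummy literal
sch = StructType(sf.schema.fields + [StructField("v", DoubleType())])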

Here's a simple way to do it:

from functools import reduce

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)

cols = df.columns

# build a single column expression: col("A") + col("B") + col("C") + ...
calculate_sum = reduce(lambda a, x: a + x, map(F.col, cols))

df = df.withColumn("sum", calculate_sum)

df.show()

output:

+---+---+---+---+
|  A|  B|  C|sum|
+---+---+---+---+
|  5|  3|  2| 10|
|  1|  7|  0|  8|
+---+---+---+---+
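
Note that a plain `+` chain returns null for a row as soon as any of its columns is null. If that matters, a variant that wraps each column in coalesce is a possible fix (a sketch, assuming 0 is an acceptable substitute for null):

# treat nulls as 0 so one missing value does not null out the whole row sum
calculate_sum = reduce(lambda a, x: a + F.coalesce(F.col(x), F.lit(0)), cols, F.lit(0))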
