Suppose I have a big Spark DataFrame and I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an operation on all columns. It's OK to loop over the columns, but I don't want to loop through the rows. I want it to act on a whole column at once.
I couldn't find on the internet how this could be done.
Suppose I have this DataFrame:
A B C
5 3 2
1 7 0
Now I want to send it to a pandas UDF to get the sum of each row:
Sum
10
8
The number of columns is not known.
I can do it inside the UDF by looping one row at a time, but I don't want that. I want it to act on all rows without looping, and looping through the columns is fine if needed.
One option I tried is combining all columns into an array column:
ARR
[5,3,2]
[1,7,0]
But even here it doesn't work for me without looping: I send this column to the UDF and then inside I need to loop through its rows and sum the values of each list.
It would be nice if I could pass each column separately and act on the whole column at once.
How do I act on a whole column at once, without looping through the rows?
If I loop through the rows, I guess it's no better than a regular Python UDF.
I wouldn't go to pandas UDFs; resort to UDFs only if it can't be done in native PySpark. Anyway, code for both is below.
from typing import Iterator

import pandas as pd
from pyspark.sql.functions import array, expr, lit

df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
# sf.show()

# Small in-memory example; this replaces the sf read from the CSV above
columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]
sf = spark.createDataFrame(data=data, schema=columns)
sf.show()

# Option 1: use the native aggregate function
new = (sf.withColumn('sums', array('ArrDelay', 'DepDelay'))  # array of values per row on the desired columns
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # sum the array
       )
new.show()

# Option 2: use a pandas UDF via mapInPandas
# Output schema = input schema plus a double column 'v'
sch = sf.withColumn('v', lit(90.087654623)).schema

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        # Row-wise sum over the whole batch at once; skip the string 'id' column
        yield pdf.assign(v=pdf.sum(axis=1, numeric_only=True))

sf.mapInPandas(sum_s, schema=sch).show()
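If a column-style pandas UDF is really required (rather than mapInPandas), one possible sketch is to pack all columns into a single struct column. Spark then hands the UDF a pandas DataFrame per batch, so the row sum is a single vectorised pandas call with no Python loop over rows. The names row_sum, with_row_sum and out_col below are made up for illustration, and the sketch assumes every column is numeric and that no column name contains a dot.

```python
import pandas as pd


def row_sum(pdf: pd.DataFrame) -> pd.Series:
    """Sum across columns for every row of the batch in one pandas call."""
    # pdf holds one batch of rows; each struct field is one pandas column
    return pdf.sum(axis=1).astype("float64")


def with_row_sum(df, out_col="Sum"):
    """Hypothetical helper: append the row sum of all columns of a Spark DataFrame."""
    # Imported here so row_sum itself can be tried with pandas alone
    from pyspark.sql import functions as F

    # The type hints (pd.DataFrame -> pd.Series) mark this as a Series-to-Series UDF
    row_sum_udf = F.pandas_udf(row_sum, returnType="double")
    # A struct input column reaches the UDF as a pandas DataFrame
    return df.withColumn(out_col, row_sum_udf(F.struct(*df.columns)))
```

On the DataFrame from the question, with_row_sum(df).show() should add a Sum column holding 10.0 and 8.0. The number of columns never has to be known up front, because df.columns is expanded when the expression is built.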
Here's a simple way to do it:
from functools import reduce

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)
cols = df.columns
# Fold all columns into one expression: A + B + C
calculate_sum = reduce(lambda a, x: a + x, map(F.col, cols))
df = df.withColumn("sum", calculate_sum)
df.show()
Output:
+---+---+---+---+
|  A|  B|  C|sum|
+---+---+---+---+
|  5|  3|  2| 10|
|  1|  7|  0|  8|
+---+---+---+---+