
PySpark dataframe with UDF

I have the code below for a PySpark DataFrame, which does some calculations and aggregations using the following variables passed to it:

number_of_plays = 8
date_from = '2020-01-05'
date_to = '2020-03-10'

from pyspark.sql.functions import col, count

df_1 = df.groupBy('player_1', 'player_2').agg(count("*").alias("no_of_plays")).filter(col('no_of_plays') > number_of_plays).filter(col('game_date').between(date_from, date_to))
df_1.show()

Now I want to wrap this in a Spark UDF where I can pass the three variables number_of_plays, date_from, and date_to as parameters, so the function should look like this:

def myfn(number_of_plays, date_from, date_to):
    # do the aggregation here and return the result

so that I can call it from my code.

Any ideas on how to do this in Python 3?

No UDF is necessary here: Spark UDFs operate row by row, while this is a whole-DataFrame transformation. A plain Python function that takes the DataFrame along with the three parameters will do the job:

from pyspark.sql.functions import col, count

def myfn(df, number_of_plays, date_from, date_to):
    # Filter on game_date first: after groupBy().agg(), only the grouping
    # columns and the aggregated column remain, so game_date would no
    # longer be available for filtering.
    return (df.filter(col('game_date').between(date_from, date_to))
              .groupBy('player_1', 'player_2')
              .agg(count("*").alias("no_of_plays"))
              .filter(col('no_of_plays') > number_of_plays)
           )

And you can call it directly with, say, myfn(df, 10, ...).
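
For completeness, here is a minimal end-to-end sketch that reuses the myfn defined above. It assumes a local SparkSession and a toy DataFrame with the player_1, player_2, and game_date columns from the question; the rows are invented purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Toy rows (player_1, player_2, game_date), invented for this example.
data = [
    ('alice', 'bob', '2020-01-10'),
    ('alice', 'bob', '2020-02-15'),
    ('carol', 'dave', '2020-02-20'),
]
df = spark.createDataFrame(data, ['player_1', 'player_2', 'game_date'])

# Pairs with more than 1 play between the two dates; string comparison
# works here because the dates are in ISO yyyy-MM-dd format.
myfn(df, 1, '2020-01-05', '2020-03-10').show()

This prints the ('alice', 'bob') pair with no_of_plays = 2, since it is the only pair with more than one play in the date range.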
