I have the code below for a PySpark DataFrame that does some calculations and aggregations, driven by the following variables:
from pyspark.sql.functions import col, count

number_of_plays = 8
date_from = '2020-01-05'
date_to = '2020-03-10'

df_1 = (df.filter(col('game_date').between(date_from, date_to))
          .groupBy('player_1', 'player_2')
          .agg(count('*').alias('no_of_plays'))
          .filter(col('no_of_plays') > number_of_plays))
df_1.show()
Now I want to wrap this in a function where I can pass the three variables number_of_plays, date_from, and date_to as parameters, so it should look like:

def myfn(number_of_plays, date_from, date_to):
    # do the aggregation here and return the result

to be reused in my code. Any ideas how to do it using Python 3?
No UDF is necessary here - a plain Python function that takes the DataFrame and the parameters will do the job. Note that the date filter has to run before the aggregation, because groupBy drops the game_date column:
from pyspark.sql.functions import col, count

def myfn(df, number_of_plays, date_from, date_to):
    # filter to the date window first, then count plays per player pair
    return (df.filter(col('game_date').between(date_from, date_to))
              .groupBy('player_1', 'player_2')
              .agg(count('*').alias('no_of_plays'))
              .filter(col('no_of_plays') > number_of_plays))
And you can call it directly using, say, myfn(df, 10, ...)