
PySpark dataframe with UDF

I have the code below for a PySpark DataFrame, which does some calculations and aggregations using the following variables passed to it:

number_of_plays = 8
date_from = '2020-01-05'
date_to = '2020-03-10'

from pyspark.sql.functions import col, count

df_1 = df.groupBy('player_1', 'player_2').agg(count("*").alias("no_of_plays")).filter(col('no_of_plays') > number_of_plays).filter(col('game_date').between(date_from, date_to))
df_1.show()

Now I want to wrap this in a Spark UDF where I can pass the three variables number_of_plays, date_from, and date_to as parameters, so the function should look like this:

def myfn(number_of_plays, date_from, date_to):
    # do the aggregation here and return the result

so that I can call it from my code.

Any ideas on how to do this in Python 3?

No UDF is necessary here: Spark UDFs operate row by row, while this is a whole-DataFrame transformation. A plain Python function that takes the DataFrame along with the three parameters will do the job:

from pyspark.sql.functions import col, count

def myfn(df, number_of_plays, date_from, date_to):
    # Filter on game_date first: after groupBy().agg(), only the grouping
    # columns and the aggregated column remain, so game_date would no
    # longer be available for filtering.
    return (df.filter(col('game_date').between(date_from, date_to))
              .groupBy('player_1', 'player_2')
              .agg(count("*").alias("no_of_plays"))
              .filter(col('no_of_plays') > number_of_plays)
           )

And you can call it directly with, say, myfn(df, 10, ...).
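
For completeness, here is a minimal end-to-end sketch that reuses the myfn defined above. It assumes a local SparkSession and a toy DataFrame with the player_1, player_2, and game_date columns from the question; the rows are invented purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Toy rows (player_1, player_2, game_date), invented for this example.
data = [
    ('alice', 'bob', '2020-01-10'),
    ('alice', 'bob', '2020-02-15'),
    ('carol', 'dave', '2020-02-20'),
]
df = spark.createDataFrame(data, ['player_1', 'player_2', 'game_date'])

# Pairs with more than 1 play between the two dates; string comparison
# works here because the dates are in ISO yyyy-MM-dd format.
myfn(df, 1, '2020-01-05', '2020-03-10').show()

This prints the ('alice', 'bob') pair with no_of_plays = 2, since it is the only pair with more than one play in the date range.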
