
How can I define user-defined aggregate functions in PySpark?

I want to write a user-defined aggregate function (UDAF) in PySpark. I found some documentation for Scala and would like to achieve something similar in Python.

To be more specific, assume I already have a function like this implemented:

import pyspark.sql

def process_data(df: pyspark.sql.DataFrame) -> bytes:
    ...  # do something very complicated here

and now I would like to be able to do something like:

source_df.groupBy("Foo_ID").agg(UDAF(process_data))

Now the question is: what should I put in place of UDAF?
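For reference, the closest thing I have found so far is a grouped-aggregate pandas UDF. The sketch below only illustrates the shape of call I'm after; the column name `value`, the `double` return type, and the toy sum are placeholders, and my real `process_data` would have to be rewritten to operate on pandas data rather than a Spark DataFrame:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real aggregation logic: receives all "value"
# entries of one group as a pandas Series and returns a single scalar.
@F.pandas_udf("double")
def process_group(values: pd.Series) -> float:
    return float(values.sum())

source_df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)],
    ["Foo_ID", "value"],
)

# Used like a built-in aggregate inside groupBy(...).agg(...)
source_df.groupBy("Foo_ID").agg(process_group(F.col("value"))).show()

If the aggregation really needs the whole group as a DataFrame rather than individual columns, `groupBy(...).applyInPandas(func, schema)` looks like the alternative, where `func` takes and returns a pandas DataFrame, but that does not have the `agg(UDAF(...))` shape shown above.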
