How to use a Scala function inside PySpark?

I've been searching for a while for a way to use a Scala function in PySpark, and I haven't found any documentation or guide on the subject.

My goal is to reuse the Scala implicit function appendPrvlngFields, which was defined by others before me. I want to call this function from the Python environment without redefining it, for example by somehow registering the Scala function.

Let's say I create a simple object in Scala that uses a user-defined library, something like:

%scala 
package com.Example
import org.library.DataFrameImplicits._
import org.apache.spark.sql.DataFrame


object ScalaFunction {

  def appendPrvlngFields(df: DataFrame,
                         otherDF: DataFrame,
                         colsToAppend: List[String] = List[String](),
                         mapColName: Option[String] = None,
                         partitionByCols: List[String],
                         sort: List[String],
                         sortBfirst: Boolean = false,
                         subsequent: Boolean = false,
                         existingPartitionsOnly: Boolean = false,
                         otherDFPrefix: String = "prvlg",
                         enforceLowerCase: Boolean = false): DataFrame =
    df.appendPrvlngFields(otherDF,
                          colsToAppend,
                          mapColName,
                          partitionByCols,
                          sort,
                          sortBfirst,
                          subsequent,
                          existingPartitionsOnly,
                          otherDFPrefix,
                          enforceLowerCase)
}

Then, in the Python environment, I call appendPrvlngFields by defining this wrapper function:

from pyspark.sql import DataFrame

def pyAppendPrvlngFields(df: DataFrame,
                         otherDF: DataFrame,
                         colsToAppend: list,
                         partitionByCols: list,
                         sort: list,
                         mapColName = None,
                         sortBfirst = False,
                         subsequent = False,
                         existingPartitionsOnly = False,
                         otherDFPrefix = "prvlg",
                         enforceLowerCase = False) -> DataFrame:

    return DataFrame(sc._jvm.com.Example.ScalaFunction.appendPrvlngFields(df._jdf,
                                                                          otherDF._jdf,
                                                                          colsToAppend,
                                                                          mapColName,
                                                                          partitionByCols,
                                                                          sort,
                                                                          sortBfirst,
                                                                          subsequent),
                     sqlContext)

I know I need to convert df to df._jdf, but how can I convert the list, string, Option, and Boolean arguments to the corresponding Java/Scala types?

First, move the Scala code into its own project and build a jar from it.
Next, pass that jar to PySpark.
Here's how:

# create the jar using SBT
sbt clean assembly

# Pass the jar to the PySpark session
pyspark --jars [path/to/jar/x.jar]
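
With the jar on the classpath, the Scala object is reachable from Python through the py4j gateway as sc._jvm.com.Example.ScalaFunction. What remains is the argument conversion the question asks about. The following is a minimal sketch of one way to do it, assuming a Spark 3.x session and the com.Example.ScalaFunction object defined in the question; the helper names (_to_scala_list, _to_scala_option) are illustrative, and the exact gateway calls may need adjusting for your Spark and Scala versions.

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext


def _to_scala_list(py_list):
    # Spark's py4j gateway auto-converts a Python list into a java.util.List;
    # PythonUtils.toSeq turns that into a Scala Seq, and .toList() into the
    # scala.collection.immutable.List the Scala signature expects.
    return sc._jvm.PythonUtils.toSeq(py_list).toList()


def _to_scala_option(value):
    # scala.Option.apply(null) yields None, so a Python None maps cleanly.
    return sc._jvm.scala.Option.apply(value)


def pyAppendPrvlngFields(df, otherDF, colsToAppend, partitionByCols, sort,
                         mapColName=None, sortBfirst=False, subsequent=False,
                         existingPartitionsOnly=False, otherDFPrefix="prvlg",
                         enforceLowerCase=False) -> DataFrame:
    jdf = sc._jvm.com.Example.ScalaFunction.appendPrvlngFields(
        df._jdf,
        otherDF._jdf,
        _to_scala_list(colsToAppend),
        _to_scala_option(mapColName),
        _to_scala_list(partitionByCols),
        _to_scala_list(sort),
        sortBfirst,                # Python bool -> Scala Boolean, handled by py4j
        subsequent,
        existingPartitionsOnly,
        otherDFPrefix,             # Python str -> java.lang.String, handled by py4j
        enforceLowerCase)
    # Spark 3.3+ accepts a SparkSession here; older versions expect the SQLContext.
    return DataFrame(jdf, spark)

Note that py4j cannot use the Scala default arguments, so every parameter has to be passed explicitly and in the exact order of the Scala signature; plain strings and booleans need no manual conversion.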

refs: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9
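
If the call fails because the gateway cannot find the class, a quick sanity check (reusing the names from the question) is to look at what sc._jvm resolves: a py4j JavaClass means the jar was picked up, while a bare JavaPackage placeholder usually means the jar never made it onto the driver's classpath.

# Resolves to a py4j JavaClass if com.Example.ScalaFunction is on the classpath,
# and to a generic JavaPackage placeholder if it is not.
print(sc._jvm.com.Example.ScalaFunction)

# Shows which extra jars the session was started with (empty if none were set).
print(spark.conf.get("spark.jars", ""))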
