How to use a Scala function inside PySpark?

I've been searching for a while for a way to use a Scala function in PySpark, and I haven't found any documentation or guide on the subject.

My goal is to reuse the Scala implicit function appendPrvlngFields, which was defined by others before me. I want to call this function from the Python environment without redefining it, for example by somehow registering the Scala function.

Let's say I create a simple object in Scala that uses a user-defined library, something like:

%scala 
package com.Example
import org.library.DataFrameImplicits._
import org.apache.spark.sql.DataFrame


object ScalaFunction {

  def appendPrvlngFields(df: DataFrame,
                         otherDF: DataFrame,
                         colsToAppend: List[String] = List[String](),
                         mapColName: Option[String] = None,
                         partitionByCols: List[String],
                         sort: List[String],
                         sortBfirst: Boolean = false,
                         subsequent: Boolean = false,
                         existingPartitionsOnly: Boolean = false,
                         otherDFPrefix: String = "prvlg",
                         enforceLowerCase: Boolean = false): DataFrame =
    df.appendPrvlngFields(otherDF,
                          colsToAppend,
                          mapColName,
                          partitionByCols,
                          sort,
                          sortBfirst,
                          subsequent,
                          existingPartitionsOnly,
                          otherDFPrefix,
                          enforceLowerCase)
}

Then, in the Python environment, I call appendPrvlngFields by defining this wrapper function:

from pyspark.sql import DataFrame

def pyAppendPrvlngFields(df: DataFrame,
                         otherDF: DataFrame,
                         colsToAppend: list,
                         partitionByCols: list,
                         sort: list,
                         mapColName = None,
                         sortBfirst = False,
                         subsequent = False,
                         existingPartitionsOnly = False,
                         otherDFPrefix = "prvlg",
                         enforceLowerCase = False) -> DataFrame:

    return DataFrame(sc._jvm.com.Example.ScalaFunction.appendPrvlngFields(df._jdf,
                                                                          otherDF._jdf,
                                                                          colsToAppend,
                                                                          mapColName,
                                                                          partitionByCols,
                                                                          sort,
                                                                          sortBfirst,
                                                                          subsequent),
                     sqlContext)

I know I need to convert df to df._jdf, but how can I convert the list, string, Option, and Boolean arguments to the corresponding Java/Scala types?

First, move the Scala code into its own project and build a jar from it.
Next, pass that jar to PySpark.
Here's how:

# create the jar using SBT
sbt clean assembly

# Pass the jar to the PySpark session
pyspark --jars [path/to/jar/x.jar]
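
With the jar on the classpath, the Scala object is reachable from Python through the py4j gateway as sc._jvm.com.Example.ScalaFunction. What remains is the argument conversion the question asks about. The following is a minimal sketch of one way to do it, assuming a Spark 3.x session and the com.Example.ScalaFunction object defined in the question; the helper names (_to_scala_list, _to_scala_option) are illustrative, and the exact gateway calls may need adjusting for your Spark and Scala versions.

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext


def _to_scala_list(py_list):
    # Spark's py4j gateway auto-converts a Python list into a java.util.List;
    # PythonUtils.toSeq turns that into a Scala Seq, and .toList() into the
    # scala.collection.immutable.List the Scala signature expects.
    return sc._jvm.PythonUtils.toSeq(py_list).toList()


def _to_scala_option(value):
    # scala.Option.apply(null) yields None, so a Python None maps cleanly.
    return sc._jvm.scala.Option.apply(value)


def pyAppendPrvlngFields(df, otherDF, colsToAppend, partitionByCols, sort,
                         mapColName=None, sortBfirst=False, subsequent=False,
                         existingPartitionsOnly=False, otherDFPrefix="prvlg",
                         enforceLowerCase=False) -> DataFrame:
    jdf = sc._jvm.com.Example.ScalaFunction.appendPrvlngFields(
        df._jdf,
        otherDF._jdf,
        _to_scala_list(colsToAppend),
        _to_scala_option(mapColName),
        _to_scala_list(partitionByCols),
        _to_scala_list(sort),
        sortBfirst,                # Python bool -> Scala Boolean, handled by py4j
        subsequent,
        existingPartitionsOnly,
        otherDFPrefix,             # Python str -> java.lang.String, handled by py4j
        enforceLowerCase)
    # Spark 3.3+ accepts a SparkSession here; older versions expect the SQLContext.
    return DataFrame(jdf, spark)

Note that py4j cannot use the Scala default arguments, so every parameter has to be passed explicitly and in the exact order of the Scala signature; plain strings and booleans need no manual conversion.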

refs: https://medium.com/wbaa/using-scala-udfs-in-pyspark-b70033dd69b9
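
If the call fails because the gateway cannot find the class, a quick sanity check (reusing the names from the question) is to look at what sc._jvm resolves: a py4j JavaClass means the jar was picked up, while a bare JavaPackage placeholder usually means the jar never made it onto the driver's classpath.

# Resolves to a py4j JavaClass if com.Example.ScalaFunction is on the classpath,
# and to a generic JavaPackage placeholder if it is not.
print(sc._jvm.com.Example.ScalaFunction)

# Shows which extra jars the session was started with (empty if none were set).
print(spark.conf.get("spark.jars", ""))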
