Pass case class to Spark UDF

I have a Scala 2.11 function that creates a case class instance from a Map, based on the provided class type.

import scala.reflect.runtime.universe._

def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
  val rMirror = runtimeMirror(getClass.getClassLoader)
  val myClass = typeOf[T].typeSymbol.asClass
  val cMirror = rMirror.reflectClass(myClass)

  // The primary constructor is the first one
  val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
  // Look up each constructor parameter by name in the map
  val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))

  cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}

I'm trying to use this as a UDF on a Spark DataFrame. However, I'm not sure what the best way to pass the case class is. The approach below doesn't seem to work.

def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
  createCaseClass[T](inMap)
}

I'm looking for something like this:

case class MyType(c1: String, c2: Long)

val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}

Thoughts and suggestions on how to resolve this are appreciated.

"However, I'm not sure what's the best way to pass the case class"

It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.

If you want to operate on statically typed objects, use a statically typed Dataset instead.
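As a rough illustration of the typed Dataset route, here is a minimal sketch. It assumes a local SparkSession and a DataFrame whose columns already match the case class fields; the object name TypedDatasetSketch, the method incremented, and the sample rows are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case class mirroring the one in the question. It is defined at
// the top level (not inside a method) so that Spark can derive an encoder for it.
case class MyType(c1: String, c2: Long)

object TypedDatasetSketch {
  // Builds a small DataFrame, views it as Dataset[MyType], and transforms it
  // with an ordinary Scala function instead of a UDF.
  def incremented(spark: SparkSession): Seq[MyType] = {
    import spark.implicits._

    // Sample data (an assumption for this sketch); column names match MyType's fields.
    val df = Seq(("a", 1L), ("b", 2L)).toDF("c1", "c2")

    // Convert the untyped DataFrame into a statically typed Dataset.
    val ds: Dataset[MyType] = df.as[MyType]

    // The lambda receives MyType instances, not Row objects.
    ds.map(m => m.copy(c2 = m.c2 + 1)).collect().toSeq
  }
}
```

With this approach the function passed to map works on MyType values directly, so no reflection-based conversion is needed inside a UDF.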

From trial and error I learned that whatever data structure is stored in a DataFrame or Dataset is represented using the types in org.apache.spark.sql.types.

You can see this with:

df.schema.toString

Basic types such as Int and Double are stored like this:

StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)

Complex types such as case classes are transformed into a combination of nested types:

StructType(StructField(..),StructField(..),StructType(..))

Sample code:

case class range(min:Double,max:Double)
org.apache.spark.sql.Encoders.product[range].schema

//Output:
 org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))

In these cases the UDF parameter type is Row, or Seq[Row] when the column stores an array of case classes.
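To make the Row-typed parameter concrete, here is a minimal sketch of a UDF that receives a struct column as a Row and rebuilds a case class from it by hand. The names RowUdfSketch, Bounds, width, widths, and the column name rng are all hypothetical, and the sample data is an assumption made for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

object RowUdfSketch {
  // Hypothetical case class with the same fields as the `range` example above.
  case class Bounds(min: Double, max: Double)

  // The struct column arrives as a Row, so the fields are read by name
  // and the case class is reconstructed manually inside the UDF.
  val width = udf { (r: Row) =>
    val b = Bounds(r.getAs[Double]("min"), r.getAs[Double]("max"))
    b.max - b.min
  }

  def widths(spark: SparkSession): Seq[Double] = {
    import spark.implicits._

    // Sample data (an assumption for this sketch), packed into one struct column "rng".
    val df = Seq((1.0, 4.0), (2.0, 2.5)).toDF("min", "max")
      .select(struct(col("min"), col("max")).as("rng"))

    // Apply the UDF to the struct column and collect the results.
    df.select(width(col("rng"))).as[Double].collect().toSeq
  }
}
```

For an array-of-structs column, the same idea would apply with a Seq[Row] parameter and a map over its elements.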

A basic debugging technique is to print the schema as a string:

val myUdf = udf((r: Row) => r.schema.toString)

Then, to see what happens:

df.take(1).foreach(println)
