
Pass case class to Spark UDF

I have a Scala 2.11 function which creates a case class from a Map, based on the provided class type.

import scala.reflect.runtime.universe._

def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {

    val rMirror = runtimeMirror(getClass.getClassLoader)
    val myClass = typeOf[T].typeSymbol.asClass
    val cMirror = rMirror.reflectClass(myClass)

    // The primary constructor is the first one
    val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
    val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))

    cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
  }
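For reference, a minimal sketch of how the function above would be called; the `Person` case class and the map contents here are illustrative, not from the question:

```scala
import scala.reflect.runtime.universe._

// Same reflection-based factory as above
def createCaseClass[T: TypeTag, A](someMap: Map[String, A]): T = {
  val rMirror = runtimeMirror(getClass.getClassLoader)
  val myClass = typeOf[T].typeSymbol.asClass
  val cMirror = rMirror.reflectClass(myClass)
  // The primary constructor is the first alternative
  val ctor = typeOf[T].decl(termNames.CONSTRUCTOR).asTerm.alternatives.head.asMethod
  // Look up each constructor parameter by name in the map
  val argList = ctor.paramLists.flatten.map(param => someMap(param.name.toString))
  cMirror.reflectConstructor(ctor)(argList: _*).asInstanceOf[T]
}

case class Person(name: String, age: Long) // hypothetical example type

val p = createCaseClass[Person, Any](Map("name" -> "Ada", "age" -> 36L))
```

Note the map values must already have the runtime types the constructor expects; a missing key or a type mismatch surfaces only at runtime.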

I'm trying to use this in the context of a Spark DataFrame as a UDF. However, I'm not sure what's the best way to pass the case class. The approach below doesn't seem to work.

def myUDF[T: TypeTag] = udf { (inMap: Map[String, Long]) =>
    createCaseClass[T](inMap)
  }

I'm looking for something like this:

case class MyType(c1: String, c2: Long)

val myUDF = udf{(MyType, inMap) => createCaseClass[MyType](inMap)}

Thoughts and suggestions on how to resolve this are appreciated.

However, I'm not sure what's the best way to pass the case class

It is not possible to use case classes as arguments for user defined functions. SQL StructTypes are mapped to dynamically typed (for lack of a better word) Row objects.

If you want to operate on statically typed objects, please use the statically typed Dataset.
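A sketch of that suggestion: with a typed Dataset you keep the case class all the way through and need neither a UDF nor reflection. The SparkSession setup, the MyType field transformation, and the sample data are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

case class MyType(c1: String, c2: Long)

val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

// A typed Dataset preserves the static type of each element
val ds = Seq(MyType("a", 1L), MyType("b", 2L)).toDS()

// map operates directly on MyType instances, no Row unpacking needed
val upper = ds.map(t => t.copy(c1 = t.c1.toUpperCase))
```

The trade-off is that `map` on a Dataset deserializes each row into a JVM object, which loses some Catalyst optimizations compared to column expressions.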

From trial and error I learned that whatever data structure is stored in a DataFrame or Dataset is represented using org.apache.spark.sql.types.

You can see this with:

df.schema.toString

Basic types like Int and Double are stored like:

StructField(fieldname,IntegerType,true),StructField(fieldname,DoubleType,true)

Complex types like case classes are transformed to a combination of nested types:

StructType(StructField(..),StructField(..),StructType(..))

Sample code:

case class range(min: Double, max: Double)
org.apache.spark.sql.Encoders.product[range].schema

// Output:
// org.apache.spark.sql.types.StructType = StructType(StructField(min,DoubleType,false), StructField(max,DoubleType,false))

In this case the UDF parameter type is Row, or Seq[Row] when you store an array of case classes.

A basic debugging technique is to print the schema as a string:

val myUdf = udf((r: Row) => r.schema.toString)

Then, to see what happens:

df.take(1).foreach(println)
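Putting this together, here is a sketch of a UDF that receives a struct column as a Row and rebuilds the case class by reading its fields by name. The column names, the sample data, and the use of `struct` to pack the columns are illustrative assumptions:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}

case class MyType(c1: String, c2: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Struct columns arrive in a UDF as Row; read each field with getAs and
// rebuild the case class manually (returning its toString for display)
val rebuild = udf((r: Row) => MyType(r.getAs[String]("c1"), r.getAs[Long]("c2")).toString)

val df  = Seq(("a", 1L)).toDF("c1", "c2")
val out = df.select(rebuild(struct(col("c1"), col("c2"))).as("rebuilt"))
```

This is the dynamically typed path the answer describes: the mapping from Row fields to constructor arguments is written by hand rather than recovered via reflection.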
