Why is the spark.implicits._ import not helping with encoder derivation inside a method?
So, importing implicit members from a created instance works as expected:
object Test extends App {
  class Bag {
    implicit val ssss: String = "omg"
  }
  def call(): Unit = {
    val bag = new Bag
    import bag._
    val s = implicitly[String]
    println(s)
  }
  call()
}
However, if I try to do the same with spark.implicits._:
object Test extends App {
  val spark: SparkSession = ...
  def call(): Unit = {
    import spark.implicits._
    case class Person(id: Long, name: String)
    // I can summon an existing encoder
    // val enc = implicitly[Encoder[Long]]
    // but encoder derivation is failing for some reason
    // val encP = implicitly[Encoder[Person]]
    val df: Dataset[Person] =
      spark.range(10).map(i => Person(i, i.toString))
    df.show()
  }
}
it fails to derive Encoder[Person]:
Unable to find encoder for type Person. An implicit Encoder[Person] is needed to store Person instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
.map(i => Person(i, i.toString)
However, if I create the dataframe outside of the method, it works:
object Test extends App {
  val spark: SparkSession = ...
  import spark.implicits._
  case class Person(id: Long, name: String)
  val df: Dataset[Person] =
    spark.range(10).map(i => Person(i, i.toString))
  df.show()
}
Tested with Scala versions 2.13.10 and 2.12.17 and Spark version 3.3.1.
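The local case class is what causes this behavior. Local classes have so-called free types; you can read more about them here. You can try adding a TypeTag for Person in the local scope and see whether it helps.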
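Since the question already observes that derivation succeeds when Person is not local, one way to sidestep the problem is to declare the case class outside the method and keep only the Dataset code inside it. The following is a minimal sketch under that assumption; the names TopLevelPersonDemo and build, and the local[*] master, are illustrative choices and not part of the original post.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Non-local case class: the compiler can materialize a TypeTag for it,
// so spark.implicits._ can derive Encoder[Person].
case class Person(id: Long, name: String)

object TopLevelPersonDemo {
  // The import and the map call stay inside a method; only Person moved out.
  def build(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    spark.range(10).map(i => Person(i, i.toString))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    try build(spark).show()
    finally spark.stop()
  }
}
```

This does not explain why the local class fails; it only avoids the situation in which a TypeTag cannot be materialized.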
As you found out yourself, the local Person has no TypeTag. But it does have a WeakTypeTag (and a ClassTag). Let's try to define an Encoder for such a class.
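As a quick check of the claim that a local class still has a ClassTag, the sketch below uses only the standard library (no Spark, no scala-reflect). The object and method names are hypothetical and exist only for this illustration.

```scala
import scala.reflect.classTag

object LocalClassTagDemo {
  // JVM name of a class that is local to this method.
  def localClassName(): String = {
    case class Person(id: Long, name: String) // local class, as in the question
    classTag[Person].runtimeClass.getName
  }

  // True when the ClassTag's runtime class is the class of an actual instance.
  def tagMatchesInstance(): Boolean = {
    case class Person(id: Long, name: String)
    classTag[Person].runtimeClass == Person(1L, "a").getClass
  }
}
```

The ClassTag carries only the erased runtime class, which is why the later attempts have to rebuild a full reflection Type from it.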
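Naive approaches to constructing a TypeTag don't work:
In Scala 2.12, why none of the TypeTag created in runtime is serializable?
Scala Spark Encoders.product[X] (where X is a case class) keeps giving me "No TypeTag available for X" error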
implicit def ttag[A: WeakTypeTag]: TypeTag[A] = {
  val ttag = null // hiding implicit by name
  val wttagImpl = weakTypeTag[A].asInstanceOf[WeakTypeTag[A] {val mirror: Mirror; val tpec: TypeCreator}]
  TypeTag[A](wttagImpl.mirror, wttagImpl.tpec)
}
java.lang.NoClassDefFoundError: no Java class corresponding to Person found
https://gist.github.com/DmytroMitin/41b7439d2e504e37f29b02e3500d24b1
A similar result is produced by
def typeToTypeTag[T](
  tpe: Type,
  mirror: api.Mirror[universe.type]
): TypeTag[T] = {
  TypeTag(mirror, new TypeCreator {
    def apply[U <: api.Universe with Singleton](m: api.Mirror[U]) = {
      assert(m eq mirror, s"TypeTag[$tpe] defined in $mirror cannot be migrated to $m.")
      tpe.asInstanceOf[U#Type]
    }
  })
}
implicit def ttag[T: WeakTypeTag]: TypeTag[T] = {
  val ttag = null
  typeToTypeTag(weakTypeOf[T], mirror)
}
java.lang.NoClassDefFoundError: no Java class corresponding to Person found
https://gist.github.com/DmytroMitin/c7a24abf1ff1011a1c87aa9d161d6395
implicit val personTtag: TypeTag[Person] = {
  val personTtag = null
  tb.eval(q"org.apache.spark.sql.catalyst.ScalaReflection.universe.typeTag[${weakTypeOf[Person]}]")
    .asInstanceOf[TypeTag[Person]]
}
scala.tools.reflect.ToolBoxError: reflective toolbox failed due to unresolved free type variables
https://gist.github.com/DmytroMitin/6e35c0332f845fcd227d35ec49d4122f
This is how Encoder[T] is defined for a T that has a TypeTag:
implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] = Encoders.product[T]
object Encoders {
  def product[T <: Product : TypeTag]: Encoder[T] = ExpressionEncoder()
}
object ExpressionEncoder {
  def apply[T : TypeTag](): ExpressionEncoder[T] = {
    val mirror = ScalaReflection.mirror
    val tpe = typeTag[T].in(mirror).tpe
    val cls = mirror.runtimeClass(tpe)
    val serializer = ScalaReflection.serializerForType(tpe)
    val deserializer = ScalaReflection.deserializerForType(tpe)
    new ExpressionEncoder[T](
      serializer,
      deserializer,
      ClassTag[T](cls)
    )
  }
}
Let's try to modify it for a T that has a WeakTypeTag and a ClassTag:
implicit def apply[T: WeakTypeTag /*: ClassTag*/]: Encoder[T] = {
  val tpe = weakTypeTag[T].in(mirror).tpe
  val cls = mirror.runtimeClass(tpe)
  val serializer = ScalaReflection.serializerForType(tpe)
  val deserializer = ScalaReflection.deserializerForType(tpe)
  new ExpressionEncoder[T](
    serializer,
    deserializer,
    ClassTag[T](cls)
  )
}
java.lang.NoClassDefFoundError: no Java class corresponding to Person found
https://gist.github.com/DmytroMitin/b58848fa6575b6fab0e9b8285095cc60
// (*)
implicit def apply[T/*: WeakTypeTag*/ : ClassTag]: Encoder[T] = {
  val tpe = mirror.classSymbol(classTag[T].runtimeClass).toType
  val serializer = ScalaReflection.serializerForType(tpe)
  val deserializer = ScalaReflection.deserializerForType(tpe)
  new ExpressionEncoder[T](
    serializer,
    deserializer,
    classTag[T]
  )
}
org.apache.spark.SparkException: Task not serializable
Caused by: java.io.NotSerializableException: Main
https://gist.github.com/DmytroMitin/0c86933f96e136d44fff555295ce01dd
So finally, let's make Main extend Serializable:
+---+----+
| id|name|
+---+----+
|  0|   0|
|  1|   1|
|  2|   2|
|  3|   3|
|  4|   4|
|  5|   5|
|  6|   6|
|  7|   7|
|  8|   8|
|  9|   9|
+---+----+
https://gist.github.com/DmytroMitin/0e9b0bd2ed6237a4a1e1c40d620a9d88
So (*) is the correct Encoder.
This doesn't seem to work for a generic Person:
case class Person[T](id: Long, name: String, t: T)
java.lang.UnsupportedOperationException: No Encoder found for Person$1
https://gist.github.com/DmytroMitin/69496ce257fc9a3a7a5fbd004c52dcc0
scala.ScalaReflectionException: free type Person is not a class
https://gist.github.com/DmytroMitin/07bfe954dca677f0a39c06779b94280e
For a generic local class, the encoder should be (using both WeakTypeTag and ClassTag):
implicit def apply[T: WeakTypeTag : ClassTag]: Encoder[T] = {
  val tpe0 = weakTypeTag[T].in(mirror).tpe
  val typeArgs = tpe0/*.dealias*/.typeArgs
  val tpe = mirror.classSymbol(classTag[T].runtimeClass).toType
  val tpe1 = appliedType(tpe.typeConstructor, typeArgs)
  val serializer = ScalaReflection.serializerForType(tpe1)
  val deserializer = ScalaReflection.deserializerForType(tpe1)
  new ExpressionEncoder[T](
    serializer,
    deserializer,
    classTag[T]
  )
}
https://gist.github.com/DmytroMitin/08c8f21ffb1427bfa15dd21fbdfb77fa
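The key step in the encoder above is taking the type constructor from the runtime class and the type arguments from the statically known type, then gluing them together with appliedType. The sketch below shows that step in isolation on an ordinary, non-local class (Option), using only scala-reflect for Scala 2; the object name AppliedTypeDemo and its members are hypothetical.

```scala
import scala.reflect.runtime.universe._

object AppliedTypeDemo {
  private val mirror = runtimeMirror(getClass.getClassLoader)

  // Type constructor recovered from a runtime Class, as the encoder does via the ClassTag.
  val tycon: Type = mirror.classSymbol(classOf[Option[_]]).toType.typeConstructor

  // Type arguments taken from a statically known type, standing in for the WeakTypeTag.
  val typeArgs: List[Type] = typeOf[Option[Int]].typeArgs

  // Reassemble the fully applied type from the two halves.
  val rebuilt: Type = appliedType(tycon, typeArgs)

  // Check that the reassembled type is equivalent to Option[Int].
  val matchesOptionInt: Boolean = rebuilt =:= typeOf[Option[Int]]
}
```

For a local class, the only difference is that the type arguments come from a free type, which is where the nested case discussed next runs into trouble.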
Well, now this doesn't work for a generic local class whose type argument is itself a generic local class:
val df: Dataset[Person[Person[Int]]] =
  spark.range(10).map(i => Person(i, i.toString, Person(i, i.toString, i.toInt)))
scala.ScalaReflectionException: free type Person is not a class
https://gist.github.com/DmytroMitin/5bceb2b81f2391c5c312a045edb827a8
An improved version of the encoder:
case class Application(tycon: ClassTag[_], targs: List[Application])
class DeepClassTag[T](val classTags: Application)
object DeepClassTag {
  def apply[T: DeepClassTag]: DeepClassTag[T] = implicitly[DeepClassTag[T]]
  implicit def deepClassTag0[A: ClassTag]: DeepClassTag[A] =
    new DeepClassTag(Application(classTag[A], List()))
  implicit def deepClassTag11[A[_], B1](implicit tycon: ClassTag[A[_]], dct1: DeepClassTag[B1]): DeepClassTag[A[B1]] =
    new DeepClassTag(Application(tycon, List(dct1.classTags)))
  // dct2 must describe the second type argument, B2
  implicit def deepClassTag12[A[_,_], B1, B2](implicit tycon: ClassTag[A[_,_]], dct1: DeepClassTag[B1], dct2: DeepClassTag[B2]): DeepClassTag[A[B1, B2]] =
    new DeepClassTag(Application(tycon, List(dct1.classTags, dct2.classTags)))
  // ...
  implicit def deepClassTag2[A[_[_]], B1[_]](implicit tycon: ClassTag[A[B1]], dct1: DeepClassTag[B1[_]]): DeepClassTag[A[B1]] =
    new DeepClassTag(Application(tycon, List(dct1.classTags)))
  // ...
}
def improveStaticType[T: WeakTypeTag : DeepClassTag]: Type =
  improveDynamicType(weakTypeOf[T], DeepClassTag[T].classTags)
def improveDynamicType(tpe: Type, classTags: Application): Type = {
  val newTycon = improveFreeType(tpe, classTags.tycon.runtimeClass)
  val targs = tpe.dealias.typeArgs
  assert(targs.length == classTags.targs.length, s"( $targs ).length == ( ${classTags.targs} ).length")
  val newArgs = targs.zip(classTags.targs).map((improveDynamicType _).tupled)
  appliedType(newTycon, newArgs)
}
def improveFreeType(tpe: Type, cls: Class[_]): Type =
  if (internal.isFreeType(tpe.typeSymbol)) {
    val typeArgs = tpe.dealias.typeArgs
    val typeConstructor = mirror.classSymbol(cls).toType.typeConstructor
    appliedType(typeConstructor, typeArgs)
  } else tpe
implicit def enc[T: WeakTypeTag : ClassTag : DeepClassTag]: Encoder[T] = {
  val tpe = improveStaticType[T]
  val serializer = ScalaReflection.serializerForType(tpe)
  val deserializer = ScalaReflection.deserializerForType(tpe)
  new ExpressionEncoder[T](
    serializer,
    deserializer,
    classTag[T]
  )
}
https://gist.github.com/DmytroMitin/56044515e031fcf1e977ab213013861d
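To make the role of Application more concrete: it is a tree that pairs the runtime class of a type constructor with the trees of its type arguments, so that every level of a nested type has a class available. The sketch below builds such a tree by hand and prints it, using only the standard library; the enclosing object, the render helper, and the example value are hypothetical and not part of the original answer.

```scala
import scala.reflect.{ClassTag, classTag}

object ClassTagTreeDemo {
  // Same shape as Application above: a class plus the trees of its type arguments.
  final case class Application(tycon: ClassTag[_], targs: List[Application])

  // Renders a tree as "outer[inner, ...]" using JVM class names.
  def render(app: Application): String = {
    val name = app.tycon.runtimeClass.getName
    if (app.targs.isEmpty) name
    else app.targs.map(render).mkString(s"$name[", ", ", "]")
  }

  // Hand-built tree for Option[List[String]]; in the answer, implicits derive such trees.
  val example: Application =
    Application(classTag[Option[Any]], List(
      Application(classTag[List[Any]], List(
        Application(classTag[String], Nil)))))
}
```

Under these assumptions, ClassTagTreeDemo.render(ClassTagTreeDemo.example) yields scala.Option[scala.collection.immutable.List[java.lang.String]].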
DeepClassTag doesn't seem to work for higher-kinded classes:
https://gist.github.com/DmytroMitin/6388a437507e8389f30230e08382d9ff
An improved version, which still doesn't work properly (there are too many shapes of type constructors):
https://gist.github.com/DmytroMitin/2625ee20695404c6fc118ab8680808f2
Instead of manually defining type class instances for each shape of type constructor, the type class DeepClassTag can be defined with a macro, as follows:
import scala.language.experimental.macros
import scala.reflect.ClassTag
import scala.reflect.macros.whitebox
case class Application(tycon: ClassTag[_], targs: List[Application])
class DeepClassTag[T](val classTags: Application)
object DeepClassTag {
  def apply[T: DeepClassTag]: DeepClassTag[T] = implicitly[DeepClassTag[T]]
  implicit def mkDeepClassTag[T]/*(implicit tCtag: ClassTag[T])*/: DeepClassTag[T] =
    macro DeepClassTagMacros.mkDeepClassTagImpl[T]
}
class DeepClassTagMacros(val c: whitebox.Context) {
  import c.universe._
  def findInstance[TC[_]](tpe: Type)(implicit wttag: WeakTypeTag[TC[_]]): Tree =
    c.inferImplicitValue(
      appliedType(weakTypeOf[TC[_]].typeConstructor, tpe),
      silent = false
    )
  def mkDeepClassTagImpl[T: WeakTypeTag]/*(tCtag: c.Tree)*/ : Tree = {
    val T = weakTypeOf[T]
    val tCtag = findInstance[ClassTag](T)
    val targCtags = T.dealias.typeArgs.map(arg => {
      val argInst = findInstance[DeepClassTag](arg)
      q"$argInst.classTags"
    })
    val targClassTags = q"_root_.scala.List.apply[Application](..$targCtags)"
    q"new DeepClassTag[$T](Application($tCtag, $targClassTags))"
  }
}
(Does it work?)
My PR to Spark to support local classes: https://github.com/apache/spark/pull/38740