
Compilation Encoder error on Spark 2.0

I am trying to move from Spark 1.6 to 2.0, and I get this error during compilation on 2.0 only:

  def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] = {
    val columns: Array[String] = colNames.split(',')
    val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
    subGroupCount
  }

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

     val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)

Regards

The method DataFrame.map has changed between the versions:

  • In Spark 1.6, it operates on the underlying RDD[Row] and returns an RDD:

     def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R] 
  • In Spark 2.0, DataFrame is just an alias for Dataset[Row], and therefore it returns a Dataset:

     def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] 

As you can see, the latter expects an implicit Encoder argument, which is missing in your case.
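
To see the requirement concretely, here is a minimal sketch that passes the encoder explicitly instead of relying on implicits (the column name someCol is just a placeholder):

     import org.apache.spark.sql.Encoders

     // The same map call with the implicit argument spelled out:
     // Encoders.STRING supplies the Encoder[String] that the compiler
     // would otherwise look up in the implicit scope.
     val values = df.select("someCol").distinct
       .map(row => row.getString(0))(Encoders.STRING)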

Why is the Encoder missing?

First, all default encoders will be in scope once you import spark.implicits._ However, since the mapping's result type is Any (x => x.get(0) returns Any), you won't have an Encoder for it.
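
You can check this directly; a minimal sketch (assuming a SparkSession named spark is in scope):

     import org.apache.spark.sql.Encoder
     import spark.implicits._

     implicitly[Encoder[String]]   // resolves: default encoder in scope
     // implicitly[Encoder[Any]]   // does not compile: no encoder exists for Any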

How to fix this?

  1. If there's a common type (say, String, for the sake of example) for all the columns you're interested in, you can use getAs[String](0) to make the mapping function's return type specific. Once the above-mentioned import is added, such types (primitives, Products) will have a matching Encoder in scope - see the sketch after this list.

  2. If you don't have a known type that is common for all the relevant columns, and want to retain the same behavior - you can get the DataFrame's RDD using .rdd and use that RDD's map operation, which will be identical to the pre-2.0 behavior:

     columns.map(c => df.select(c).distinct.rdd.map(x => x.get(0)).collect.toSeq) 
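
Putting the two options together, here is a sketch of the rewritten method; getSubGroupCountTyped is a hypothetical name, and option 1 assumes every listed column holds strings:

     import org.apache.spark.sql.DataFrame

     // Option 1 (assumes all listed columns contain String values):
     // getAs[String] gives the mapping a concrete result type, so the
     // Encoder[String] from spark.implicits._ is found.
     def getSubGroupCountTyped(df: DataFrame, colNames: String): Array[Seq[String]] = {
       import df.sparkSession.implicits._
       colNames.split(',').map(c =>
         df.select(c).distinct.map(_.getAs[String](0)).collect.toSeq)
     }

     // Option 2: drop to the RDD, whose map needs only a ClassTag rather
     // than an Encoder, so Any is fine and the pre-2.0 behavior is kept.
     def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] =
       colNames.split(',').map(c =>
         df.select(c).distinct.rdd.map(_.get(0)).collect.toSeq)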
