Compilation Encoder error on Spark 2.0
I am trying to move from Spark 1.6 to 2.0, and I get this error during compilation on 2.0 only:
def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] = {
  val columns: Array[String] = colNames.split(',')
  val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
  subGroupCount
}
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
Regards
The method DataFrame.map has changed between the versions:
In Spark 1.6, it operates on the underlying RDD[Row] and returns an RDD:
def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R]
In Spark 2.0, DataFrame is just an alias for Dataset[Row], and therefore it returns a Dataset:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
As you can see, the latter expects an implicit Encoder argument, which is missing in your case.
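The difference can be seen in a small sketch (the local SparkSession built here is assumed purely for illustration); the Encoder can come from the implicits import or be passed explicitly:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

// A throwaway local session, assumed only for this illustration
val spark = SparkSession.builder.master("local[*]").appName("encoder-demo").getOrCreate()
import spark.implicits._

val ds = Seq("a", "b").toDS()
ds.map(_.toUpperCase)                  // Encoder[String] resolved implicitly
ds.map(_.toUpperCase)(Encoders.STRING) // or supplied explicitly
// ds.map(x => (x: Any))               // would not compile: no Encoder[Any] exists
```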
Why is the Encoder missing?
First, all default encoders will be in scope once you import spark.implicits._. However, since the mapping's result type is Any (x => x.get(0) returns Any), you won't have an Encoder for it.
How to fix this?
If there's a common type (say, String, for the sake of example) for all the columns you're interested in, you can use getAs[String](0) to make the mapping function's return type specific. Once the above-mentioned import is added, such types (primitives, Products) will have a matching Encoder in scope.
If you don't have a known type that is common for all the relevant columns and want to retain the same behavior, you can get the DataFrame's RDD using .rdd and use that RDD's map operation, which will be identical to the pre-2.0 behavior:
columns.map(c => df.select(c).distinct.rdd.map(x => x.get(0)).collect.toSeq)
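Putting it together, a minimal sketch of the original method with the .rdd fix applied; RDD.map only requires a ClassTag, not an Encoder, so the Any element type compiles as it did in 1.6:

```scala
import org.apache.spark.sql.DataFrame

def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] = {
  val columns: Array[String] = colNames.split(',')
  // .rdd drops from Dataset[Row] to RDD[Row]; RDD.map needs no Encoder,
  // so x.get(0) returning Any is accepted, matching the Spark 1.6 behavior
  columns.map(c => df.select(c).distinct.rdd.map(x => x.get(0)).collect.toSeq)
}
```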