
Compilation Encoder error on Spark 2.0

I am trying to move from Spark 1.6 to 2.0, and I get this error during compilation on 2.0 only:

  def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] = {
    val columns: Array[String] = colNames.split(',')
    val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)
    subGroupCount
  }

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.

     val subGroupCount: Array[Seq[Any]] = columns.map(c => df.select(c).distinct.map(x => x.get(0)).collect.toSeq)

Regards

The method DataFrame.map has changed between the versions:

  • In Spark 1.6, it operates on the underlying RDD[Row] and returns an RDD:

     def map[R](f: (Row) ⇒ R)(implicit arg0: ClassTag[R]): RDD[R] 
  • In Spark 2.0, DataFrame is just an alias for Dataset[Row], and therefore it returns a Dataset:

     def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] 

As you can see, the latter expects an implicit Encoder argument, which is missing in your case.
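
To see the requirement concretely, here is a minimal sketch that passes the encoder explicitly instead of relying on implicits (the column name someCol is just a placeholder):

     import org.apache.spark.sql.Encoders

     // The same map call with the implicit argument spelled out:
     // Encoders.STRING supplies the Encoder[String] that the compiler
     // would otherwise look up in the implicit scope.
     val values = df.select("someCol").distinct
       .map(row => row.getString(0))(Encoders.STRING)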

Why is the Encoder missing?

First, all default encoders will be in scope once you import spark.implicits._ However, since the mapping's result type is Any (x => x.get(0) returns Any), you won't have an Encoder for it.
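
You can check this directly; a minimal sketch (assuming a SparkSession named spark is in scope):

     import org.apache.spark.sql.Encoder
     import spark.implicits._

     implicitly[Encoder[String]]   // resolves: default encoder in scope
     // implicitly[Encoder[Any]]   // does not compile: no encoder exists for Any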

How to fix this?

  1. If there's a common type (say, String, for the sake of example) for all the columns you're interested in, you can use getAs[String](0) to make the mapping function's return type specific. Once the above-mentioned import is added, such types (primitives, Products) will have a matching Encoder in scope - see the sketch after this list.

  2. If you don't have a known type that is common for all the relevant columns, and want to retain the same behavior - you can get the DataFrame's RDD using .rdd and use that RDD's map operation, which will be identical to the pre-2.0 behavior:

     columns.map(c => df.select(c).distinct.rdd.map(x => x.get(0)).collect.toSeq) 
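
Putting the two options together, here is a sketch of the rewritten method; getSubGroupCountTyped is a hypothetical name, and option 1 assumes every listed column holds strings:

     import org.apache.spark.sql.DataFrame

     // Option 1 (assumes all listed columns contain String values):
     // getAs[String] gives the mapping a concrete result type, so the
     // Encoder[String] from spark.implicits._ is found.
     def getSubGroupCountTyped(df: DataFrame, colNames: String): Array[Seq[String]] = {
       import df.sparkSession.implicits._
       colNames.split(',').map(c =>
         df.select(c).distinct.map(_.getAs[String](0)).collect.toSeq)
     }

     // Option 2: drop to the RDD, whose map needs only a ClassTag rather
     // than an Encoder, so Any is fine and the pre-2.0 behavior is kept.
     def getSubGroupCount(df: DataFrame, colNames: String): Array[Seq[Any]] =
       colNames.split(',').map(c =>
         df.select(c).distinct.rdd.map(_.get(0)).collect.toSeq)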
