Spark Error: Unable to find encoder for type stored in a Dataset
I am using Spark on a Zeppelin notebook, and groupByKey() does not seem to be working.
This code:
df.groupByKey(row => row.getLong(0))
.mapGroups((key, iterable) => println(key))
Gives me this error (presumably a compilation error, since it shows up immediately even though the dataset I am working on is pretty big):
error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
I tried to add a case class and map all of my rows into it, but still got the same error:
import spark.implicits._
case class DFRow(profileId: Long, jobId: String, state: String)
def getDFRow(row: Row):DFRow = {
return DFRow(row.getLong(row.fieldIndex("item0")),
row.getString(row.fieldIndex("item1")),
row.getString(row.fieldIndex("item2")))
}
df.map(DFRow(_))
.groupByKey(row => row.getLong(0))
.mapGroups((key, iterable) => println(key))
The schema of my DataFrame is:
root
|-- item0: long (nullable = true)
|-- item1: string (nullable = true)
|-- item2: string (nullable = true)
You're trying to mapGroups with a function (Long, Iterator[Row]) => Unit and there is no Encoder for Unit (not that it would make sense to have one).
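To make this concrete, here is a minimal sketch in which mapGroups returns a value that does have an encoder, and the printing happens afterwards on the driver. The sample rows, the app name and the local master are assumptions for illustration only; the column names follow the schema in the question.

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an existing session if there is one.
val spark = SparkSession.builder().master("local[*]").appName("mapGroups-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows shaped like the schema in the question.
val df = Seq((1L, "job-a", "done"), (1L, "job-b", "failed"), (2L, "job-c", "done"))
  .toDF("item0", "item1", "item2")

// The function now has type (Long, Iterator[Row]) => (Long, Int).
// spark.implicits._ provides an Encoder for (Long, Int), so this compiles.
val counts = df
  .groupByKey(row => row.getLong(0))
  .mapGroups((key, rows) => (key, rows.size))

// Side effects such as println are done on the collected result, not inside mapGroups.
counts.collect().foreach(println)
```

The only change relative to the failing snippet is the return type of the function passed to mapGroups.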
In general, parts of the Dataset API which are not focused on the SQL DSL (DataFrame => DataFrame, DataFrame => RelationalGroupedDataset, RelationalGroupedDataset => DataFrame, RelationalGroupedDataset => RelationalGroupedDataset) require either implicit or explicit encoders for the output values.
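As a rough illustration of the "explicit" option, the encoders can be passed by hand in the second parameter list instead of being picked up from spark.implicits._. This is only a sketch: the data from spark.range, the modulo key and the names keyEncoder, outEncoder and sizes are made up for the example.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("explicit-encoders-sketch").getOrCreate()

// Hypothetical input: a DataFrame with a single long column holding 0..5.
val df = spark.range(0, 6).toDF("item0")

// Encoders built explicitly rather than resolved implicitly.
val keyEncoder: Encoder[Long] = Encoders.scalaLong
val outEncoder: Encoder[(Long, Int)] = Encoders.tuple(Encoders.scalaLong, Encoders.scalaInt)

// Group by parity of item0 and count the rows in each group.
val sizes = df
  .groupByKey(row => row.getLong(0) % 2)(keyEncoder)
  .mapGroups((key, rows) => (key, rows.size))(outEncoder)

sizes.collect().foreach(println)
```

In everyday code the implicit route via import spark.implicits._ is shorter; the explicit form mainly shows where the encoder requirement comes from.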
Since there are no predefined encoders for Row objects, using Dataset[Row] with methods designed for statically typed data doesn't make much sense. As a rule of thumb you should always convert to the statically typed variant first:
df.as[(Long, String, String)]
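A minimal end-to-end sketch of that conversion followed by the grouping, assuming a small hand-made DataFrame with the question's column names (the sample values and the name jobsPerProfile are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("typed-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample rows shaped like the schema in the question.
val df = Seq((1L, "job-a", "done"), (1L, "job-b", "failed"), (2L, "job-c", "done"))
  .toDF("item0", "item1", "item2")

// Convert to a statically typed Dataset first; tuple fields map to columns by position.
val typed = df.as[(Long, String, String)]

// Key by the first field and return an encodable (Long, String) per group.
val jobsPerProfile = typed
  .groupByKey(_._1)
  .mapGroups((profileId, rows) => (profileId, rows.map(_._2).toList.sorted.mkString(",")))

jobsPerProfile.collect().foreach(println)
```

After the conversion the grouping function works on tuple fields rather than on Row getters, and both the key type and the result type have encoders available from spark.implicits._.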
See also: "Encoder error while trying to map dataframe row to updated row"