How to create a custom Encoder in Spark 2.X Datasets?

Spark Datasets move away from Rows to Encoders for POJOs/primitives. The Catalyst engine uses an ExpressionEncoder to convert the columns in a SQL expression. However, there do not appear to be other subclasses of Encoder available to use as a template for our own implementations.
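
For reference, the built-in entry points look like this (a minimal sketch only; the value names are illustrative):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Built-in encoders for primitives and for Product types (tuples, case classes)
val stringEnc: Encoder[String] = Encoders.STRING
val pairEnc: Encoder[(String, Int)] = Encoders.tuple(Encoders.STRING, Encoders.scalaInt)

// ExpressionEncoder is the concrete implementation Catalyst uses internally
val exprEnc: ExpressionEncoder[(String, Int)] = ExpressionEncoder[(String, Int)]()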

Here is an example of code that works in Spark 1.X / DataFrames but does not compile under the new regime:

//mapping each row to RDD tuple
df.map(row => {
    val id: String = if (!has_id) "" else row.getAs[String]("id")
    val label: String = row.getAs[String]("label")
    val channels  : Int = if (!has_channels) 0 else row.getAs[Int]("channels")
    val height  : Int = if (!has_height) 0 else row.getAs[Int]("height")
    val width : Int = if (!has_width) 0 else row.getAs[Int]("width")
    val data : Array[Byte] = row.getAs[Any]("data") match {
      case str: String => str.getBytes
      case arr: Array[Byte@unchecked] => arr
      case _ => {
        log.error("Unsupported value type")
        null
      }
    }
    (id, label, channels, height, width, data)
  }).persist(StorageLevel.DISK_ONLY)


We get a compiler error of

Error:(56, 11) Unable to find encoder for type stored in a Dataset.
Primitive types (Int, String, etc) and Product types (case classes) are supported 
by importing spark.implicits._  Support for serializing other types will be added in future releases.
    df.map(row => {
          ^

So, somehow/somewhere, there should be a means to:

  • Define/implement our custom Encoder
  • Apply it when performing a mapping on the DataFrame (which is now a Dataset of type Row)
  • Register the Encoder for use by other custom code

I am looking for code that successfully performs these steps.
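
For illustration, here is a minimal sketch of what those three steps might look like using the built-in Encoders.kryo factory; the class ImageRecord, the method toImages, and the incoming DataFrame are hypothetical stand-ins, not code from this question:

import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Encoders}

// Hypothetical non-Product class to carry through a Dataset
class ImageRecord(val id: String, val data: Array[Byte]) extends Serializable

def toImages(df: DataFrame): Dataset[ImageRecord] = {
  // 1. Define: a binary (Kryo-serialized) Encoder for the class
  implicit val imageEncoder: Encoder[ImageRecord] = Encoders.kryo[ImageRecord]

  // 2. Apply: Dataset.map picks up the implicit Encoder for its result type
  df.map(row => new ImageRecord(row.getAs[String]("id"),
                                row.getAs[Array[Byte]]("data")))
  // 3. "Register": keep the implicit val in a shared object so any other code
  //    that maps to ImageRecord resolves the same Encoder
}

Note that a Kryo-based encoder stores each object as a single binary column, so the resulting Dataset cannot be queried column-by-column the way a Product-encoded one can.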

As far as I am aware, nothing has really changed since 1.6, and the solutions described in How to store custom objects in Dataset? are the only available options. Nevertheless, your current code should work just fine with the default encoders for product types.
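
As a minimal sketch of that point (spark and df are assumed to be the SparkSession and the DataFrame from the question):

import org.apache.spark.sql.Dataset
import spark.implicits._   // default encoders for primitives and Product types

// The tuple is a Product type, so its encoder is derived implicitly
val ds: Dataset[(String, String, Int, Int, Int, Array[Byte])] =
  df.map(row => (
    row.getAs[String]("id"), row.getAs[String]("label"),
    row.getAs[Int]("channels"), row.getAs[Int]("height"),
    row.getAs[Int]("width"), row.getAs[Array[Byte]]("data")))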

To get some insight into why your code worked in 1.x and may not work in 2.0.0, you'll have to check the signatures. In 1.x, DataFrame.map is a method which takes a function Row => T and transforms RDD[Row] into RDD[T].

In 2.0.0, DataFrame.map also takes a function of type Row => T, but transforms Dataset[Row] (aka DataFrame) into Dataset[T], hence T requires an Encoder. If you want to get the "old" behavior you should use RDD explicitly:

df.rdd.map(row => ???)
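
To make the difference concrete, a minimal sketch contrasting the two call sites (df is the DataFrame from the question; the label column is just an example):

import spark.implicits._

// Dataset API: the result type needs an implicit Encoder in scope
val labelsDs = df.map(row => row.getAs[String]("label"))       // Dataset[String]

// RDD API: no Encoder required, same behavior as DataFrame.map in 1.x
val labelsRdd = df.rdd.map(row => row.getAs[String]("label"))  // RDD[String]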

For mapping over a Dataset[Row], see Encoder error while trying to map dataframe row to updated row.

Did you import the implicit encoders?

import spark.implicits._

http://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.sql.Encoder

I imported spark.implicits._, where spark is the SparkSession; it solved the error, and the custom encoders got imported.

Also, writing a custom encoder is a way out, though I haven't tried it.

Working solution: create the SparkSession and import the following:

import spark.implicits._
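
A minimal sketch of that setup (the app name and local master are arbitrary choices for the example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("encoder-example")   // arbitrary name for this sketch
  .master("local[*]")           // assumption: local run; drop on a cluster
  .getOrCreate()

// The implicits live on the SparkSession instance, not on a package object,
// so this import has to come after the session is created
import spark.implicits._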
