
How to make an Encoder for scala Iterable, spark dataset

I'm trying to create a Dataset from an RDD y of this type:

y: RDD[(MyObj1, scala.Iterable[MyObj2])]

So I explicitly created an encoder:

import org.apache.spark.sql.{Encoder, Encoders}

implicit def tuple2[A1, A2](
    implicit e1: Encoder[A1],
    e2: Encoder[A2]
): Encoder[(A1, A2)] = Encoders.tuple[A1, A2](e1, e2)

// Create the Dataset
val z = spark.createDataset(y)(tuple2[MyObj1, Iterable[MyObj2]])

This code compiles without errors, but when I try to run it I get this error:

Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for scala.Iterable[org.bean.input.MyObj2]
- field (class: "scala.collection.Iterable", name: "_2")
- root class: "scala.Tuple2"
        at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
        at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.immutable.List.flatMap(List.scala:344)
        at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
        at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
        at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
        at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
        at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)

Some details about my objects (MyObj1 and MyObj2):

- MyObj1:

case class MyObj1(
    id: String,
    `type`: String // `type` is a reserved word in Scala and must be backquoted
)

- MyObj2:

trait MyObj2 {
  val o_state: Option[String]
  val n_state: Option[String]
  val ch_inf: MyObj1
  val state_updated: MyObj3
}

Any help, please.

Spark doesn't provide an Encoder for Iterable, so unless you want to use Encoders.kryo or Encoders.javaSerialization this won't work.
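For reference, here is a minimal sketch of the kryo fallback, assuming you want to keep the Iterable as-is (the name iterableEnc and the use of Encoders.product are illustrative additions, not from the question). Note that a kryo-encoded column is stored as a single opaque binary field, so you lose the columnar structure:

import org.apache.spark.sql.{Encoder, Encoders}

// Fall back to kryo for the Iterable column (serialized as one binary blob):
implicit val iterableEnc: Encoder[Iterable[MyObj2]] =
  Encoders.kryo[Iterable[MyObj2]]

val z = spark.createDataset(y)(
  Encoders.tuple(Encoders.product[MyObj1], iterableEnc)
)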

The closest subclass of Iterable for which Spark provides Encoders is Seq, so that is probably what you should use here. Otherwise refer to How to store custom objects in Dataset?
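The simplest route is to materialize the Iterable values into a Seq before creating the Dataset. A minimal sketch, assuming y is the RDD from the question (mapValues comes from PairRDDFunctions, and MyObj2 itself must also be encodable, e.g. a case class rather than a bare trait):

import org.apache.spark.rdd.RDD

// Convert the Iterable values to Seq so the built-in encoders apply:
val ySeq: RDD[(MyObj1, Seq[MyObj2])] = y.mapValues(_.toSeq)
val z = spark.createDataset(ySeq) // requires import spark.implicits._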

Try changing the declaration to val y: RDD[(MyObj1, Seq[MyObj2])] and it will work. I checked it with my own classes:

case class Key(key: String)
case class Value(value: Int)

For:

val y: RDD[(Key, Seq[Value])] = sc.parallelize(Map(
  Key("A") -> List(Value(1), Value(2)),
  Key("B") -> List(Value(3), Value(4), Value(5))
).toSeq)

val z = sparkSession.createDataset(y)
z.show()

I got:

+---+---------------+
| _1|             _2|
+---+---------------+
|[A]|     [[1], [2]]|
|[B]|[[3], [4], [5]]|
+---+---------------+

If I change it to Iterable, I get the exception you got.
