
How to implement a trait with a generic case class that creates a dataset in Scala

I want to create a Scala trait that should be implemented with a case class T. The trait is simply to load data and transform it into a Spark Dataset of type T. I got the error that no encoder can be stored, which I think is because Scala does not know that T should be a case class. 我想创建一个应该用案例类T实现的Scala特征。该特征只是加载数据并将其转换为类型T的Spark数据集。我得到一个错误,即无法存储任何编码器,我认为这是因为Scala不知道T应该是案例类。 How can I tell the compiler that? 我怎样才能告诉编译器呢? I've seen somewhere that I should mention Product, but there is no such class defined.. Feel free to suggest other ways to do this! 我见过某个地方应该提到Product,但是还没有定义此类。.随意建议其他方法!

I have the following code, but it does not compile. The error is: 42: error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ [INFO] .as[T]

I'm using Spark 1.6.1

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Dataset, SQLContext}

/**
  * A trait that moves data on Hadoop with Spark based on the location and the granularity of the data.
  */
trait Agent[T] {
  /**
    * Load a Dataframe from the location and convert into a Dataset
    * @return Dataset[T]
    */
  protected def load(): Dataset[T] = {
    // Read in the data
    SparkContextKeeper.sqlContext.read
      .format("com.databricks.spark.csv")
      .load("/myfolder/" + location + "/2016/10/01/")
      .as[T]
  }
}

Your code is missing 3 things:

  • Indeed, you must let the compiler know that T is a subtype of Product (the trait that all Scala case classes and tuples extend)
  • The compiler also requires the TypeTag and ClassTag of the actual case class; Spark uses these implicitly to overcome type erasure
  • An import of sqlContext.implicits._

Unfortunately, you can't add context bounds to the type parameters of a trait, so the simplest workaround is to use an abstract class instead:

import scala.reflect.runtime.universe.TypeTag
import scala.reflect.ClassTag
import org.apache.spark.sql.{Dataset, SQLContext}

abstract class Agent[T <: Product : ClassTag : TypeTag] {
  protected def load(): Dataset[T] = {
    val sqlContext: SQLContext = SparkContextKeeper.sqlContext
    import sqlContext.implicits._
    sqlContext.read // same...
  }
}
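
To make the abstract-class route concrete, here is a minimal usage sketch. The Report case class, the ReportAgent subclass and the "reports" location are made up for illustration, and it assumes Agent also declares an abstract location member (the question's snippet uses location without defining it):

import org.apache.spark.sql.Dataset

// Hypothetical row type matching the CSV columns (all strings, since the CSV is read without a schema).
case class Report(id: String, value: String)

// Concrete agent; assumes Agent declares `protected def location: String`.
class ReportAgent extends Agent[Report] {
  protected val location: String = "reports"

  // load() is protected, so expose it through a public method.
  def reports(): Dataset[Report] = load()
}

// Usage in a driver or the REPL:
// val ds: Dataset[Report] = new ReportAgent().reports()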

Obviously, this isn't equivalent to using a trait, and might suggest that this design isn't the best fit for the job. Another alternative is placing load in an object and moving the type parameter to the method:

object Agent {
  protected def load[T <: Product : ClassTag : TypeTag](): Dataset[T] = {
    // same...
  }
}

Which one is preferable is mostly up to where and how you're going to call load and what you're planning to do with the result.
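
For illustration, a call site for the object variant could look like the sketch below. Report is the same hypothetical case class as above, and the sketch assumes load is made callable from outside the object (i.e. not protected):

import org.apache.spark.sql.Dataset

// Hypothetical row type, as in the earlier sketch.
case class Report(id: String, value: String)

// The type parameter is supplied per call rather than per Agent instance.
val reports: Dataset[Report] = Agent.load[Report]()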

You need to take two actions:

  1. Add import sparkSession.implicits._ to your imports
  2. Make your trait trait Agent[T <: Product] (see the sketch after this list)
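
Note that a plain trait Agent[T <: Product] still has no TypeTag for T, so the implicits import alone won't provide an encoder inside the trait body. One hedged way to keep the trait is to take the encoder as an implicit method parameter, supplied by the caller's sparkSession.implicits._ (or sqlContext.implicits._ on Spark 1.6); the sketch below reuses the question's SparkContextKeeper and location:

import org.apache.spark.sql.{Dataset, Encoder}

trait Agent[T <: Product] {
  protected def location: String

  // The Encoder[T] is provided by the caller's implicits import.
  protected def load()(implicit enc: Encoder[T]): Dataset[T] = {
    SparkContextKeeper.sqlContext.read
      .format("com.databricks.spark.csv")
      .load("/myfolder/" + location + "/2016/10/01/")
      .as[T]
  }
}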
