
Scala: Product with Serializable does not take parameters

My objective is to read data from a CSV file and convert my RDD to a DataFrame in Scala/Spark. This is my code:

package xxx.DataScience.CompensationStudy

import org.apache.spark._
import org.apache.log4j._

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types._

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext


object CompensationAnalysis {

  case class GetDF(profil_date:String, profil_pays:String, param_tarif2:String, param_tarif3:String, dt_titre:String, dt_langues:String,
    dt_diplomes:String, dt_experience:String, dt_formation:String, dt_outils:String, comp_applications:String, 
    comp_interventions:String, comp_competence:String)

  def main(args: Array[String]) {

    Logger.getLogger("org").setLevel(Level.ERROR)

    val conf = new SparkConf().setAppName("CompensationAnalysis ")
    val sc = new SparkContext(conf)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._


    val lines = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv").flatMap { l => 


      l.split(",") match {

        case field: Array[String] if field.size > 13 => Some(field(0), field(1), field(2), field(3), field(4), field(5), field(6), field(7), field(8), field(9), field(10), field(11), field(12))

        case field: Array[String] if field.size == 1 => Some((field(0), "default value"))

        case _ => None 
      }


    }

At this stage, I get the error: Product with Serializable does not take parameters

    val summary = lines.collect().map(x => GetDF(x("profil_date"), x("profil_pays"), x("param_tarif2"), x("param_tarif3"), x("dt_titre"), x("dt_langues"), x("dt_diplomes"), x("dt_experience"), x("dt_formation"), x("dt_outils"), x("comp_applications"), x("comp_interventions"), x("comp_competence")))

    val sum_df = summary.toDF()

    df.printSchema


  }

}

This is a screenshot:

[screenshot]

Help please?

You have several things you should improve. The most urgent problem, which causes the exception, is, as @CyrilleCorpet points out, "the three different lines in the pattern matching return values of types Some[Tuple13], Some[Tuple2] and None.type. The least upper bound is then Option[Product with Serializable], which complies with flatMap's signature (where the result should be an Iterable[T]) modulo some implicit conversion."
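
To see why the compiler ends up with that type, here is a minimal sketch (independent of your code) where the branches return Some of two different tuple arities plus None: the closest common supertype of the tuples is Product with Serializable, so the whole expression widens to Option[Product with Serializable] (the exact rendering may differ slightly between Scala versions):

val inferred = scala.util.Random.nextInt(3) match {
  case 0 => Some(("a", "b"))       // Some[(String, String)], a Tuple2
  case 1 => Some(("a", "b", "c"))  // Some[(String, String, String)], a Tuple3
  case _ => None                   // None.type
}
// inferred: Option[Product with Serializable] -- the element is no longer a usable tuple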

Basically, if you had Some[Tuple13], Some[Tuple13], and None, or Some[Tuple2], Some[Tuple2], and None, you would be better off.

Also, pattern matching on types is generally a bad idea because of type erasure, and pattern matching isn't even great anyway for your situation.
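
(As a generic illustration of the erasure caveat, not tied to the code above: the element type of a generic container is not available at runtime, so a type pattern on it only compiles with an "unchecked" warning and can match the wrong thing.)

def describe(x: Any): String = x match {
  case _: List[String] => "a list of strings"  // "unchecked" warning: String is erased
  case _               => "something else"
}

describe(List(1, 2, 3))  // returns "a list of strings" even though the elements are Ints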

So you could set default values in your case class:

case class GetDF(profil_date: String,
                 profil_pays: String = "default",
                 param_tarif2: String = "default",
                 ...
)

Then in your lambda:

val tokens = l.split(",")
if (tokens.length > 13) {
   Some(GetDF(tokens(0), tokens(1), tokens(2), ...))
} else if (tokens.length == 1) {
   Some(GetDF(tokens(0)))
} else {
   None
}

Now in all cases you are returning Option[GetDF]. You can flatMap the RDD to get rid of all the Nones and keep only the GetDF instances.
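
Putting it together, here is a minimal sketch of the whole pipeline under this approach, assuming the 13-column GetDF case class above (with default values from the second field onward) and simple comma-separated lines with no quoted fields:

val lines = sc.textFile("C:/Users/../Downloads/CompensationStudy.csv")

val rows: org.apache.spark.rdd.RDD[GetDF] = lines.flatMap { l =>
  val tokens = l.split(",")
  if (tokens.length > 13) {
    // all columns present: build a fully populated GetDF
    Some(GetDF(tokens(0), tokens(1), tokens(2), tokens(3), tokens(4), tokens(5), tokens(6),
               tokens(7), tokens(8), tokens(9), tokens(10), tokens(11), tokens(12)))
  } else if (tokens.length == 1) {
    // only the first column present: fall back to the default values
    Some(GetDF(tokens(0)))
  } else {
    None  // malformed line, dropped by flatMap
  }
}

// import sqlContext.implicits._ (already in your main) makes toDF available on RDD[GetDF]
val sum_df = rows.toDF()
sum_df.printSchema()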
