Case class serialization in Flink

I am trying to build a DataSet using a Scala case class (I would rather use case classes than tuples because I want to join on fields by name).

Here is one iteration of a join I am working on:

case class TestTarget(tacticId: String, partnerId:Long)

campaignPartners.join(partnerInput).where(1).equalTo("id") {
  (target, partnerInfo, out: Collector[TestTarget]) => {
    partnerInfo.partner_pricing match {
      case Some(pricing) =>
        out.collect(TestTarget(target._1, partnerInfo.partner_id))
      case None => ()
    }
  }
}

This throws the following error:

org.apache.flink.api.common.InvalidProgramException: Task not serializable
    at org.apache.flink.api.scala.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:179)
    at org.apache.flink.api.scala.ClosureCleaner$.clean(ClosureCleaner.scala:171)
    at org.apache.flink.api.scala.DataSet.clean(DataSet.scala:121)
    at org.apache.flink.api.scala.JoinDataSet$$anon$2.<init>(joinDataSet.scala:108)
    at org.apache.flink.api.scala.JoinDataSet.apply(joinDataSet.scala:107)
    at com.adfin.dataimport.vendors.dbm.Job.calculateVendorFees(Job.scala:84)

I have seen the docs here that state that I need to implement Serializable for the class. As far as I can tell, in new versions of Scala there is no good way to automatically serialize case classes. (I looked into manual serialization, but I think I would need to do some extra work with Flink for this to work.)
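For reference, Scala case classes already extend java.io.Serializable, so a top-level case class can normally be written with plain Java serialization without any extra code. The following is a minimal sketch that is independent of Flink; the object name CaseClassSerializable and the helper toBytes are hypothetical names used only for illustration.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Top-level case class, same shape as the one in the question.
case class TestTarget(tacticId: String, partnerId: Long)

// Hypothetical helper object, not part of Flink.
object CaseClassSerializable {
  // Serializes a value with plain Java serialization and returns the bytes.
  def toBytes(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    buffer.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val target = TestTarget("tactic-1", 42L)
    // Case classes extend Serializable without an explicit declaration.
    println(target.isInstanceOf[java.io.Serializable])
    // Writing the instance should not throw NotSerializableException.
    println(toBytes(target).nonEmpty)
  }
}
```

If this holds, the "Task not serializable" error is more likely to come from something the join function captures than from the case class definition itself.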

Edit: As per Till Rohrmann's suggestion, I tried to reproduce this error with a small test case, shown below. The example ran without problems, so I failed to reproduce the error. I also tried putting Option cases everywhere, but that did not cause the job to fail either.

val text = env.fromElements("To be, or not to be,--that is the question:--")

val words = text.flatMap { _.toLowerCase.split("\\W+") }.map(x => (1,x))

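// First is not shown in the original post; it is assumed to be a case class
// with fields a and b, e.g. case class First(a: Int, b: Int).
val nums = env.fromElements(List(1,2,3,4,5)).flatMap(x => x).map(x => First(1,x))
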
val counts = words.join(nums).where(0).equalTo("a") {
  (a, b, out: Collector[TestTarget]) => {
    b.b match {
      case 2 => ()
      case _ => out.collect(TestTarget(a._2, b.b))
    }
  }
}

In my actual program, the case class was defined inside the Job class:

class Job(conf: AdfinConfig)(implicit env: ExecutionEnvironment)
        extends DspJob(conf){
    ...
    case class TestTarget(tacticId: String, partnerId:Long)
    campaignPartners.join(partnerInput).where(1).equalTo("id") {
    ...
}

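Since TestTarget was an inner class of Job, it wasn't being serialized automatically. An inner class is tied to its enclosing instance, so the join function that constructs TestTarget also has to capture the surrounding Job, and that is presumably what fails Flink's serializability check.

If you move the case class out of Job so that it is no longer an inner class, everything works: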

case class TestTarget(tacticId: String, partnerId:Long)
class Job(conf: AdfinConfig)(implicit env: ExecutionEnvironment)
        extends DspJob(conf){
    ...
    words.join( ....) 
    ...
}
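The difference between the two layouts can be checked without Flink. The sketch below assumes Scala 2.12 or later (where function literals are serializable) and uses plain Java serialization, which is roughly what a "Task not serializable" check relies on. JobLike, InnerTarget, TopLevelTarget and SerializationCheck are hypothetical names standing in for the classes above.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Top-level case class: closures that build it need no enclosing instance.
case class TopLevelTarget(tacticId: String, partnerId: Long)

// Hypothetical stand-in for the Job class; deliberately not Serializable.
class JobLike {
  // Inner case class: constructing it requires the enclosing JobLike instance.
  case class InnerTarget(tacticId: String, partnerId: Long)

  // This function captures `this`, because InnerTarget belongs to this instance.
  val makeInner: String => Any = id => InnerTarget(id, 1L)

  // This function references nothing from JobLike.
  val makeTopLevel: String => Any = id => TopLevelTarget(id, 1L)
}

object SerializationCheck {
  // Returns true if plain Java serialization of `value` succeeds.
  def isSerializable(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val job = new JobLike
    println(s"inner closure serializable: ${isSerializable(job.makeInner)}")
    println(s"top-level closure serializable: ${isSerializable(job.makeTopLevel)}")
  }
}
```

Under these assumptions, the closure that builds the inner case class is expected to fail the check, because serializing it drags in the non-serializable JobLike instance, while the closure that builds the top-level case class is expected to pass.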
