
How to use Avro serialization for scala case classes with Flink 1.7?

We've got a Flink job written in Scala that uses case classes (generated from avsc files by avrohugger) to represent our state. We would like to use Avro for serializing our state so that state migration works when we update our models. We understood that since Flink 1.7 Avro serialization is supported out of the box. We added the flink-avro module to the classpath, but when restoring from a saved snapshot we notice that it is still trying to use Kryo serialization. Relevant code snippet:

import org.apache.flink.streaming.api.scala._

case class Foo(id: String, timestamp: java.time.Instant)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val conf = env.getConfig
conf.disableForceKryo()
conf.enableForceAvro()

val rawDataStream: DataStream[String] = env.addSource(MyFlinkKafkaConsumer)

val parsedDataStream: DataStream[Foo] = rawDataStream.flatMap(new JsonParser[Foo])

// do something useful with it

env.execute("my-job")

When performing a state migration on Foo (e.g. by adding a field and deploying the job) I see that it tries to deserialize using Kryo, which obviously fails. How can I make sure Avro serialization is being used?
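For context, a minimal sketch of how keyed state over Foo could be declared (the descriptor name is hypothetical, not from our actual job); whichever TypeInformation backs the descriptor determines the serializer Flink uses for that piece of state, which is where the Kryo fallback surfaces:

import org.apache.flink.api.common.state.ValueStateDescriptor
import org.apache.flink.api.scala.createTypeInformation

// Hypothetical state declaration, e.g. inside a rich function's open():
// the TypeInformation passed here decides which serializer backs the state.
val fooStateDescriptor =
  new ValueStateDescriptor[Foo]("last-foo", createTypeInformation[Foo])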

UPDATE

Found out about https://issues.apache.org/jira/browse/FLINK-10897, so POJO state serialization with Avro is only supported from 1.8 onwards, AFAIK. I tried it using the latest RC of 1.8 with a simple WordCount POJO that extends SpecificRecord:

/** MACHINE-GENERATED FROM AVRO SCHEMA. DO NOT EDIT DIRECTLY */
import scala.annotation.switch

case class WordWithCount(var word: String, var count: Long) extends 
  org.apache.avro.specific.SpecificRecordBase {
  def this() = this("", 0L)
  def get(field$: Int): AnyRef = {
    (field$: @switch) match {
      case 0 => {
        word
      }.asInstanceOf[AnyRef]
      case 1 => {
        count
      }.asInstanceOf[AnyRef]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
  }
  def put(field$: Int, value: Any): Unit = {
    (field$: @switch) match {
      case 0 => this.word = {
        value.toString
      }.asInstanceOf[String]
      case 1 => this.count = {
        value
      }.asInstanceOf[Long]
      case _ => new org.apache.avro.AvroRuntimeException("Bad index")
    }
    ()
  }
  def getSchema: org.apache.avro.Schema = WordWithCount.SCHEMA$
}

object WordWithCount {
  val SCHEMA$ = new org.apache.avro.Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"WordWithCount\",\"fields\":" +
    "[{\"name\":\"word\",\"type\":\"string\"}," +
    "{\"name\":\"count\",\"type\":\"long\"}]}")
}

This, however, also didn't work out of the box. We then tried to define our own type information using flink-avro's AvroTypeInfo, but this fails because Avro looks for a SCHEMA$ property (SpecificData:285) on the class and is unable to use Java reflection to identify the SCHEMA$ in the Scala companion object.
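For illustration, "defining our own type information" amounted to something like the sketch below (assuming the generated WordWithCount above); AvroTypeInfo is flink-avro's TypeInformation for SpecificRecord classes, and this is exactly the point where Avro's SpecificData reflection fails to find SCHEMA$:

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.formats.avro.typeutils.AvroTypeInfo

// Explicit Avro-based type information for the generated SpecificRecord class.
// As described above, this fails at runtime because SpecificData looks for a
// static SCHEMA$ field, which the Scala companion object does not expose.
implicit val wordWithCountInfo: TypeInformation[WordWithCount] =
  new AvroTypeInfo(classOf[WordWithCount])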

I could never get reflection to work due to Scala's fields being private under the hood. AFAIK the only solution is to update Flink to use Avro's non-reflection-based constructors in AvroInputFormat (compare).

In a pinch, other than switching to Java, one could fall back to Avro's GenericRecord, maybe using avro4s to generate them from avrohugger's Standard format (note that avro4s will generate its own schema from the generated Scala types).
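A hedged sketch of that fallback, assuming avro4s is on the classpath (the class and value names here are illustrative, and GenericRecordAvroTypeInfo is flink-avro's Avro-based TypeInformation for GenericRecord streams in recent Flink versions):

import com.sksamuel.avro4s.{AvroSchema, RecordFormat}
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo
import org.apache.flink.streaming.api.scala._

// Plain case class, as produced by avrohugger's Standard format (illustrative).
case class Word(word: String, count: Long)

object WordAvro {
  // avro4s derives its own Avro schema and a case class <-> GenericRecord converter.
  val schema: Schema = AvroSchema[Word]
  val format: RecordFormat[Word] = RecordFormat[Word]
}

// Avro-backed type information so the GenericRecord stream is not handled by Kryo.
implicit val genericRecordInfo: TypeInformation[GenericRecord] =
  new GenericRecordAvroTypeInfo(WordAvro.schema)

// Convert the case-class stream to GenericRecords before it reaches stateful operators.
def toGenericRecords(words: DataStream[Word]): DataStream[GenericRecord] =
  words.map[GenericRecord](w => WordAvro.format.to(w))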
