
Generic state management

This question is a follow-up to State management not serializable.


I want to encapsulate state management logic.

The following represents where I am at right now:

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

class StateManager(
  stream: DStream[(String, String)],
  updateStateFunction: (String, Option[String], State[String]) => Option[(String, String)]
) {
  lazy val myState = stream.mapWithState(stateSpec).map(_.get)
  lazy val stateSpec = StateSpec.function(updateStateFunction)
}

object StateManager {
  def apply(
    _dStream: DStream[(String, String)],
    _updateState: (String, Option[String], State[String]) => Option[(String, String)]
  ) =
    new StateManager(_dStream, _updateState)
}

This works fine, but it only handles DStream[(String, String)]. It is just a first step towards generic state management that could accept any DStream, from DStream[(Int, String)] to DStream[(String, myCustomClass)].

For myState to work, the update function has to be a function value rather than a method (because of serialization).

But I face a problem: type parameters don't apply to function values in Scala.
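
To make the limitation concrete, here is a minimal sketch (plain Scala, independent of Spark): only methods can declare type parameters in Scala 2, while a function value has to be pinned to concrete types:

// A method can be generic:
def identityMethod[T](x: T): T = x

// A function value cannot declare its own type parameter;
// something like `val identityVal = [T](x: T) => x` does not compile in Scala 2.

// A function value must commit to concrete types:
val identityInt: Int => Int = identityMethod[Int]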

user6910411 gave me a hint by using ClassTags with an enclosing method (Type-parameterize a DStream), but the result would still be a method.
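
As I read that hint, the enclosing method carries the type parameters and the ClassTags and builds the StateSpec, but it remains a method rather than a value (the helper name below is my own, purely illustrative):

import scala.reflect.ClassTag
import org.apache.spark.streaming.{State, StateSpec}

// Generic thanks to the ClassTag context bounds, but still a method, not a function value.
def makeStateSpec[K: ClassTag, V: ClassTag, S: ClassTag, O: ClassTag](
    f: (K, Option[V], State[S]) => Option[O]) =
  StateSpec.function(f)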

Would anyone have some intel on how to overcome those difficulties?


The context:

Spark 1.6

Spark Graph:

object Consumer_Orchestrator {
    def main(args: Array[String]) = {
        //setup configurations

        val streamingEnvironment = StreamingEnvironment(/*configurations*/)

        val kafkaStream = streamingEnvironment.stream()

        val updateStateFunction: (String, Option[String], State[String]) => Option[(String, String)] = (key, value, state) => {/*some code*/}
        val initialState = emptyRDD // not used further in this snippet

        val stateManager = StateManager(kafkaStream, updateStateFunction)
        val state: DStream[(String, String)] = stateManager.myState

        state.foreachRDD(_.foreach(println))

        streamingEnvironment.streamingContext.start()
        streamingEnvironment.streamingContext.awaitTermination()
    }
}
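
Purely for illustration (the real logic is the /*some code*/ placeholder above), such an update function could look like this, using State.getOption and State.update:

// Illustrative only: append the incoming value to whatever is stored for the key.
val updateStateFunction: (String, Option[String], State[String]) => Option[(String, String)] =
  (key, value, state) => {
    val merged = state.getOption.getOrElse("") + value.getOrElse("")
    state.update(merged)  // persist the new per-key state
    Some(key -> merged)   // emitted record, unwrapped later by .map(_.get)
  }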

The StreamingEnvironment class that creates the streaming context:

class StreamingEnvironment(sparkConf: SparkConf, kafkaConf: KafkaConf) {
    val sparkContext = SparkContext.getOrCreate(sparkConf)
    lazy val streamingContext = new StreamingContext(sparkContext, Seconds(30))

    streamingContext.checkpoint(/*directory checkpoint*/)
    streamingContext.remember(Minutes(1))

    def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaConf.mBrokers, kafkaConf.mTopics)
    def stop() = sparkContext.stop()
}

object StreamingEnvironment {
    def apply(kafkaConf: KafkaConf) = {
        val sparkConf = new SparkConf

        new StreamingEnvironment(sparkConf, kafkaConf)
    }
}

Here you are:

  • App.scala :

      import org.apache.spark.{SparkContext, SparkConf}
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.dstream.ConstantInputDStream
      import statemanager._

      object App {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local[*]", "generic", new SparkConf())
          val ssc = new StreamingContext(sc, Seconds(10))
          ssc.checkpoint("/tmp/chk")

          StateManager(
            new ConstantInputDStream(ssc, sc.parallelize(Seq(("a", 1), ("b", 2)))),
            (_: String, _: Option[Int], _: State[Int]) => Option(1)
          ).myState.print

          ssc.start()
          ssc.awaitTermination()
        }
      }
  • StateManage.scala :

      package statemanager

      import scala.reflect.ClassTag
      import org.apache.spark.streaming.{State, StateSpec}
      import org.apache.spark.streaming.dstream.DStream

      class StateManager[T : ClassTag, U : ClassTag, V : ClassTag, W : ClassTag](
        stream: DStream[(T, U)],
        updateStateFunction: (T, Option[U], State[V]) => Option[W]
      ) {
        lazy val myState = stream.mapWithState(stateSpec).map(_.get)
        lazy val stateSpec = StateSpec.function(updateStateFunction)
      }

      object StateManager {
        def apply[T : ClassTag, U : ClassTag, V : ClassTag, W : ClassTag](
          _dStream: DStream[(T, U)],
          _updateState: (T, Option[U], State[V]) => Option[W]
        ) =
          new StateManager(_dStream, _updateState)
      }
  • build.sbt :

      scalaVersion := "2.11.8"

      val sparkVersion = "2.1.0"

      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core" % sparkVersion,
        "org.apache.spark" %% "spark-streaming" % sparkVersion
      )
  • Directory structure:

      ├── App.scala
      ├── build.sbt
      └── StateManage.scala
  • Example execution:

      sbt run
      ...
      -------------------------------------------
      Time: 1483701790000 ms
      -------------------------------------------
      1
      1
      ...

As you can see, there is no magic here: if you introduce generic type parameters, you need ClassTags in the same context.
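
To make that concrete, here is a hedged sketch (the Pipeline name is illustrative, not part of the code above): any further generic layer that wraps StateManager has to declare its own ClassTag context bounds so they can be forwarded along:

import scala.reflect.ClassTag
import org.apache.spark.streaming.State
import org.apache.spark.streaming.dstream.DStream
import statemanager.StateManager

// Hypothetical wrapper: the ClassTag bounds are repeated here so that the
// implicits required by StateManager.apply (and, underneath, by Spark) are in scope.
class Pipeline[T: ClassTag, U: ClassTag, V: ClassTag, W: ClassTag](
    stream: DStream[(T, U)],
    update: (T, Option[U], State[V]) => Option[W]) {
  lazy val output: DStream[W] = StateManager(stream, update).myState
}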
