
Persist Spark Streaming state to disk in addition to memory

I have written a Spark Streaming program that uses a mapWithState function to detect repetitive records and skip them. The function is similar to the one below:

import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}

// Emits a record only the first time its key is seen; repeated keys are dropped.
val trackStateFunc1 = (batchTime: Time,
                       key: String,
                       value: Option[(String, String)],
                       state: State[Long]) => {
  if (state.isTimingOut()) {
    None                 // key is being evicted by the timeout; emit nothing
  } else if (state.exists()) {
    None                 // key already seen; drop the repeated record
  } else {
    state.update(1L)     // remember this key
    Some(value.get)      // emit the first occurrence
  }
}

val stateSpec1 = StateSpec.function(trackStateFunc1)
  //.initialState(initialRDD)
  .numPartitions(100)
  .timeout(Minutes(30 * 24 * 60))   // evict keys not updated for ~30 days
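Applied to the input stream, the wiring looks roughly like this (ssc and inputDStream are placeholder names for my StreamingContext and keyed input stream, not shown above):

// inputDStream: DStream[(String, (String, String))]
val deduplicated = inputDStream.mapWithState(stateSpec1)
deduplicated.print()

ssc.start()
ssc.awaitTermination()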

The number of records could be high, and I set the timeout to about one month, so the number of keys held in state could also be high. I wanted to know whether I can save these states on disk in addition to memory, something like RDD.persist(StorageLevel.MEMORY_AND_DISK_SER).

I wanted to know if I can save these states on Disk in addition to the Memory

State in stateful Spark Streaming automatically gets serialized to persistent storage; this is called checkpointing. When you run a stateful DStream, you must provide a checkpoint directory, otherwise the graph won't be able to execute at runtime.
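For example, a minimal way to provide one on the StreamingContext (the directory path here is only a placeholder):

// must be called before the streaming context is started
ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")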

You can set the checkpointing interval via DStream.checkpoint. For example, to set it to every 30 seconds:

inputDStream
 .mapWithState(trackStateFunc)
 .checkpoint(Seconds(30))

According to the "MapWithState" sources, you can try:

mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)

This is still valid as of Spark 3.0.1.
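As a rough sketch of where that call sits in the pipeline (mapWithStateDS is assumed to be the result of mapWithState on the question's stream; the storage level is just one possible choice):

import org.apache.spark.storage.StorageLevel

val mapWithStateDS = inputDStream.mapWithState(stateSpec1)

// per the MapWithStateDStream sources, dependencies.head is the internal
// state DStream; ask Spark to spill its RDDs to disk when memory is tight
mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)

MEMORY_AND_DISK keeps partitions deserialized in memory and spills the rest to disk; MEMORY_AND_DISK_SER trades extra CPU for a smaller memory footprint.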
