Spark Streaming states to be persisted to disk in addition to in memory

I have written a Spark Streaming program that uses the mapWithState function to detect repeated records and drop them. The function is similar to the one below:

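import org.apache.spark.streaming.{Minutes, State, StateSpec, Time}

// Emits a record only the first time its key is seen; repeats are dropped.
val trackStateFunc1 = (batchTime: Time,
                       key: String,
                       value: Option[(String, String)],
                       state: State[Long]) => {
  if (state.isTimingOut()) {
    None                          // key is expiring: emit nothing
  }
  else if (state.exists()) None   // key already seen: duplicate, drop it
  else {
    state.update(1L)              // remember the key
    Some(value.get)               // first occurrence: pass the record through
  }
}

val stateSpec1 = StateSpec.function(trackStateFunc1)
//.initialState(initialRDD)
.numPartitions(100)
.timeout(Minutes(30*24*60)) // 43,200 minutes = 30 days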
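To make the intended behaviour concrete, here is a minimal plain-Scala sketch of the same first-occurrence logic with no Spark dependency. The `Deduper` class and its mutable map are hypothetical stand-ins for Spark's per-key `State[Long]`, and the sketch deliberately ignores the timeout.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Spark's per-key State[Long]: key -> marker.
class Deduper {
  private val seen = mutable.Map.empty[String, Long]

  /** Returns the value the first time a key is seen, None for repeats. */
  def track(key: String, value: (String, String)): Option[(String, String)] =
    if (seen.contains(key)) None   // mirrors state.exists(): duplicate, drop it
    else {
      seen.update(key, 1L)         // mirrors state.update(1L): remember the key
      Some(value)                  // first occurrence passes through
    }
}

object DedupSketch {
  def main(args: Array[String]): Unit = {
    val deduper = new Deduper
    val records = Seq(("a", ("a", "x")), ("b", ("b", "y")), ("a", ("a", "z")))
    // Keep only the records whose key has not been seen before.
    val unique = records.flatMap { case (k, v) => deduper.track(k, v) }
    println(unique) // prints List((a,x), (b,y))
  }
}
```

Every distinct key stays in the map until it is removed, which is why a one-month timeout over many keys raises the storage question.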

The number of records could be high, and I set the timeout to about one month, so the number of keys held in state could also be large. I would like to know whether these states can be saved to disk in addition to memory, with something like "RDD.persist(StorageLevel.MEMORY_AND_DISK_SER)".

The state of a stateful stream in Spark is automatically serialized to persistent storage; this is called checkpointing. When you run a stateful DStream, you must provide a checkpoint directory, otherwise the graph won't be able to execute at runtime.

You can set the checkpointing interval via DStream.checkpoint. For example, to checkpoint every 30 seconds:

inputDStream
 .mapWithState(StateSpec.function(trackStateFunc))
 .checkpoint(Seconds(30))
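A fuller sketch of how the checkpoint directory and the checkpoint interval might fit together is shown here. The socket source on `localhost:9999`, the `key,field1,field2` line format, the `parseLine` helper, and the `/tmp/dedup-checkpoint` path are all assumptions for illustration; in a cluster the directory would normally be on fault-tolerant storage such as HDFS.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext, Time}

object CheckpointedDedup {
  // Hypothetical input format: "key,field1,field2" on each line.
  // Malformed lines raise a MatchError in this sketch.
  def parseLine(line: String): (String, (String, String)) = {
    val Array(k, a, b) = line.split(",", 3)
    (k, (a, b))
  }

  // Same first-occurrence logic as in the question.
  val trackState = (batchTime: Time, key: String,
                    value: Option[(String, String)], state: State[Long]) => {
    if (state.isTimingOut()) None     // key is expiring: emit nothing
    else if (state.exists()) None     // duplicate: drop it
    else { state.update(1L); value }  // first occurrence: remember and emit
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("CheckpointedDedup")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Required for mapWithState: state snapshots are written here.
    ssc.checkpoint("/tmp/dedup-checkpoint") // placeholder path (assumption)

    val spec = StateSpec.function(trackState).timeout(Minutes(30 * 24 * 60))

    val unique = ssc.socketTextStream("localhost", 9999)
      .map(parseLine)
      .mapWithState(spec)
      .checkpoint(Seconds(30)) // a multiple of the 10-second batch interval

    unique.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The checkpoint interval is chosen as a multiple of the batch interval, which Spark Streaming expects.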

According to the "MapWithState" sources, you can try:

mapWithStateDS.dependencies.head.persist(StorageLevel.MEMORY_AND_DISK)

This applies to Spark 3.0.1.
