Spark streaming: Cache DStream results across batches
Using Spark Streaming (1.6) I have a file stream for reading lookup data with a 2s batch size; however, files are copied to the directory only every hour.

Once there is a new file, its content is read by the stream. This is what I want to cache into memory and keep there until new files are read.

There is another stream to which I want to join this dataset, which is why I would like to cache it.

This is a follow-up question to Batch lookup data for Spark streaming.
The answer there does work fine with updateStateByKey, however I don't know how to deal with cases where a KV pair is deleted from the lookup files, since the sequence of values in updateStateByKey keeps growing. Any hint on how to do this with mapWithState would also be great.

This is what I tried so far, but the data doesn't seem to be persisted:
```scala
val dictionaryStream = ssc.textFileStream("/my/dir")

dictionaryStream.foreachRDD { x =>
  if (!x.partitions.isEmpty) {
    x.unpersist(true)
    x.persist()
  }
}
```
DStreams can be persisted directly using the persist method, which persists every RDD in the stream:

```scala
dictionaryStream.persist
```
According to the official documentation this is applied automatically for window-based operations like reduceByWindow and reduceByKeyAndWindow, and for state-based operations like updateStateByKey, so there should be no need for explicit caching in your case. There is also no need for manual unpersisting. To quote the docs once again:

> by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared

and the retention period is tuned automatically based on the transformations which are used in the pipeline.
Regarding mapWithState, you'll have to provide a StateSpec. A minimal example requires a function which takes the key, an Option of the current value, and the previous state. Let's say you have a DStream[(String, Double)] and you want to record the maximum value seen so far:
```scala
val state = StateSpec.function(
  (key: String, current: Option[Double], state: State[Double]) => {
    // Keep the larger of the incoming value and the previously stored maximum
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    (key, max)
  }
)

val inputStream: DStream[(String, Double)] = ???

inputStream.mapWithState(state).print()
```
It is also possible to provide an initial state, a timeout interval, and to capture the current batch time. The last two can be used to implement a removal strategy for keys which haven't been updated for some period of time.
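To sketch that removal strategy: when a key receives no data within the configured timeout, Spark invokes the mapping function one final time with current = None and State.isTimingOut returning true, after which the key is dropped from the state. The State class below is a minimal stand-in (not Spark's org.apache.spark.streaming.State) just to make the control flow runnable on its own; in a real job you would use Spark's class and register the function with something like StateSpec.function(trackMax _).timeout(Minutes(60)).

```scala
// Minimal stand-in for org.apache.spark.streaming.State, only for illustration.
class State[S](private var value: Option[S], val isTimingOut: Boolean) {
  def exists: Boolean = value.isDefined
  def getOption: Option[S] = value
  def update(s: S): Unit = { value = Some(s) }
}

// Same shape as the max-tracking function above, extended with a timeout branch.
// Returning an Option lets the stream emit nothing for a key that is expiring.
def trackMax(key: String, current: Option[Double],
             state: State[Double]): Option[(String, Double)] = {
  if (state.isTimingOut) {
    // Key was not updated within the timeout; it will be removed, and calling
    // state.update here would be illegal in Spark's real State implementation.
    None
  } else {
    val max = Math.max(
      current.getOrElse(Double.MinValue),
      state.getOption.getOrElse(Double.MinValue)
    )
    state.update(max)
    Some((key, max))
  }
}
```

In your lookup-file scenario this means a KV pair deleted from the files simply stops being refreshed and is evicted once the timeout elapses, instead of accumulating forever as with updateStateByKey.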