How to initialize the state of a DStream in a Spark Streaming application
I have a Spark Streaming application that basically keeps track of a String -> String dictionary.
So I have messages coming in with updates, like:
"A" -> "B"
and I need to update the dictionary.
This seems like a simple use case for the updateStateByKey method.
However, my issue is that when the app starts I need to "initialize" the dictionary with data from a Hive table that holds all the historical key/values of the dictionary.
The only way I could think of is doing something like:
val rdd = … // get data from Hive
def process(input: DStream[(String, String)]) = {
  input.join(rdd).updateStateByKey(update)
}
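For the elided Hive read, something along these lines is a reasonable sketch in Spark 1.x (the table name `dictionary_history` and its column layout are assumptions, not from the original question):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the application's SparkContext; the table and column names
// below are hypothetical stand-ins for the real Hive schema.
val hiveContext = new HiveContext(sc)
val rdd = hiveContext
  .sql("SELECT key, value FROM dictionary_history")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))
```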
So the join operation would be done on every incoming batch, when in fact I only need it at initialization.
Any idea how to achieve that?
Thanks
PairDStreamFunctions.updateStateByKey has an overload that accepts an initialRDD, which seems to be what you need:
updateStateByKey[S](updateFunc: (Seq[V], Option[S]) ⇒ Option[S], partitioner: Partitioner, initialRDD: RDD[(K, S)])(implicit arg0: ClassTag[S]): DStream[(K, S)]
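Putting it together, a minimal sketch might look like the following. The socket source, app name, checkpoint path, and the `parallelize` call seeding the state are all placeholders; in your application the initial RDD would come from the Hive query, and the update function here simply keeps the newest value per key:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DictionaryState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dictionary-state").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/dictionary-checkpoint") // required by updateStateByKey

    // Seed the state once at startup; in your case this RDD would be
    // built from the Hive table instead of parallelize.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("A", "B"), ("C", "D")))

    // Keep the most recent value seen for each key.
    val update = (newValues: Seq[String], current: Option[String]) =>
      newValues.lastOption.orElse(current)

    // A socket source stands in for whatever stream delivers the updates,
    // one "key,value" pair per line.
    val updates = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(",") match {
        case Array(k, v) => Seq((k, v))
        case _           => Seq.empty
      })

    // The initialRDD overload: the state starts from the historical
    // dictionary, so the per-batch join is no longer needed.
    val dictionary = updates.updateStateByKey(
      update,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)

    dictionary.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```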