

How to Initialize the State of DStream in a spark streaming application

I have a spark-streaming application that basically keeps track of a string -> string dictionary.

So I have messages coming in with updates, like:

"A" -> "B"

And I need to update the dictionary.

This seems like a simple use case for the updateStateByKey method.

However, my issue is that when the app starts I need to "initialize" the dictionary with data from a Hive table that holds all the historical key/value pairs for the dictionary.

The only way I could think of is doing something like:

val rdd = … // get data from Hive
def process(input: DStream[(String, String)]) = {
    input.join(rdd).updateStateByKey(update)
}

So the join operation would be done on every incoming batch, when in fact I only need it at initialization.

Any idea how to achieve that?

Thanks

PairDStreamFunctions.updateStateByKey has an overload that accepts an initialRDD, which seems to be what you need:

updateStateByKey[S](updateFunc: (Seq[V], Option[S]) ⇒ Option[S], partitioner: Partitioner, initialRDD: RDD[(K, S)])(implicit arg0: ClassTag[S]): DStream[(K, S)]
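
For illustration, here is a minimal sketch of how that overload could be used for this case. The SparkSession spark, the StreamingContext ssc, and the Hive table name dict_history are assumptions for the example, not part of the original question:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Load the historical key/value pairs from Hive once, at startup.
// (`spark` and the table name `dict_history` are assumed here.)
val initialState: RDD[(String, String)] =
  spark.sql("SELECT key, value FROM dict_history")
    .rdd
    .map(row => (row.getString(0), row.getString(1)))

// Keep the most recent value seen for each key; fall back to the
// existing state when a batch carries no update for that key.
def update(newValues: Seq[String], current: Option[String]): Option[String] =
  newValues.lastOption.orElse(current)

def process(input: DStream[(String, String)]): DStream[(String, String)] =
  input.updateStateByKey(
    update _,
    new HashPartitioner(ssc.sparkContext.defaultParallelism),
    initialState
  )

With this overload the historical data is folded into the state once, when the stateful stream is set up, so no per-batch join is needed.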
