How to initialize the state of a DStream in a Spark Streaming application
I have a Spark Streaming application that basically keeps track of a String -> String dictionary.
So I have messages coming in with updates, like:
"A" -> "B"
and I need to update the dictionary.
This seems like a simple use case for the updateStateByKey method.
However, my issue is that when the app starts I need to "initialize" the dictionary with data from a Hive table that holds all the historical key/values of the dictionary.
The only way I could think of is doing something like:
val rdd = … // get data from Hive
def process(input: DStream[(String, String)]) = {
  input.join(rdd).updateStateByKey(update)
}
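For the elided Hive read, something along these lines is a reasonable sketch in Spark 1.x (the table name `dictionary_history` and its column layout are assumptions, not from the original question):

```scala
import org.apache.spark.sql.hive.HiveContext

// sc is the application's SparkContext; the table and column names
// below are hypothetical stand-ins for the real Hive schema.
val hiveContext = new HiveContext(sc)
val rdd = hiveContext
  .sql("SELECT key, value FROM dictionary_history")
  .rdd
  .map(row => (row.getString(0), row.getString(1)))
```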
So the join operation would be done on every incoming batch, when in fact I only need it at initialization.
Any idea how to achieve that?
Thanks
PairDStreamFunctions.updateStateByKey has an overload that accepts an initialRDD, which seems to be what you need:
updateStateByKey[S](updateFunc: (Seq[V], Option[S]) ⇒ Option[S], partitioner: Partitioner, initialRDD: RDD[(K, S)])(implicit arg0: ClassTag[S]): DStream[(K, S)]
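Putting it together, a minimal sketch might look like the following. The socket source, app name, checkpoint path, and the `parallelize` call seeding the state are all placeholders; in your application the initial RDD would come from the Hive query, and the update function here simply keeps the newest value per key:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DictionaryState {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dictionary-state").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/dictionary-checkpoint") // required by updateStateByKey

    // Seed the state once at startup; in your case this RDD would be
    // built from the Hive table instead of parallelize.
    val initialRDD = ssc.sparkContext.parallelize(Seq(("A", "B"), ("C", "D")))

    // Keep the most recent value seen for each key.
    val update = (newValues: Seq[String], current: Option[String]) =>
      newValues.lastOption.orElse(current)

    // A socket source stands in for whatever stream delivers the updates,
    // one "key,value" pair per line.
    val updates = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(",") match {
        case Array(k, v) => Seq((k, v))
        case _           => Seq.empty
      })

    // The initialRDD overload: the state starts from the historical
    // dictionary, so the per-batch join is no longer needed.
    val dictionary = updates.updateStateByKey(
      update,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialRDD)

    dictionary.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```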