
Should I use state computation? Spark Streaming state computation explanation

Here is my case: I receive data from different devices, each of which has its own signature, a timestamp and a flag. I then save the records where flag==SAVE_VALUE to a file using a foreachRDD function, but only if the record passes this condition:

(it is the first time I receive this signature)
OR
(I already have this signature && the stored timestamp is more than an hour old)

Until now I was in a local environment, so this meant using a Map where I stored all the IDs and the last timestamp received for each. Now I would like to move this logic to Spark. How should I do it?
I feel this is a case for a stateful DStream, but I cannot completely understand:

  • How should I store a map-like RDD in a DStream? Or how do I create a single "map RDD"?
  • How do I compare the new data arriving against it?

Have a look at mapWithState(); it is exactly what you want.

In the StateSpecFunction you can decide whether to update, keep, or remove the current state whenever a new value arrives for the same key. You have access to both the current state and the new value, so you can do any type of comparison between the two.
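For example, the condition from the question could be written as a mapping function along these lines. This is a minimal sketch in Scala; the name trackDevice, the (signature, timestamp) pair layout and the millisecond arithmetic are assumptions for illustration, not code from the question:

    import org.apache.spark.streaming.State

    val oneHourMs = 60 * 60 * 1000L

    // Key: device signature. Value: event timestamp in milliseconds.
    // State: last timestamp that was saved for this signature.
    def trackDevice(signature: String,
                    timestamp: Option[Long],
                    state: State[Long]): Option[(String, Long)] = {
      val ts = timestamp.getOrElse(0L)
      if (!state.exists() || ts - state.get() > oneHourMs) {
        // First time this signature is seen, or the stored timestamp
        // is more than an hour old: update the state and emit the record.
        state.update(ts)
        Some((signature, ts))
      } else {
        // Otherwise keep the existing state and emit nothing.
        None
      }
    }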

It also has built-in support for timeouts, and the state can be partitioned across multiple executors.
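Both are configured on the StateSpec. Continuing from the trackDevice sketch above (the 90-minute timeout and the partition count are illustrative values, not requirements):

    import org.apache.spark.streaming.{Minutes, StateSpec}

    val spec = StateSpec.function(trackDevice _)
      .timeout(Minutes(90))  // drop state for signatures idle longer than 90 minutes
      .numPartitions(8)      // spread the state map across executors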

You can access the global map by calling stateSnapshots() on the return value of mapWithState(). Otherwise, the return value is determined per batch by the return values of your StateSpecFunction.
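Tying it together, assuming a DStream of (signature, timestamp) pairs named events (the name is hypothetical):

    val tracked = events.mapWithState(spec)  // per-batch output of trackDevice
    val toSave  = tracked.flatMap(x => x)    // only the records that passed the condition

    toSave.foreachRDD { rdd =>
      // write rdd to a file, as in the original logic
    }

    // Full signature -> last-timestamp map, emitted every batch interval
    tracked.stateSnapshots().print()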

mapWithState() was added in Spark 1.6. Before that there was a similar function called updateStateByKey(), which did mostly the same but performed worse on larger datasets.
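For reference, a rough pre-1.6 equivalent of the state update with updateStateByKey() could look like the sketch below (same assumed (signature, timestamp) pairs; note that it recomputes and re-emits the state for every key on every batch, which is part of why it scales worse):

    // Keep the most recent timestamp seen per signature across batches.
    val states = events.updateStateByKey[Long] { (newTs: Seq[Long], current: Option[Long]) =>
      (newTs ++ current).reduceOption(_ max _)
    }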
