Iterative programming on an ordered spark stream using Java in Spark Streaming?

Is there any way in Spark Streaming to keep data across multiple micro-batches of a sorted DStream, where the stream is sorted using timestamps? (Assuming monotonically arriving data.) Can anyone make suggestions on how to keep data across iterations, where each iteration is an RDD being processed in a JavaDStream?

What does iteration mean?

I first sort the DStream using timestamps, assuming that the data arrives with monotonically increasing timestamps (no out-of-order records).

I need a global HashMap X, which I would like to update using values with timestamp "t1", and then subsequently with values at "t1+1". Since the state of X itself impacts the calculations, it needs to be a linear operation: the operation at "t1+1" depends on HashMap X, which in turn depends on the data at and before "t1".
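For context, this is roughly what the per-batch sorting step looks like (a minimal sketch; the stream name and element types are placeholders):

    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Sort each micro-batch by its timestamp key; "eventStream" and the
    // (Long timestamp, String payload) element type are illustrative.
    JavaPairDStream<Long, String> sortedStream =
        eventStream.transformToPair(rdd -> rdd.sortByKey(true));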

Application

This is especially the case when one is trying to update a model, compare two sets of RDDs, or keep a global history of certain events, etc., which will impact operations in future iterations.

I would like to keep some accumulated history for the calculations: not the entire dataset, but certain persisted events which can be used in future DStream RDDs.

updateStateByKey does precisely that: it lets you define some state, as well as a function to update it based on each RDD in your stream. This is the typical way to accumulate historical computation over time.

From the doc:

The updateStateByKey operation allows you to maintain arbitrary state while continuously updating it with new information. To use this, you will have to do two steps.

  1. Define the state - The state can be of arbitrary data type.
  2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from the input stream.

More info here: https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#updatestatebykey-operation
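For illustration, here is a minimal Java sketch modeled on the running-count example in the guide linked above. The stream name pairs, the checkpoint path, and the integer-sum state are assumptions for this example, and it uses the Spark 1.x Java API, where the state is wrapped in Guava's Optional:

    import java.util.List;
    import com.google.common.base.Optional;  // Spark 1.x Java API; newer releases use org.apache.spark.api.java.Optional
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.api.java.JavaPairDStream;

    // Stateful operations need a checkpoint directory:
    // jssc.checkpoint("hdfs://...");  // hypothetical path

    // Merge the new values of each batch into the state kept for that key.
    Function2<List<Integer>, Optional<Integer>, Optional<Integer>> updateFunction =
        (newValues, state) -> {
            int sum = state.isPresent() ? state.get() : 0;  // previous state, or the initial value
            for (Integer v : newValues) {
                sum += v;
            }
            return Optional.of(sum);  // new state carried over to the next batch
        };

    // "pairs" is an assumed JavaPairDStream<String, Integer> of (key, value) events.
    JavaPairDStream<String, Integer> runningTotals = pairs.updateStateByKey(updateFunction);

Each batch then works against the state accumulated from all earlier batches, which is the kind of history you describe.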

If this does not cut it or you need more flexibility, you can always store to a key-value store like Cassandra explicitly (cf. the Cassandra connector: https://github.com/datastax/spark-cassandra-connector ), although this option is typically slower since it systematically involves a network transfer at every lookup.
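For completeness, a rough sketch of writing per-key state out with the connector's Java API; the keyspace, table, and bean class here are hypothetical, and the builder methods should be checked against the connector version you use:

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
    import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

    import java.io.Serializable;
    import org.apache.spark.api.java.JavaRDD;

    public class StateWriter {

        // Hypothetical bean mapped to a Cassandra table my_keyspace.event_state(key, total).
        public static class EventState implements Serializable {
            private String key;
            private Integer total;
            public EventState() { }
            public EventState(String key, Integer total) { this.key = key; this.total = total; }
            public String getKey() { return key; }
            public Integer getTotal() { return total; }
        }

        // Persist one micro-batch worth of accumulated state; call this from foreachRDD on the stream.
        public static void saveState(JavaRDD<EventState> stateRdd) {
            javaFunctions(stateRdd)
                .writerBuilder("my_keyspace", "event_state", mapToRow(EventState.class))
                .saveToCassandra();
        }
    }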
