I am just starting out with Apache Flink using Scala. Can someone please tell me how to create a lagged stream (lagged by k events or k units of time) from a data stream that I already have?
Basically, I want to implement an autoregression model on a data stream (linear regression of the stream on a time-lagged version of itself). So I need a method similar to the following pseudocode.
val ds: DataStream[X] = ...
val laggedDS: DataStream[X] = lag(ds, k)
def lag(ds: DataStream[X], k: Time): DataStream[X] = {
  // should return ds delayed by k
}
For example, if events are spaced at 1-second intervals and the lag is 2 seconds, I expect input and output like this.
Input : 1, 2, 3, 4, 5, 6, 7...
Output: NA, NA, 1, 2, 3, 4, 5...
If I understood your requirements correctly, I would implement this as a FlatMapFunction with a FIFO queue. The queue buffers k events and emits the head whenever a new event arrives. If you need a fault-tolerant streaming application, the queue must be registered as state. Flink will then take care of checkpointing the state (i.e., the queue) and restoring it in case of a failure.
The FlatMapFunction could look like this:
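import org.apache.flink.api.common.functions.FlatMapFunction
import org.apache.flink.streaming.api.checkpoint.Checkpointed
import org.apache.flink.util.Collector
import scala.collection.mutable

// X is the event type of the stream
class Lagger[X](val k: Int)
    extends FlatMapFunction[X, X]
    with Checkpointed[mutable.Queue[X]] {

  // holds the most recent events; at most k of them between calls
  var fifo: mutable.Queue[X] = new mutable.Queue[X]()

  override def flatMap(value: X, out: Collector[X]): Unit = {
    // add new element to queue
    fifo.enqueue(value)
    if (fifo.size == k + 1) {
      // remove head element and emit
      out.collect(fifo.dequeue())
    }
  }

  // restore state
  override def restoreState(state: mutable.Queue[X]): Unit = { fifo = state }

  // get state to checkpoint
  override def snapshotState(cId: Long, cTS: Long): mutable.Queue[X] = fifo
}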
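To check the queue logic against the sample input from the question, here is a minimal plain-Scala sketch that applies the same enqueue/dequeue rule to a finite sequence. It does not use Flink at all; the names CountLagSketch and lagByCount are made up for this illustration.

```scala
import scala.collection.mutable

object CountLagSketch {
  // Hypothetical helper: delays a finite sequence by k elements,
  // using the same FIFO rule as the flatMap method above.
  def lagByCount[A](input: Seq[A], k: Int): List[A] = {
    val fifo = mutable.Queue.empty[A]
    val out = mutable.ListBuffer.empty[A]
    for (value <- input) {
      // add new element to queue
      fifo.enqueue(value)
      if (fifo.size == k + 1) {
        // queue holds k + 1 elements: emit the oldest one
        out += fifo.dequeue()
      }
    }
    out.toList
  }
}
```

With the input 1, 2, 3, 4, 5, 6, 7 and k = 2, this yields 1, 2, 3, 4, 5. Note that nothing is emitted for the first k events, so the NA placeholders from the question do not appear in the output.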
Returning elements with a time lag is more involved. This would require timer threads for the emission because the function is only called when a new element arrives.
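To illustrate the difference, here is a simplified plain-Scala sketch of a time-based lag that does not use timers. It assumes every event carries its own timestamp and that events arrive in timestamp order, and it releases a buffered event only when a later arrival shows that the lag has elapsed. Event, TimeLagSketch and lagByTime are hypothetical names for this illustration, not Flink API.

```scala
import scala.collection.mutable

object TimeLagSketch {
  // Hypothetical event type: a value plus an event timestamp in milliseconds.
  final case class Event[A](timestampMs: Long, value: A)

  // Releases each buffered event once an arriving event is at least lagMs newer.
  // Assumption: input is ordered by timestamp.
  def lagByTime[A](input: Seq[Event[A]], lagMs: Long): List[A] = {
    val buffer = mutable.Queue.empty[Event[A]]
    val out = mutable.ListBuffer.empty[A]
    for (event <- input) {
      buffer.enqueue(event)
      // emit every buffered event whose lag has elapsed as of this arrival
      while (buffer.nonEmpty && buffer.head.timestampMs + lagMs <= event.timestampMs) {
        out += buffer.dequeue().value
      }
    }
    out.toList
  }
}
```

For events 1 to 7 spaced one second apart and a lag of 2000 ms, this yields 1, 2, 3, 4, 5. Events that are still buffered when the stream goes quiet are never emitted, which is the limitation that timer-driven emission would address.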