How does Spark Structured Streaming flush in-memory state when state data is no longer being checked?
I am trying to build a sessionization application with Spark Structured Streaming (version 2.2.0).
In the case of using mapGroupsWithState with Update mode, I understand that the executor will crash with an OOM exception if the state data grows too large. Hence, I have to manage the memory with the GroupStateTimeout option. (Ref. How does Spark Structured Streaming handle in-memory state when state data is growing?)
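For reference, enabling an event-time timeout requires a watermark on the stream. A minimal sketch of the wiring, where the column name "eventTime" and the 30-minute threshold are illustrative assumptions rather than values from the question:

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

val sessions = myDataset
  .withWatermark("eventTime", "30 minutes") // required before EventTimeTimeout can fire
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)

Without the withWatermark call, Spark has no event-time threshold to compare timeout timestamps against, and EventTimeTimeout will never trigger.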
However, I can't check whether the state is timed out and ready to be removed if there is no more new streaming data for the particular keys.
For example, let's say I have the following code.
myDataset
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)
The makeSession() function checks whether the state has timed out and removes the timed-out state.

Now, let's say the key "foo" already has some stored state in memory, and no new data with the key "foo" is streaming into the application. As a result, makeSession() does not process any data with key "foo" and the stored state is never checked. That means the stored state for key "foo" persists in memory. If there are many keys like "foo", the stored states will never be flushed and the JVM will raise an OOM exception.
I might be misunderstanding mapGroupsWithState, but I suspect my OOM exception is caused by the above issue.

If I am correct, what would be the solution in this case? I want to flush all the stored states that have timed out and have no more new streaming data.

Is there any good code example?
Now, let's say the key "foo" already has some stored state in memory, and no new data with the key "foo" is streaming into the application. As a result, makeSession() does not process any data with key "foo" and the stored state is never checked.
This is incorrect. As long as you have new data for any key, Spark will make sure that each batch scans the entire key set, and invokes the function for the timed-out keys one last time.
As part of every call to flat/mapGroupsWithState, we have:
val outputIterator =
  updater.updateStateForKeysWithData(filteredIter) ++
  updater.updateStateForTimedOutKeys()
And this is updateStateForTimedOutKeys:
def updateStateForTimedOutKeys(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutKeys = store.filter { case (_, stateRow) =>
      val timeoutTimestamp = getTimeoutTimestamp(stateRow)
      timeoutTimestamp != NO_TIMESTAMP && timeoutTimestamp < timeoutThreshold
    }
    timingOutKeys.flatMap { case (keyRow, stateRow) =>
      callFunctionAndUpdateState(keyRow, Iterator.empty, Some(stateRow), hasTimedOut = true)
    }
  } else Iterator.empty
}
The relevant part is the flatMap over the timed-out keys, which invokes the user function one last time for each of them with hasTimedOut = true.
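To address the question's request for a code example: a state function along these lines handles that final timeout invocation. This is a sketch under stated assumptions, not code from the question; the SessionEvent/SessionState/SessionOutput case classes and their fields are hypothetical, and it assumes a withWatermark("eventTime", ...) call upstream so EventTimeTimeout has a watermark to compare against:

import org.apache.spark.sql.streaming.GroupState

// Hypothetical types for illustration.
case class SessionEvent(key: String, eventTime: java.sql.Timestamp)
case class SessionState(count: Long)
case class SessionOutput(key: String, count: Long, expired: Boolean)

def makeSession(
    key: String,
    events: Iterator[SessionEvent],
    state: GroupState[SessionState]): Iterator[SessionOutput] = {
  if (state.hasTimedOut) {
    // Invoked once, with an empty event iterator, after the watermark passes
    // the timeout timestamp. Removing the state here flushes it from memory.
    val finalCount = state.getOption.map(_.count).getOrElse(0L)
    state.remove()
    Iterator(SessionOutput(key, finalCount, expired = true))
  } else {
    val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    // Expire this session 30 minutes (event time) past the current watermark.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 30 * 60 * 1000)
    Iterator(SessionOutput(key, updated.count, expired = false))
  }
}

The key point is the state.hasTimedOut branch: it is where the "no more new data for this key" case is handled, and calling state.remove() there is what prevents stale sessions from accumulating until an OOM.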