How does Spark Structured Streaming flush in-memory state when state data is no longer being checked?
I am trying to build a sessionization application with Spark Structured Streaming (version 2.2.0).
In the case of using mapGroupsWithState with Update mode, I understand that the executor will crash with an OOM exception if the state data grows too large. Hence, I have to manage the memory with the GroupStateTimeout option. (Ref. How does Spark Structured Streaming handle in-memory state when state data is growing?)
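For reference, enabling an event-time timeout requires a watermark on the stream. A minimal sketch of the wiring, where the column name "eventTime" and the 30-minute threshold are illustrative assumptions rather than values from the question:

import org.apache.spark.sql.streaming.{GroupStateTimeout, OutputMode}

val sessions = myDataset
  .withWatermark("eventTime", "30 minutes") // required before EventTimeTimeout can fire
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)

Without the withWatermark call, Spark has no event-time threshold to compare timeout timestamps against, and EventTimeTimeout will never trigger.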
However, I can't check whether the state is timed out and ready to be removed if there is no more new streaming data for the particular keys.
For example, let's say I have the following code.
myDataset
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.EventTimeTimeout)(makeSession)
The makeSession() function checks whether the state has timed out and removes the timed-out state.

Now, let's say the key "foo" already has some stored state in memory, and no new data with the key "foo" is streaming into the application. As a result, makeSession() does not process any data with key "foo" and the stored state is never checked. That means the stored state for key "foo" persists in memory. If there are many keys like "foo", the stored states will never be flushed and the JVM will raise an OOM exception.
I might be misunderstanding mapGroupsWithState, but I suspect my OOM exception is caused by the above issue.

If I am correct, what would be the solution in this case? I want to flush all the stored states that have timed out and have no more new streaming data.

Is there any good code example?
Now, let's say the key "foo" already has some stored state in memory, and no new data with the key "foo" is streaming into the application. As a result, makeSession() does not process any data with key "foo" and the stored state is never checked.
This is incorrect. As long as you have new data for any key, Spark will make sure that each batch scans the entire key set, and invokes the function for the timed-out keys one last time.
As part of every call to flat/mapGroupsWithState, we have:
val outputIterator =
  updater.updateStateForKeysWithData(filteredIter) ++
  updater.updateStateForTimedOutKeys()
And this is updateStateForTimedOutKeys:
def updateStateForTimedOutKeys(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutKeys = store.filter { case (_, stateRow) =>
      val timeoutTimestamp = getTimeoutTimestamp(stateRow)
      timeoutTimestamp != NO_TIMESTAMP && timeoutTimestamp < timeoutThreshold
    }
    timingOutKeys.flatMap { case (keyRow, stateRow) =>
      callFunctionAndUpdateState(keyRow, Iterator.empty, Some(stateRow), hasTimedOut = true)
    }
  } else Iterator.empty
}
The relevant part is the flatMap over the timed-out keys, which invokes the user function one last time for each of them with hasTimedOut = true.
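To address the question's request for a code example: a state function along these lines handles that final timeout invocation. This is a sketch under stated assumptions, not code from the question; the SessionEvent/SessionState/SessionOutput case classes and their fields are hypothetical, and it assumes a withWatermark("eventTime", ...) call upstream so EventTimeTimeout has a watermark to compare against:

import org.apache.spark.sql.streaming.GroupState

// Hypothetical types for illustration.
case class SessionEvent(key: String, eventTime: java.sql.Timestamp)
case class SessionState(count: Long)
case class SessionOutput(key: String, count: Long, expired: Boolean)

def makeSession(
    key: String,
    events: Iterator[SessionEvent],
    state: GroupState[SessionState]): Iterator[SessionOutput] = {
  if (state.hasTimedOut) {
    // Invoked once, with an empty event iterator, after the watermark passes
    // the timeout timestamp. Removing the state here flushes it from memory.
    val finalCount = state.getOption.map(_.count).getOrElse(0L)
    state.remove()
    Iterator(SessionOutput(key, finalCount, expired = true))
  } else {
    val updated = SessionState(state.getOption.map(_.count).getOrElse(0L) + events.size)
    state.update(updated)
    // Expire this session 30 minutes (event time) past the current watermark.
    state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 30 * 60 * 1000)
    Iterator(SessionOutput(key, updated.count, expired = false))
  }
}

The key point is the state.hasTimedOut branch: it is where the "no more new data for this key" case is handled, and calling state.remove() there is what prevents stale sessions from accumulating until an OOM.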