
How does Spark Structured Streaming handle in-memory state when state data is growing?

In Spark Structured Streaming (version 2.2.0), when using a mapGroupsWithState query with Update as the output mode, it seems that Spark stores the in-memory state data in a java.util.ConcurrentHashMap. Can someone explain in detail what happens when the state data grows and there isn't enough memory anymore? Also, is it possible to change the limit for storing state data in memory using a Spark config parameter?

Can someone explain in detail what happens when the state data grows and there isn't enough memory anymore?

The executor will crash with an OOM exception. Since with mapGroupsWithState you are the one in charge of adding and removing state, if you overwhelm the JVM with data it cannot allocate memory for, the process will crash.

Is it possible to change the limit for storing state data in memory using a Spark config parameter?

It isn't possible to limit the number of bytes you're storing in memory. Again, if this is mapGroupsWithState, you have to manage state in a way that won't cause your JVM to OOM, such as setting timeouts and removing state. If we're talking about stateful aggregations where Spark manages the state for you, such as the agg combinator, then you can limit the state using a watermark, which will evict old data from memory once the time frame passes.
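To illustrate the "setting timeouts and removing state" approach, here is a minimal sketch of a mapGroupsWithState update function that evicts a key's state when its timeout fires. The event and state case classes, field names, and the 30-minute timeout are illustrative assumptions, not from the original question:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical event/state types for illustration only.
case class Event(userId: String, ts: Timestamp)
case class SessionState(count: Long)
case class SessionUpdate(userId: String, count: Long)

// The update function is where you keep state bounded: on timeout,
// state.remove() drops this key's entry from the in-memory state store.
def updateSession(
    userId: String,
    events: Iterator[Event],
    state: GroupState[SessionState]): SessionUpdate = {
  if (state.hasTimedOut) {
    state.remove()                          // evict state for this key
    SessionUpdate(userId, 0L)
  } else {
    val old = state.getOption.getOrElse(SessionState(0L))
    val updated = SessionState(old.count + events.size)
    state.update(updated)
    state.setTimeoutDuration("30 minutes")  // processing-time timeout
    SessionUpdate(userId, updated.count)
  }
}

// Wiring it up (events is a streaming Dataset[Event]):
// val sessions = events
//   .groupByKey(_.userId)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateSession)
```

Without the timeout-and-remove branch, state for every key ever seen would stay in memory for the lifetime of the query.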

The existing state store implementation uses in-memory HashMaps (for storage) plus HDFS (for fault tolerance). The HashMaps are versioned (one per micro-batch). There is a separate key-value map for each version of every aggregated partition in the executor memory of the worker (the number of versions to maintain is configurable). To answer your questions:
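The number of retained state versions is controlled through Spark SQL configuration; a sketch of tuning it at session creation, assuming Spark 2.2+ and that `spark.sql.streaming.minBatchesToRetain` is the relevant knob (verify against your Spark version's configuration reference):

```scala
import org.apache.spark.sql.SparkSession

// Configuration sketch (not a memory limit): fewer retained state
// versions means fewer versioned HashMaps held per partition.
val spark = SparkSession.builder()
  .appName("stateful-stream")
  .config("spark.sql.streaming.minBatchesToRetain", "10")
  .getOrCreate()
```

Note this bounds how many historical versions are kept, not the size of the current state itself.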

Can someone explain in detail what happens when the state data grows and there isn't enough memory anymore?

The state store HashMaps share executor memory with shuffle tasks. So as state grows, or as shuffle tasks need more memory, frequent GCs and OOMs will occur, leading to executor failures.

Is it possible to change the limit for storing state data in memory using a Spark config parameter?

Currently that is not possible. You can only specify the executor memory, which is shared by both the state store and executor tasks; there is no way to divide memory between them. This actually makes the current implementation unreliable in the case of sudden data bursts; even watermarks will not help in those cases.
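For completeness, here is a minimal sketch of the watermark mechanism both answers mention, which bounds state for Spark-managed aggregations (column names and thresholds are illustrative assumptions):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

// events: a streaming DataFrame with columns (eventTime: Timestamp, userId: String).
// The watermark tells Spark it may evict aggregation state for windows
// older than 10 minutes behind the max observed event time.
def windowedCounts(events: DataFrame): DataFrame =
  events
    .withWatermark("eventTime", "10 minutes")
    .groupBy(window(col("eventTime"), "5 minutes"), col("userId"))
    .count()
```

As the answer notes, this caps state under normal conditions but cannot protect against a sudden burst arriving within the watermark window, since all of that state is still live.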
If you are interested in how the state store works internally in Structured Streaming, this article might be useful: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/

P.S. I am the author of that article.
