

Where is Spark structured streaming state of mapGroupsWithState stored?

I know that the state is persisted at the checkpoint location as the state store, but while it is still in memory, where is it kept?

I created a streaming job that uses mapGroupsWithState, but I see that the storage memory used by the executors is 0.

Does this mean that the state is stored in execution memory? I can't tell how much memory the state consumes, so I'm not sure whether I need to increase the executor memory or not.

Also, is it possible to avoid checkpointing the state altogether and keep it always in memory?

As mapGroupsWithState is an aggregation, its state is kept where all aggregation state lives for the lifetime of a Spark application: in execution memory (as you have already assumed).

Looking at the signature of the method

def mapGroupsWithState[S: Encoder, U: Encoder](
      func: (K, Iterator[V], GroupState[S]) => U): Dataset[U] 

you will notice that S is the type of the user-defined state, and this is where the state is managed.

As this state will be sent to the executors, it must be encodable to Spark SQL types; therefore you would typically use a case class in Scala or a bean in Java. GroupState is a typed wrapper object that provides methods to access and manage the state value.
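To make this concrete, here is a minimal sketch of such a user-defined state. The Event and RunningTotal case classes, the rate source, the column names, and the checkpoint path are all made up for illustration:

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, OutputMode}

// Hypothetical input event and user-defined state S; both are case classes so
// that Spark SQL can encode them.
case class Event(userId: String, amount: Double, eventTime: Timestamp)
case class RunningTotal(count: Long, sum: Double)

object MapGroupsWithStateSketch {

  // The update function matching (K, Iterator[V], GroupState[S]) => U:
  // Spark keeps one RunningTotal per key between micro-batches, wrapped in GroupState.
  def updateTotals(
      userId: String,
      events: Iterator[Event],
      state: GroupState[RunningTotal]): (String, RunningTotal) = {
    val old     = state.getOption.getOrElse(RunningTotal(0L, 0.0))
    val batch   = events.toSeq
    val updated = RunningTotal(old.count + batch.size, old.sum + batch.map(_.amount).sum)
    state.update(updated)   // the new state value Spark will keep for this key
    (userId, updated)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mapGroupsWithState-sketch").getOrCreate()
    import spark.implicits._

    // The built-in rate source is used only so the example runs without external systems.
    val events: Dataset[Event] = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()
      .select(
        ($"value" % 5).cast("string").as("userId"),
        ($"value" % 100).cast("double").as("amount"),
        $"timestamp".as("eventTime"))
      .as[Event]

    val totals = events
      .groupByKey(_.userId)
      .mapGroupsWithState(updateTotals _)

    totals.writeStream
      .outputMode(OutputMode.Update())          // mapGroupsWithState requires Update mode
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/totals")  // placeholder path
      .start()
      .awaitTermination()
  }
}
```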

It is crucial that you, as the developer, also take care of how data gets removed from this state. Otherwise your state will inevitably cause an OOM, as it will only grow and never shrink.
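One way to do that, sketched here under the assumption that a processing-time timeout fits your use case (the 30-minute duration is arbitrary, and the case classes are the ones from the sketch above):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Update function that also expires state for keys that stay quiet too long.
def updateWithExpiry(
    userId: String,
    events: Iterator[Event],
    state: GroupState[RunningTotal]): (String, RunningTotal) = {
  if (state.hasTimedOut) {
    // No new data arrived within the timeout: drop the state so it cannot grow forever.
    val last = state.get
    state.remove()
    (userId, last)
  } else {
    val old     = state.getOption.getOrElse(RunningTotal(0L, 0.0))
    val batch   = events.toSeq
    val updated = RunningTotal(old.count + batch.size, old.sum + batch.map(_.amount).sum)
    state.update(updated)
    state.setTimeoutDuration("30 minutes")   // re-arm the timeout on every update
    (userId, updated)
  }
}

// Wire it up with the matching timeout configuration:
// events.groupByKey(_.userId)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateWithExpiry _)
```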

If you do not enable checkpointing in your structured stream, nothing is persisted, but the drawback is that you lose your state during a failure. If you have enabled checkpointing, e.g. to keep track of the input source, Spark will also store the current state in the checkpoint location.

If you enable checkpointing, the states are stored in the State Store. By default this is an HDFSBackedStateStore, but that can be overridden too. A good read on this is https://medium.com/@polarpersonal/state-storage-in-spark-structured-streaming-e5c8af7bf509
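For reference, the checkpoint location is set per query via the checkpointLocation option (as in the first sketch), and the state store backend can be swapped with a session config; a small sketch, assuming Spark 3.2+ for the RocksDB provider:

```scala
import org.apache.spark.sql.SparkSession

// Override the default HDFSBackedStateStoreProvider (the RocksDB provider ships with Spark 3.2+).
val spark = SparkSession.builder
  .appName("state-store-config-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .getOrCreate()
```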

As the other answer already mentioned, if you don't enable checkpointing you lose fault tolerance and the at-least-once guarantees.
