
Spark 2.3.1 Structured Streaming state store inner workings

Original post: 2018-08-17 10:29:00 · Tags: apache-spark, spark-structured-streaming

I have been going through the Spark 2.3.1 documentation on Structured Streaming, but could not find details of how stateful operations work internally with the state store. More specifically, what I would like to know is: (1) is the state store distributed? (2) If so, how: per worker or per core?

It seems that in previous versions of Spark it was per worker, but I have no idea how it works now. I know that it is backed by HDFS, but nothing explains how the in-memory store actually works.

Is it indeed a distributed in-memory store? I am particularly interested in de-duplication: if data is streamed from, say, a large data set, this needs to be planned for, because the entire "distinct" data set will ultimately be held in memory by the end of processing that data set. Hence one needs to plan the size of the workers or the master depending on how the state store works.

1 Answer

There is only one implementation of the state store in Structured Streaming, and it is backed by an in-memory HashMap and HDFS. The in-memory HashMap is for data storage, while HDFS is for fault tolerance. The HashMap occupies executor memory on the worker, and each HashMap holds versioned key-value data for one partition of the aggregated state (produced by a stateful operator such as deduplication, groupBy, etc.).
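To make the description above more concrete, here is a minimal toy model in plain Python. It is not Spark's code (the real implementation is written in Scala), and every name in it (`ToyStateStore`, `partition_for`, `deduplicate`, the `.delta` file naming) is hypothetical and chosen only for illustration. It assumes one in-memory map per partition, a version that increases with each committed batch, and a per-version file of changes written to a checkpoint directory that stands in for HDFS.

```python
import json
import os
import tempfile
import zlib


class ToyStateStore:
    """Toy model of one partition's state: an in-memory dict plus per-version files."""

    def __init__(self, checkpoint_dir, partition_id):
        # One directory per partition (stands in for a path on HDFS).
        self.dir = os.path.join(checkpoint_dir, str(partition_id))
        os.makedirs(self.dir, exist_ok=True)
        self.version = 0
        self.data = {}     # committed key-value state, held in memory
        self.pending = {}  # updates made during the current batch

    def put_if_absent(self, key, value):
        """Return True if the key is new, i.e. the row is not a duplicate."""
        if key in self.data or key in self.pending:
            return False
        self.pending[key] = value
        return True

    def commit(self):
        """Write this batch's updates to a new version file, then apply them in memory."""
        self.version += 1
        path = os.path.join(self.dir, f"{self.version}.delta")
        with open(path, "w") as f:
            json.dump(self.pending, f)
        self.data.update(self.pending)
        self.pending = {}
        return self.version

    @classmethod
    def recover(cls, checkpoint_dir, partition_id):
        """Rebuild the in-memory map after a failure by replaying the version files in order."""
        store = cls(checkpoint_dir, partition_id)
        names = [n for n in os.listdir(store.dir) if n.endswith(".delta")]
        for v in sorted(int(n.split(".")[0]) for n in names):
            with open(os.path.join(store.dir, f"{v}.delta")) as f:
                store.data.update(json.load(f))
            store.version = v
        return store


def partition_for(key, num_partitions):
    """Deterministically map a string key to a partition (hypothetical hashing scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def deduplicate(batch, stores):
    """Process one micro-batch: emit only unseen keys, then commit every partition."""
    out = []
    for key in batch:
        store = stores[partition_for(key, len(stores))]
        if store.put_if_absent(key, True):
            out.append(key)
    for store in stores:
        store.commit()
    return out


if __name__ == "__main__":
    checkpoint = tempfile.mkdtemp()
    stores = [ToyStateStore(checkpoint, p) for p in range(2)]
    print(deduplicate(["a", "b", "a"], stores))
    print(deduplicate(["b", "c"], stores))
```

In this sketch the state is spread across partitions, so total memory grows with the number of distinct keys seen, and each partition's share lives wherever that partition is processed. That is the sizing concern raised in the question: without some way of expiring old keys, the deduplication state keeps growing.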

But this does not explain how the HDFSBackedStateStore actually works. I don't see it in the documentation.

You are correct that there is no such documentation available. I had to read through the code (2.3.1) and wrote an article on how the state store works internally in Structured Streaming. You might like to have a look: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/