
Spark Structured Streaming custom StateStoreProvider

By default, a Structured Streaming job uses HDFSBackedStateStoreProvider. The problem with the HDFS-backed store is that it is not scalable. When the job receives more data from Kafka during high-traffic hours, it fails with the following error:

18/12/06 15:54:35 ERROR scheduler.TaskSetManager: Task 191 in stage 231.0 failed 4 times; aborting job
18/12/06 15:54:35 ERROR streaming.StreamExecution: Query eventQuery [id = 42051afe-b1bc-438d-8143-2d7e5def717c, runId = 6201c769-b115-4b92-bad5-450b8803b88b] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 191 in stage 231.0 failed 4 times, most recent failure: Lost task 191.3 in stage 231.0 (TID 24016, sparkstreamingc1n5.host.bo1.csnzoo.com, executor 659): java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
    at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
    at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
    at scala.Option.getOrElse(Option.scala:121)

How do I configure a custom state store provider?

For testing purposes I tried adding a fake class:

--conf spark.sql.streaming.stateStore.providerClass=com.streaming.state.RocksDBStateStoreProvider

But the job still picks HDFSBackedStateStoreProvider even though this class doesn't exist. Is this expected behavior?

Can I use any key-value database to write the custom state provider?

Or is it limited to RocksDB and Cassandra?

How do I configure a custom state store provider?

Your approach to configuring a custom state store provider looks correct, but you can't change the state store provider once the query has already been run. (Spark reads the configuration from the metadata in the checkpoint.) This restriction makes sense because, when the state store provider changes, the state is not guaranteed to be restorable.
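To make this concrete, the provider switch only takes effect for a query that starts against a new checkpoint location. Below is a minimal sketch under assumed settings: the provider class, Kafka brokers, topic, and checkpoint path are placeholders, and the Kafka source additionally requires the spark-sql-kafka package on the classpath.

import org.apache.spark.sql.SparkSession

object RestartWithNewCheckpoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("state-store-switch-demo")
      // Same key as the --conf above; the class name is a placeholder and must
      // point to a StateStoreProvider implementation that is on the classpath.
      .config("spark.sql.streaming.stateStore.providerClass",
              "com.streaming.state.RocksDBStateStoreProvider")
      .getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
      .option("subscribe", "events")                     // placeholder topic
      .load()
      .selectExpr("CAST(key AS STRING) AS k")

    // A stateful aggregation; its state is managed by the configured provider.
    val counts = events.groupBy($"k").count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      // Point at a NEW checkpoint directory: the provider configuration is read
      // from checkpoint metadata, so reusing an old checkpoint keeps the old provider.
      .option("checkpointLocation", "/tmp/checkpoints/events-v2") // placeholder path
      .start()
      .awaitTermination()
  }
}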

Can I use any key-value database to write the custom state provider?

There's no specific restriction, as long as your custom state provider implements the state store provider spec. Two major things to consider: 1. Spark checkpoints the state changes for every batch. 2. Spark requires the state store provider to be able to restore state at a specific version. Your custom state provider should also be performant, because it adds latency to every batch.
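To illustrate those two requirements, here is a small conceptual sketch in plain Scala. It is not the actual Spark StateStoreProvider trait (that internal API differs across Spark versions); it only shows the versioned commit/restore contract a provider has to honor, and all names are illustrative.

import scala.collection.mutable

// Conceptual illustration only: a versioned key-value store that can restore
// any committed version, mirroring what Spark expects from a state store
// provider (one commit per micro-batch, load state at a specific version).
class VersionedKVStore[K, V] {
  private val snapshots = mutable.Map[Long, Map[K, V]](0L -> Map.empty)
  private var latest: Long = 0L

  // Load the state as of a committed version, e.g. when Spark retries a batch.
  def load(version: Long): Map[K, V] =
    snapshots.getOrElse(version, sys.error(s"version $version not found"))

  // Apply a batch of updates on top of a base version and commit a new version.
  def commit(baseVersion: Long, updates: Map[K, V], removals: Set[K]): Long = {
    val next = (load(baseVersion) -- removals) ++ updates
    latest += 1
    snapshots(latest) = next
    latest
  }
}

object VersionedKVStoreDemo extends App {
  val store = new VersionedKVStore[String, Long]()
  val v1 = store.commit(0L, Map("user-a" -> 1L), Set.empty)
  val v2 = store.commit(v1, Map("user-a" -> 2L, "user-b" -> 1L), Set.empty)
  // A retry of the second batch would reload version v1, not the latest state:
  println(store.load(v1)) // Map(user-a -> 1)
  println(store.load(v2)) // Map(user-a -> 2, user-b -> 1)
}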

You may also want to consider the (transitive) dependencies that a custom state provider adds to the Spark application.
