
Spark on Kubernetes: how is Spark's stateful nature maintained in Kubernetes?

I am experimenting with Spark 2.3 on a K8s cluster. I'm wondering how checkpointing works. Where is the checkpoint stored? If the driver dies, what happens to the in-flight processing?

When consuming from Kafka, how are the offsets maintained? I tried to look this up online but could not find an answer to these questions. Our application consumes a lot of Kafka data, so it is essential that it can restart and pick up from where it was stopped.

Are there any gotchas when running Spark Streaming on K8s?

The Kubernetes Spark Controller doesn't know anything about checkpointing, AFAIK. It's just a way for Kubernetes to schedule your Spark driver and the workers it needs to run a job.
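To make the scheduling role concrete, this is roughly what a Spark 2.3 submission against a K8s master looks like. The API server URL, image name, job class, and jar path are placeholders for illustration, not values from the question:

```shell
# Sketch of a Spark 2.3 cluster-mode submission to Kubernetes.
# K8s only schedules the driver pod and the requested executor pods;
# nothing here is checkpoint- or offset-aware.
spark-submit \
  --master k8s://https://kube-apiserver:6443 \
  --deploy-mode cluster \
  --name my-streaming-job \
  --class com.example.StreamingJob \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=myrepo/spark:2.3.0 \
  local:///opt/spark/jars/streaming-job.jar
```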

Storing the offset is really up to your application: you decide where to store the Kafka offset, so that when the application restarts it picks up that offset and resumes consuming from there. Below is an example of how to store it in ZooKeeper.

You could, for example, write ZK offset-manager functions in Scala:

import com.metamx.common.scala.Logging
import org.apache.curator.framework.CuratorFramework

object OffsetManager extends Logging {

  // Read the last committed offset for a topic/partition from ZooKeeper.
  // The znode layout (/consumers/<topic>/<partition>) is an assumption
  // for illustration; use whatever path scheme fits your deployment.
  def getOffsets(client: CuratorFramework,
                 topic: String,
                 partition: Int): Option[Long] = {
    val path = s"/consumers/$topic/$partition"
    if (client.checkExists().forPath(path) != null)
      Some(new String(client.getData.forPath(path)).toLong)
    else
      None
  }

  // Write the latest processed offset back, creating the znode if needed.
  def setOffsets(client: CuratorFramework,
                 topic: String,
                 partition: Int,
                 offset: Long): Unit = {
    val path = s"/consumers/$topic/$partition"
    if (client.checkExists().forPath(path) == null)
      client.create().creatingParentsIfNeeded().forPath(path)
    client.setData().forPath(path, offset.toString.getBytes)
  }
}

Another way would be to store your Kafka offsets in something reliable like HDFS.
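Note that if you use Structured Streaming, Spark itself persists Kafka offsets in its checkpoint directory, so simply pointing `checkpointLocation` at durable storage (HDFS, S3, etc.) lets a restarted driver resume where it left off. This is a minimal sketch; the broker address, topic, and paths are assumed placeholders:

```scala
import org.apache.spark.sql.SparkSession

object KafkaCheckpointExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-checkpoint-example")
      .getOrCreate()

    // Kafka source: offsets are tracked by Spark via the checkpoint, not by Kafka.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // assumed broker address
      .option("subscribe", "events")                   // assumed topic name
      .load()

    // The checkpoint must live on storage that survives the driver pod,
    // not on the pod's local filesystem.
    df.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()
      .awaitTermination()
  }
}
```

The key point for K8s is that the driver pod is ephemeral, so any checkpoint path on the pod's own disk disappears with it.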

