
Submit jobs to Apache-Spark while being behind a firewall

Use case: I am behind a firewall and have a remote Spark cluster I can access, but those machines cannot connect directly back to me.

As the Spark documentation states, the workers must be able to reach the driver program:

Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

The suggested solution is therefore to have a server process running on the cluster that listens for RPCs and itself executes the Spark driver program locally, on the cluster side.

Does such a program already exist? Such a process should manage one or more RPCs, return exceptions, and handle logs.

Also, in that case, is it my local program or the Spark driver that has to create the SparkContext?

Note: I have a standalone cluster.

Solution 1:

A simple way would be to use cluster mode (similar to --deploy-mode cluster) for the standalone cluster; however, the docs say:

Currently, standalone mode does not support cluster mode for Python applications.
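
For JVM (Scala/Java) applications, though, cluster mode on a standalone cluster does work and sidesteps the firewall problem, because the driver then runs on one of the cluster machines and only the submitting machine needs to reach the master. A minimal sketch (the master host, class name and JAR location are placeholders, and the JAR must be reachable from the cluster, e.g. on HDFS or copied to every node):

    # hypothetical example: host, class and JAR location are placeholders
    spark-submit \
      --master spark://<master-host>:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      hdfs:///jobs/my-app.jar

For Python applications on a standalone cluster this route is not available, which is what the options below work around.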

Just a few options:

  • Connect to a cluster node using ssh, start screen, submit the Spark application there, and come back later to check the results.
  • Deploy middleware like Job Server, Livy or Mist on your cluster and use it for submissions (a Livy sketch follows this list).
  • Deploy a notebook (Zeppelin, Toree) on your cluster and submit applications from it.
  • Set a fixed spark.driver.port and ssh-forward all driver connections through one of the cluster nodes, advertising that node's IP to the executors via spark.driver.host while binding locally via spark.driver.bindAddress (a sketch follows this list).
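
Of the middleware options, Livy is probably the closest match to the "server process listening to RPC" described in the question: it runs inside the cluster, exposes a single HTTP port, starts the driver on your behalf, and lets you poll state and logs. A rough sketch against its batches API (host and file path are placeholders, the job file must be readable from the cluster, and the batch id comes from the POST response):

    # submit a batch job through Livy (default port 8998)
    curl -s -X POST http://<cluster-node>:8998/batches \
      -H 'Content-Type: application/json' \
      -d '{"file": "hdfs:///jobs/my_job.py", "args": ["2020-01-01"]}'

    # poll the batch state and fetch the driver log (id 0 taken from the response above)
    curl -s http://<cluster-node>:8998/batches/0
    curl -s http://<cluster-node>:8998/batches/0/log

In this setup the SparkContext is created by the driver that Livy launches on the cluster, so the local program only speaks HTTP, which also answers the SparkContext question above.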
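
For the last option, the idea is that executors connect to whatever address the driver advertises, so you advertise a cluster node and tunnel those connections back to your machine. A minimal sketch, assuming <gateway> is a cluster node you can ssh into and its sshd allows the forwarded ports to be reached from the other nodes (GatewayPorts); the port numbers are arbitrary:

    # reverse-forward the fixed driver RPC and block manager ports to this machine
    ssh -N -R 0.0.0.0:35000:localhost:35000 -R 0.0.0.0:36000:localhost:36000 user@<gateway> &

    # run the driver locally, but advertise the gateway's address to the cluster
    spark-submit \
      --master spark://<master-host>:7077 \
      --conf spark.driver.port=35000 \
      --conf spark.driver.blockManager.port=36000 \
      --conf spark.driver.host=<gateway> \
      --conf spark.driver.bindAddress=0.0.0.0 \
      my_app.py

Here the SparkContext is still created by your local program; the tunnel just makes the driver look, from the workers' point of view, as if it were inside the cluster.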
