
Submit jobs to Apache Spark while behind a firewall

Use case: I'm behind a firewall and I have a remote Spark cluster I can access; however, those machines cannot connect directly to me.

As the Spark documentation states, the workers must be able to reach the driver program:

Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you'd like to send requests to the cluster remotely, it's better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

The suggested approach is therefore to have a server process running on the cluster that listens for RPC calls and executes the Spark driver program locally on my behalf.

Does such a program already exist? Such a process should handle one or more RPCs, return exceptions, and manage logs.

Also, in that case, is it my local program or the Spark driver that has to create the SparkContext?

Note: I have a standalone cluster

Solution 1:

A simple way would be to use cluster mode (similar to --deploy-mode cluster) with the standalone cluster; however, the docs say:

Currently, standalone mode does not support cluster mode for Python applications.

Just a few options:

  • Connect to a cluster node using ssh, start screen, submit the Spark application, and come back later to check the results.
  • Deploy middleware like Job Server, Livy or Mist on your cluster, and use it for submissions (a Livy sketch follows after this list).
  • Deploy a notebook (Zeppelin, Toree) on your cluster and submit applications from the notebook.
  • Set a fixed spark.driver.port and ssh-forward all connections through one of the cluster nodes, using its IP as spark.driver.bindAddress (see the PySpark sketch after this list).
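
For the Livy option, here is a minimal sketch of a submission through Livy's REST batch API. The host livy-host, the default port 8998, and the HDFS path to your_app.py are assumptions for illustration; the application file has to be reachable by the cluster already, since your local machine behind the firewall cannot serve it.

import time

import requests

# Assumed endpoint: Livy running on the cluster, default port 8998.
LIVY_URL = "http://livy-host:8998"

# Submit a batch job. "file" must point to an application the cluster can read
# (e.g. a path on HDFS), not to a file on the local machine behind the firewall.
payload = {
    "file": "hdfs:///apps/your_app.py",
    "name": "submitted-from-behind-firewall",
    "args": ["--some-arg", "value"],
}
batch = requests.post(f"{LIVY_URL}/batches", json=payload).json()
print("Submitted batch", batch["id"], "state:", batch["state"])

# Poll until the batch reaches a terminal state; logs stay on the cluster and
# can be fetched separately through GET /batches/{id}/log.
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()["state"]
    if state in ("success", "error", "dead", "killed"):
        print("Final state:", state)
        break
    time.sleep(5)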

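For the last option, a minimal PySpark sketch follows. The master URL, the port numbers, and the cluster node IP 10.0.0.5 are placeholders, and the exact split of roles between spark.driver.host and spark.driver.bindAddress depends on how the SSH tunnel is set up, so treat this as a starting point rather than a recipe.

from pyspark.sql import SparkSession

# Before starting the driver, forward the fixed ports through a cluster node,
# e.g. with a remote tunnel started from the local machine (placeholder hosts):
#   ssh -N -R 35000:localhost:35000 -R 35001:localhost:35001 user@cluster-node

spark = (
    SparkSession.builder
    .master("spark://cluster-node:7077")            # standalone master
    .appName("driver-behind-firewall")
    .config("spark.driver.port", "35000")           # fixed so it can be forwarded
    .config("spark.blockManager.port", "35001")     # block manager traffic also needs forwarding
    .config("spark.driver.host", "10.0.0.5")        # address the workers connect back to
    .config("spark.driver.bindAddress", "0.0.0.0")  # address the driver binds to locally
    .getOrCreate()
)

# Quick sanity check that the executors can reach the driver.
print(spark.range(10).count())
spark.stop()

Note that in this setup the SparkContext is still created by your local program, which remains the driver; the cluster node only relays the connections.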