
Unable to start a job using spark-submit via ssh (on EC2)

I set up spark on a single EC2 machine and, when I am connected to it, I am able to use spark either with jupyter or spark-submit, without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.

So, to recap:

  • This works:

     ubuntu@ip-198-43-52-121:~$ spark-submit job.py
  • This does not work:

     ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"

Initially, I kept getting the following error message over and over:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
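(A quick way to see what the shell that runs the remote command actually has available is a one-liner like the one below; the key file and host are the same placeholders used elsewhere in this post:)

ssh -i file.pem ubuntu@blabla.compute.amazon.com 'echo $PATH; which python python3; echo PYSPARK_PYTHON=$PYSPARK_PYTHON'

The single quotes matter here: they make $PATH and $PYSPARK_PYTHON expand on the remote machine rather than locally.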

After having read many articles and posts about this issue, I thought the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:

export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7   # (this is where I unzipped the spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3

(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)

After all this, if I try to submit the spark job via ssh I get the following error message:

"command spark-submit not found". “找不到命令火花提交”。

As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all of the following:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"


ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE 


ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE


(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")

All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
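(A detail worth knowing here: on a stock Ubuntu image, ~/.bashrc typically starts with a guard roughly like the one below, so sourcing it from a non-interactive ssh command returns before it ever reaches export lines added at the end of the file. The exact guard varies between images, so treat this as a sketch:)

# near the top of Ubuntu's default ~/.bashrc: if not running interactively, do nothing
case $- in
    *i*) ;;        # interactive shell: keep reading the file
      *) return;;  # non-interactive (e.g. ssh host "cmd"): stop here
esac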

I have also tried providing the full path to spark-submit, running the following line:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"

In this case too I get, once again, the following message:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'

How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?

It's worth mentioning that I only got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.

Thanks a lot in advance :)

The problem was indeed with the way I was expecting the shell to work (which was wrong).

My issue was solved by:

  1. Setting my variables in .profile instead of .bashrc
  2. Providing the full path to python (see the sketch below)

Now I can launch spark jobs via ssh.
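A sketch of what those two points can look like in practice. The Spark path is the one from earlier in this post; putting the interpreter path into PYSPARK_PYTHON is one way to satisfy point 2, and /usr/bin/python3 is only an assumed location (check it with "which python3"):

# in /home/ubuntu/.profile
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/usr/bin/python3   # full path to the interpreter, not just "python3"

# then, from the local machine
ssh -i file.pem ubuntu@blabla.compute.amazon.com "spark-submit job.py"

If you would rather not touch the profile files, the interpreter can also be set per job, either inline in the remote command or through Spark's spark.pyspark.python configuration property:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "PYSPARK_PYTHON=/usr/bin/python3 /home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"

ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --conf spark.pyspark.python=/usr/bin/python3 job.py"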

I found the solution in the answer @VinkoVrsalovic gave to this post:

Why does an SSH remote command get fewer environment variables then when run manually?

Cheers
